Date | Commit message |
|
Noticed while tracking down fast-import crash bug report.
Link: https://public-inbox.org/meta/CAL_JsqK7P4gjLPyvzxNEcYmxT4j6Ah5f3Pz1RqDHxmysTg3aEg@mail.gmail.com/
|
|
These may've been causing strange errors[1] in t/imapd.t from
the -watch daemon, such as:
Cannot copy to HASH in scalar assignment ../PublicInbox/Over.pm
in the Over->dbh() sub. I've only noticed this failure on
FreeBSD 13.2 (Perl 5.32.1, DBD::SQLite 1.72 (bundled SQLite
3.39.4), DBI 1.643) so far, so it could also be something to do
with the versions used and/or memory layout differences
with libc or build toolchain.
|
|
Unlike `die', `croak' can be expanded to `confess' to give a
full backtrace. We'll use `confess' on transaction failures
since that occasionally causes sporadic t/imapd.t failures on
FreeBSD (IO::Kqueue is installed, so signals are deferred).
|
|
We can't use $DBD::SQLite::sqlite_version_number with older versions of
DBD::SQLite. Thus we need to treat the $DBD::SQLite::sqlite_version
string (e.g. "3.8.3", not v-string) and convert it to a v-string with
eval for version comparisons to determine if we can fork multiple
children when using SQLite.
Fixes: fa04201baae9 ("lei: force --jobs=1,1 for SQLite < 3.8.3")
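
The commit itself is Perl and converts the string to a v-string with eval. As a rough illustration of the same idea, here is a minimal Python sketch that compares dotted version strings numerically rather than as text. The helper names `parse_version` and `can_fork_children` are hypothetical and do not come from public-inbox.

```python
import sqlite3

def parse_version(s):
    """Turn a dotted version string such as "3.8.3" into a tuple of ints."""
    return tuple(int(part) for part in s.split("."))

def can_fork_children(version_string, minimum="3.8.3"):
    """Return True if this SQLite version is at least `minimum`.

    A plain string comparison would be wrong here: "3.10.0" sorts
    before "3.8.3" as text, but is the newer release.
    """
    return parse_version(version_string) >= parse_version(minimum)

# sqlite3.sqlite_version is the version string of the linked SQLite library.
print(can_fork_children(sqlite3.sqlite_version))
```

Comparing integer tuples avoids the pitfall where multi-digit components sort incorrectly as strings.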
|
|
This is likely more familiar to readers of TAP (Test Anything
Protocol) output, as well as shell and Perl scripters, who also
use `#' for comments.
AFAIK, nobody is parsing our stderr, and I'm not sure how
standardized the `I:' prefix is (nor how `W:' and `E:' are). It's
already the prevailing style in Lei* code, too, so things have
been moving in that direction for a bit.
|
|
SQLite prior to 3.8.3 did not reset its PRNG for generating
unique temporary file names, so it would barf on t/lei-up.t
occasionally due to O_EXCL -> EEXIST conflicts.
This fixes occasional test failures under CentOS 7.x which ships
SQLite 3.7.17.
|
|
This makes some of our code less noisy by reducing the
amount of pack('H*', ...) use.
|
|
This covers v1 inboxes, as well. We also guard the execution
since "PRAGMA optimize" was only introduced in SQLite 3.18.0
(2017-03-30).
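
A version guard of this kind might look like the following Python sketch. It is an illustration only; the project's code is Perl, and the function name `optimize_if_supported` is hypothetical.

```python
import sqlite3

def optimize_if_supported(conn):
    """Run "PRAGMA optimize" only when the linked SQLite is >= 3.18.0."""
    if sqlite3.sqlite_version_info >= (3, 18, 0):
        conn.execute("PRAGMA optimize")
        return True
    return False  # older SQLite: skip the pragma

conn = sqlite3.connect(":memory:")
ran = optimize_if_supported(conn)
conn.close()
```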
|
|
When unref-ing a blob from xref3, make sure the "preferred"
smsg->{blob} doesn't point to the blob we just unrefed. This
is necessary because we periodically checkpoint our extindex
process to allow -watch and -mda processes to run.
This also gets rid of a lot of redundant code for ->remove_xref3,
since it's all handled in ExtSearchIdx, now.
|
|
This required some tweaking of xref3 indices in over.sqlite3,
but the end result is it brings no-op "--reindex --fast --all"
checks down to roughly 20 minutes (from 30-40 minutes) on
lore/all.
This is faster because a bunch of small SQLite queries are still
slower en masse than a bunch of perlops. Despite the lack of IPC
overhead, crossing .so boundaries and repeating lookups over
btrees is still slower than doing the same with Perl hash tables.
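
The trade-off described above can be sketched in Python as follows. This is not the actual change, which is Perl and is not shown in this log. The table `xref3_demo` and its columns are hypothetical stand-ins. Both functions return the same answer. The second replaces many small queries with one scan into an in-memory hash table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; not the real xref3 schema.
conn.execute("CREATE TABLE xref3_demo (docid INTEGER, oidhex TEXT)")
conn.executemany("INSERT INTO xref3_demo VALUES (?, ?)",
                 [(i, "%040x" % i) for i in range(1000)])

def lookup_each(conn, docids):
    """One small SQLite query per docid (repeated btree lookups)."""
    found = 0
    for d in docids:
        cur = conn.execute(
            "SELECT oidhex FROM xref3_demo WHERE docid = ?", (d,))
        if cur.fetchone() is not None:
            found += 1
    return found

def lookup_bulk(conn, docids):
    """One scan into a dict, then pure in-memory lookups."""
    table = dict(conn.execute("SELECT docid, oidhex FROM xref3_demo"))
    return sum(1 for d in docids if d in table)

wanted = range(0, 2000, 2)
print(lookup_each(conn, wanted), lookup_bulk(conn, wanted))
```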
|
|
This may fix some extindex problems and should get rid of
the "Can't bless non-reference value" errors.
|
|
This should bring us closer to the "Base subject" definition in
IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause
some breakage (until --reindex). But for now, a reindex will
prevent non-ASCII subjects from being normalized to the
same fuzzy "thread" in the thread view.
|
|
`shard_remove_eidx_info' was made unnecessary with commit
82b805db3ad9 (searchidxshard: IPC conversion, part 2, 2021-01-03)
and we now call `remove_eidx_info' directly.
|
|
None of our code elsewhere accounts for non-*nix pathnames and
it's not worth our time to start. So stop wasting CPU cycles
giving the illusion that we'd care about non-*nix pathnames.
|
|
This is intended to fix older indices that had deduplication
bugs for matching content. It'll also make dealing with
future changes to ContentHash easier since that's never
guaranteed stable.
It also supports --dry-run to print changes only without
making them.
|
|
Since completely purging blobs from git is slow, users may wish
to index messages in Maildirs (and eventually other local
storage) without storing data in git.
Much code from LeiImport and LeiInput is reused, and a new dummy
FakeImport class supplies a non-storing $im->add and minimizes
changes to LeiStore.
The tricky part of this command is to support "lei import"
after a message has gone through "lei index". Relying on
$smsg->{bytes} == 0 (as we do for external-only vmd storage)
does not work here, since it would break searching for "z:"
byte-ranges when not using externals.
This eventually required PublicInbox::Import::add to use a
SharedKV to keep track of imported blobs and prevent
duplication.
|
|
We need a stable fallback time for digest2mid in the presence
of messages without Received/Date headers. Furthermore, we
must avoid using uninitialized smsg->{mid} when parsing
References for draft replies.
|
|
"lei q" now preserves changes to per-message keywords across
invocations when its --output (Maildir or mbox) is reused
(with or without --augment).
In the future, these changes will be monitored via inotify,
EVFILT_VNODE or IMAP IDLE, too.
Unfortunately, this currently prevents "lei import" from ever
importing a message that's in an external. That will be fixed
in a future change.
|
|
The PublicInbox::Eml (and previously Email::MIME) use of confess
was the primary (or only) culprit behind the lei2mail segfaults
fixed by commit 0795b0906cc81f40
("ds: guard against stack-not-refcounted quirk of Perl 5").
We never care about a backtrace when dealing with Eml objects
anyways, so it was just a worthless waste of CPU cycles.
We can also drop confess in a few other places. Since we only
use Perl and Inline::C, users will never be without source
and can replace s/croak/Carp::confess/ on a per-callsite basis
to help report problems.
It's also possible to use PERL5OPT=-MCarp=verbose in the
environment, though that is still potentially risky.
Link: https://public-inbox.org/meta/20210201082833.3293-1-e@80x24.org/
|
|
Having parse_references in OverIdx was awkward and Smsg is
a better place for it.
|
|
Leaving $dbh in another field was causing over.sqlite3 to
remain open after ->dbh_close. Fix up some minor style
issues while we're at it.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
For personal mail, unsent draft messages are a common source of
messages without Message-IDs.
|
|
* origin/master: (58 commits)
ds: flatten + reuse @events, epoll_wait style fixes
ds: simplify EventLoop implementation
check defined return value for localized slurp errors
import: check for git->qx errors, clearer return values
git: qx: avoid extra "local" for scalar context case
search: remove {mset} option for ->mset method
search: remove pointless {relevance} setting
miscsearch: take reopen from Search and use it
extsearch: unconditionally reopen on access
extindex: allow using --all without EXTINDEX_DIR
extindex: add undocumented --no-scan switch
extindex: enable autoflush on STDOUT/STDERR
extindex: various --watch signal handling fixes
extindex: --watch for inotify-based updates
eml: fix undefined vars on <Perl 5.28
t/config: test --get-urlmatch for git <2.26
default to CORE::warn in $SIG{__WARN__} handlers
inbox: name variable for values loop iterator
inboxidle: avoid needless syscalls on refresh
inboxidle: clue users into resolving ENOSPC from inotify
...
|
|
This reuses existing InboxIdle infrastructure to update external
indices based on per-inbox updates. This is an alternative to
auto-updating external indices via the -index command and also
works with existing uses of -mda and public-inbox-watch.
Using inotify (or EVFILT_VNODE) allows watching thousands of
inboxes without having to scan every single one at every
invocation.
This is especially beneficial in cases where an external index
is not writable to the users writing to per-inbox indices.
|
|
Still unstable, this builds off the equally unstable extindex :P
This will be used for caching/memoization of traditional mail
stores (IMAP, Maildir, etc) while providing indexing via Xapian,
along with compression, and checksumming from git.
Most notably, this adds the ability to add/remove per-message
keywords (draft, seen, flagged, answered) as described in the
JMAP specification (RFC 8621 section 4.1.1).
We'll use `.' (a single period) as an $eidx_key since it's an
invalid {inboxdir} or {newsgroup} name.
|
|
--reindex allows us to catch missed and stale messages due to
-extindex vs -index races prior to commit 02b2fcc46f364b51
("extsearchidx: enforce -index before -extindex").
We'll also rely on reindex to internally deal with v1/v2 inbox
removals and partial-unindexing of messages which are only
removed from one inbox out of many.
This reindex design is completely different than how normal
v1/v2 inbox reindex operates due to extindex having multiple
histories to work with. Instead of scanning git history, this
relies exclusively on comparing over.sqlite3 contents between
the v1/v2 inboxes and the extindex.
Changes to Xapian behavior also get picked up, now. Xapian indexing
is handled by workers with minimal IPC to the parent process.
This results in more read I/O but fewer writes when dealing
with cross-posted messages.
Changes to $smsg->populate and --rethread still need further
work.
|
|
This makes things a little less noisy and will be
called by ExtSearchIdx.
|
|
INTEGER PRIMARY KEY can be an alias for ROWID in SQLite and is
already unique, so there's no need for a separate UNIQUE(num)
index.
With a smallish ~3K, freshly indexed v2 inbox, this results in a
~40K space savings, reducing over.sqlite3 from 1.375M to 1.335M
(post-VACUUM).
This only affects newly-indexed inboxes; existing DBs will
require manual intervention to take advantage of space savings.
Link: https://www.sqlite.org/rowidtable.html
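
The rowid aliasing can be observed with a small Python sketch. The table `over_demo` and its columns are hypothetical and much simpler than the real over.sqlite3 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: `num` is declared INTEGER PRIMARY KEY, with no
# separate UNIQUE(num) constraint.
conn.execute("CREATE TABLE over_demo (num INTEGER PRIMARY KEY, payload BLOB)")
conn.execute("INSERT INTO over_demo (num, payload) VALUES (?, ?)", (42, b"x"))

# `num` is an alias for the rowid, so both columns hold the same value.
rowid, num = conn.execute("SELECT rowid, num FROM over_demo").fetchone()

# Uniqueness is already enforced without any extra index.
try:
    conn.execute("INSERT INTO over_demo (num, payload) VALUES (?, ?)",
                 (42, b"y"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# No index is listed for the table: the table's own btree is keyed by rowid.
indexes = conn.execute("PRAGMA index_list('over_demo')").fetchall()
```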
|
|
We must use the result of link_refs() since it can trigger
merge_threads() and invalidate $old_tid. In case
merge_threads() isn't triggered, link_refs() will return
$old_tid anyways.
When rethreading and allocating new {tid}, we also must update
the row where the now-expired {tid} came from to ensure only the
new {tid} is seen when reindexing subsequent messages in
history. Otherwise, every subsequently reindexed+rethreaded
message could end up getting a new {tid}.
Reported-by: Kyle Meyer <kyle@kyleam.com>
Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
|
|
We need to completely remove a message from over.sqlite3 and
Xapian when no references remain, otherwise users will still see
the removed messages in NNTP overviews and WWW search
results/summaries.
References to messages are now solely handled by the `xref3'
table of over.sqlite3. We can also trust `xref3' when deciding
whether to remove only the "O$eidx_key" and "G$lid" terms from a
document in Xapian or to remove the entire Xapian document.
|
|
Getting Xref for cross-posted messages is an O(n) operation
where `n' is the number of newsgroups on the server. This works
acceptably when there are dozens of groups, but would be
unacceptable when there are tens of thousands of newsgroups.
With ~140 newsgroups, a lore.kernel.org mirror already handles
"XHDR Xref $MESSAGE_ID" requests around 30% faster after
creating the xref3.idx_nntp index.
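
As a rough illustration of why an index helps here, the Python sketch below builds a hypothetical, simplified table and asks SQLite for its query plan. The column names and the indexed column are assumptions made for this example. Only the index name `idx_nntp` comes from the message above, and the real xref3 layout is not shown in this log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns: a document ID, an inbox ID, and an article number.
conn.execute(
    "CREATE TABLE xref3_demo (docid INTEGER, ibx_id INTEGER, xnum INTEGER)")
conn.execute("CREATE INDEX idx_nntp ON xref3_demo (docid)")
conn.executemany("INSERT INTO xref3_demo VALUES (?, ?, ?)",
                 [(1, 10, 100), (1, 11, 7), (2, 10, 101)])

def xref_rows(conn, docid):
    """Fetch every (inbox, article number) pair for one document."""
    cur = conn.execute(
        "SELECT ibx_id, xnum FROM xref3_demo WHERE docid = ? ORDER BY ibx_id",
        (docid,))
    return cur.fetchall()

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT ibx_id, xnum FROM xref3_demo WHERE docid = ?",
    (1,)).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

With the index in place, the plan searches by key instead of scanning every row, so the cost no longer grows with the total number of rows in the table.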
The SQL additions to ExtSearch.pm may be a bit strange and
seem more appropriate for Over.pm; however it currently makes
sense to me since those bits of over.sqlite3 access are
exclusive to ExtSearch and can't be used by traditional
v1/v2 inboxes...
|
|
We can now handle cases where messages are edited in one inbox
but not another, bifurcating the message.
V2Writable::log_range handles some edge-cases which could happen
in v2-only code paths, as well, but weren't usually triggered
due to default git-gc knobs not pruning immediately.
|
|
We may not end up storing xref3 data in Xapian, actually.
This will make indexlevel=basic possible, along with
--sequential-shard indexing support for slow storage.
Making oidmap a separate table seems unnecessary, too, so
fold it into the xref3 table since it's unlikely a git blob
will be responsible for multiple xref3 rows.
|
|
Since external indices won't have msgmap.sqlite3, we'll need to
store last_commit-* metadata in over.sqlite3 instead. This
has longer limits to account for path names or newsgroup names
stored in keys.
We'll also rely on built-in counters for Xapian document IDs,
since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
|
|
This may be useful for keeping our heads on straight dealing
with IMAP, NNTP, JMAP, etc.
|
|
There's no need for this to be a separate sub since there's
only a single caller. This saves a few kilobytes at least
in short-lived processes.
|
|
v5.10.1 lets us use the lighter parent.pm instead of base.pm,
and we'll rely on the shebang to enable warnings (or not).
While we're in the area, drop a no-longer-necessary import for
PublicInbox::Search, since OverIdx doesn't require search.
|
|
Since we got rid of over->connect, `disconnect' no longer pairs
with it. So name it after the `close(2)' syscall it ultimately
issues.
|
|
`->connect' is confused with the perlfunc for the `connect(2)'
syscall, and also `DBI->connect'. Since SQLite doesn't use
sockets, the word "connect" needlessly confuses me. Give
it a short name to match the field name we use for it, which
also matches the variable name used by the DBI(3pm) and
DBD::SQLite(3pm) manpages.
|
|
WAL actually seems to have ideal locking characteristics given
concurrency problems I'm experiencing with --reindex running
in parallel with expensive read-only SQLite queries:
<https://public-inbox.org/meta/20200825001204.GA840@dcvr/>
Unfortunately, we cannot blindly use WAL while preserving
compatibility with existing setups nor our guarantees that
read-only daemons are indeed "read-only".
However, respect a user's choice to set WAL on their
own if they're comfortable with giving -nntpd/-httpd/-imapd
processes write permission to the directory storing SQLite DBs.
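
The behavior described above relies on WAL being a persistent property of the database file. The Python sketch below illustrates that under stated assumptions: the helper `open_db` and the file name `demo.sqlite3` are hypothetical, and the project's own code is Perl. The opener never sets a journal mode itself. It only reports whatever the administrator chose earlier.

```python
import os
import sqlite3
import tempfile

def open_db(path):
    """Open a DB without changing journal_mode; report the mode in effect."""
    conn = sqlite3.connect(path)
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    return conn, mode

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.sqlite3")

# An administrator opts in to WAL once, out of band.
admin = sqlite3.connect(path)
admin.execute("PRAGMA journal_mode = WAL")
admin.execute("CREATE TABLE t (x)")
admin.commit()
admin.close()

# A later opener leaves the setting alone and simply sees WAL in effect.
conn, mode = open_db(path)
print(mode)
conn.close()
```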
|
|
This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).
|
|
We'll continue supporting `--no-sync' even if it has yet to make
it into a release, but the term `sync' is overloaded in our
codebase which may be confusing to new hackers and users.
None of our code nor dependencies issue the sync(2) syscall,
either, only fsync(2) and fdatasync(2).
|
|
We used ->header_obj in the past as an optimization with
Email::MIME. That optimization is no longer necessary
with PublicInbox::Eml.
This doesn't make any functional difference even if we were to
go back to Email::MIME. However, it reduces the amount of code
we have and slightly reduces allocations with PublicInbox::Eml.
|
|
We still need to use SQL_BLOB to ensure existing versions of
public-inbox can read over.sqlite3 because they're still using
{sqlite_unicode}. This partially reverts commit
e9fc1290ead44e06d20ff58e0a6acb5306d4fbe2.
Fixes: e9fc1290ead44e06 ("over: unset sqlite_unicode attribute")
|
|
This allows us to speed up indexing operations to SQLite
and Xapian.
Unfortunately, it doesn't affect operations using
`xapian-compact' and the compactor API, since that doesn't seem
to support Xapian::DB_NO_SYNC, yet.
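
On the SQLite side, one standard knob for skipping synchronization is "PRAGMA synchronous". Whether this commit uses exactly that pragma is an assumption here, since the message does not say. A minimal Python sketch, with the hypothetical helper `open_without_sync`:

```python
import os
import sqlite3
import tempfile

def open_without_sync(path):
    """Open a DB and turn off synchronous writes (faster, less durable)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA synchronous = OFF")
    return conn

path = os.path.join(tempfile.mkdtemp(), "nosync-demo.sqlite3")
conn = open_without_sync(path)
# Reading the pragma back yields an integer level; OFF is 0.
level = conn.execute("PRAGMA synchronous").fetchone()[0]
conn.close()
```

The setting applies per connection, so each process that wants the speedup has to request it when it opens the database.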
|
|
Older versions of public-inbox < 1.3.0 had subtly
different semantics around threading in some corner
cases. This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
|
|
Since over.sqlite3 seems here to stay, we no longer need to do
Message-ID lookups against Xapian and can simply rely on the
docid <=> NNTP article number equivalency SCHEMA_VERSION=15
gave us.
This rids us of the closure-using batch_do sub in the v1
code path and vastly simplifies both v1 and v2 unindexing.
|
|
OO method dispatch was 10-15% slower when I was implementing the
NNTP server. It also serves as a helpful reminder to the reader
at the callsite as to whether a sub is likely in the same
package as the caller or not.
|
|
This saves runtime allocations and reduces the likelihood of
memory leaks either from cycles or buggy old Perl versions.
|