path: root/lib/PublicInbox/OverIdx.pm
Date  Commit message
2021-10-16  smsg: add ->oidbin method
This makes some of our code less noisy by reducing the amount of pack('H*', ...) use.
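As a rough illustration of what such a helper does, here is a minimal Python sketch of the hex-to-binary conversion that Perl's pack('H*', ...) performs on an object ID. The function name `oidbin` mirrors the method named in the commit subject, but this is not the project's Perl implementation, and the sample value is only a convenient 40-character hex string.

```python
import hashlib

def oidbin(oid_hex: str) -> bytes:
    """Return the raw bytes for a hexadecimal object ID."""
    return bytes.fromhex(oid_hex)

# SHA-1 of empty input, used here only as a sample 40-character hex string.
sample = hashlib.sha1(b"").hexdigest()
raw = oidbin(sample)

# A 40-character hex ID becomes 20 raw bytes.
print(len(sample), len(raw))
```

Keeping the conversion in one place means callers need not repeat the format string at every call site.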
2021-10-13  index: optimize after all SQLite DB commits
This covers v1 inboxes, as well. We also guard the execution since "PRAGMA optimize" was only introduced in SQLite 3.18.0 (2017-03-30).
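The version guard described above can be sketched as follows. This uses Python's sqlite3 module rather than the project's Perl code, and the function name `optimize_if_supported` and the table `msg` are hypothetical.

```python
import sqlite3

def optimize_if_supported(conn: sqlite3.Connection) -> bool:
    """Run PRAGMA optimize only if the linked SQLite is 3.18.0 or newer."""
    if sqlite3.sqlite_version_info >= (3, 18, 0):
        conn.execute("PRAGMA optimize")
        return True
    return False

conn = sqlite3.connect(":memory:")
# Hypothetical table, only so that there is something to commit.
conn.execute("CREATE TABLE msg (num INTEGER PRIMARY KEY, tid INTEGER)")
conn.execute("INSERT INTO msg (tid) VALUES (1)")
conn.commit()

# Optimize after the commit, as the commit message describes.
ran = optimize_if_supported(conn)
print("PRAGMA optimize executed:", ran)
```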
2021-10-12  extindex: avoid invalid blobs after unref
When unref-ing a blob from xref3, make sure the "preferred" smsg->{blob} doesn't point to the blob we just unrefed. This is necessary because we periodically checkpoint our extindex process to allow -watch and -mda processes to run. This also gets rid of a lot of redundant code for ->remove_xref3, since it's all handled in ExtSearchIdx, now.
2021-10-12  extindex: speed up --reindex --fast
This required some tweaking of xref3 indices in over.sqlite3, but the end result is it brings no-op "--reindex --fast --all" checks down to roughly 20 minutes (from 30-40 minutes) on lore/all. This is faster because a bunch of small SQLite queries are still slower en masse than a bunch of perlops. Despite the lack of IPC overhead, crossing .so boundaries and repeating lookups over btrees is still slower than doing the same with Perl hash tables.
2021-10-08  overidx: each_by_mid: account for messages being deleted
This may fix some extindex problems and should get rid of the "Can't bless non-reference value" errors.
2021-10-06  overidx: subject_path: allow non-ASCII char in subject matches
This should bring us closer to the "Base subject" definition in IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause some breakage (until --reindex). But for now, a reindex will prevent the non-ASCII subjects from being normalized to the same fuzzy "thread" in the thread view.
2021-10-05  overidx: update comment for new sub name
`shard_remove_eidx_info' was made unnecessary with commit 82b805db3ad9 (searchidxshard: IPC conversion, part 2, 2021-01-03) and we now call `remove_eidx_info' directly.
2021-08-11  treewide: use *nix-specific dirname regexps
None of our code elsewhere accounts for non-*nix pathnames and it's not worth our time to start. So stop wasting CPU cycles giving the illusion that we'd care about non-*nix pathnames.
2021-07-06  extindex: implement --dedupe to fix old extindices
This is intended to fix older indices that had deduplication bugs for matching content. It'll also make dealing with future changes to ContentHash easier since that's never guaranteed stable. It also supports --dry-run to print changes only without making them.
2021-05-04  lei index: new command to index mail w/o git storage
Since completely purging blobs from git is slow, users may wish to index messages in Maildirs (and eventually other local storage) without storing data in git. Much code from LeiImport and LeiInput is reused, and a new dummy FakeImport class supplies a non-storing $im->add and minimizes changes to LeiStore. The tricky part of this command is to support "lei import" after a message has gone through "lei index". Relying on $smsg->{bytes} == 0 (as we do for external-only vmd storage) does not work here, since it would break searching for "z:" byte-ranges when not using externals. This eventually required PublicInbox::Import::add to use a SharedKV to keep track of imported blobs and prevent duplication.
2021-04-03  lei: improve handling of Message-ID-less draft messages
We need a stable fallback time for digest2mid in the presence of messages without Received/Date headers. Furthermore, we must avoid using uninitialized smsg->{mid} when parsing References for draft replies.
2021-03-21  lei q: support vmd for external-only messages
"lei q" now preserves per-message keyword changes across invocations when its --output (Maildir or mbox) is reused (with or without --augment). In the future, these changes will be monitored via inotify, EVFILT_VNODE or IMAP IDLE, too. Unfortunately, this currently prevents "lei import" from ever importing a message that's in an external. That will be fixed in a future change.
2021-02-07  treewide: replace confess with croak
The PublicInbox::Eml (and previously Email::MIME) use of confess was the primary (or only) culprit behind the lei2mail segfaults fixed by commit 0795b0906cc81f40 ("ds: guard against stack-not-refcounted quirk of Perl 5"). We never care about a backtrace when dealing with Eml objects anyways, so it was just a worthless waste of CPU cycles. We can also drop confess in a few other places. Since we only use Perl and Inline::C, users will never be without source and can replace s/croak/Carp::confess/ on a per-callsite basis to help report problems. It's also possible to use PERL5OPT=-MCarp=verbose in the environment, though that is still potentially risky. Link: https://public-inbox.org/meta/20210201082833.3293-1-e@80x24.org/
2021-01-24  smsg: make parse_references an object method
Having parse_references in OverIdx was awkward and Smsg is a better place for it.
2021-01-21  overidx: eidx_prep: fix leftover dbh reference
Leaving $dbh in another field was causing over.sqlite3 to remain open after ->dbh_close. Fix up some minor style issues while we're at it.
2021-01-01  update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01  lei_store: handle messages without Message-ID at all
For personal mail, unsent draft messages are a common source of messages without Message-IDs.
2020-12-31  Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-27  extindex: --watch for inotify-based updates
This reuses existing InboxIdle infrastructure to update external indices based on per-inbox updates. This is an alternative to auto-updating external indices via the -index command and also works with existing uses of -mda and public-inbox-watch. Using inotify (or EVFILT_VNODE) allows watching thousands of inboxes without having to scan every single one at every invocation. This is especially beneficial in cases where an external index is not writable to the users writing to per-inbox indices.
2020-12-19  lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-17  extindex: preliminary --reindex support
--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-08  overidx: wrap eidx_key => ibx_id mapping
This makes things a little less noisy and will be called by ExtSearchIdx.
2020-12-07  overidx: {num} column is INTEGER PRIMARY KEY
INTEGER PRIMARY KEY can be an alias for ROWID in SQLite and is already unique, so there's no need for a separate UNIQUE(num) index. With a smallish ~3K, freshly indexed v2 inbox, this results in a ~40K space savings, reducing over.sqlite3 from 1.375M to 1.335M (post-VACUUM). This only affects newly-indexed inboxes; existing DBs will require manual intervention to take advantage of space savings. Link: https://www.sqlite.org/rowidtable.html
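The rowid aliasing described above can be checked with a small Python sketch. The two-column table `msg` is hypothetical; the real over.sqlite3 schema has more columns and is created by the project's Perl code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-column table standing in for the real schema.
conn.execute("CREATE TABLE msg (num INTEGER PRIMARY KEY, tid INTEGER)")
conn.execute("INSERT INTO msg (num, tid) VALUES (42, 1)")

# `num` aliases the rowid, so both names return the same value.
rowid, num = conn.execute("SELECT rowid, num FROM msg").fetchone()

# No separate index object is stored for the table.
indexes = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'msg'"
).fetchall()

# Uniqueness is still enforced without a UNIQUE(num) constraint.
try:
    conn.execute("INSERT INTO msg (num, tid) VALUES (42, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rowid, num, indexes, duplicate_rejected)
```

Because the table's own b-tree is keyed on the rowid, lookups by `num` need no extra index, which is where the space savings come from.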
2020-12-05  over: ensure old, merged {tid} is really gone
We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-11-29  extindex: fix delete (`d') handling
We need to completely remove a message from over.sqlite3 and Xapian when no references remain, otherwise users will still see the removed messages in NNTP overviews and WWW search results/summaries. References to messages are now solely handled by the `xref3' table of over.sqlite3. We can also trust `xref3' when deciding whether to remove only the "O$eidx_key" and "G$lid" terms from a document in Xapian or to remove the entire Xapian document.
2020-11-28  nntp: xref: use ->ALL extindex if available
Getting Xref for cross-posted messages is an O(n) operation where `n' is the number of newsgroups on the server. This works acceptably when there are dozens of groups, but would be unacceptable when there are tens of thousands of newsgroups. With ~140 newsgroups, a lore.kernel.org mirror already handles "XHDR Xref $MESSAGE_ID" requests around 30% faster after creating the xref3.idx_nntp index. The SQL additions to ExtSearch.pm may be a bit strange and seem more appropriate for Over.pm; however it currently makes sense to me since those bits of over.sqlite3 access are exclusive to ExtSearch and can't be used by traditional v1/v2 inboxes...
2020-11-07  extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07  over: store xref3 data in over.sqlite3
We may not end up storing xref3 data in Xapian, actually. This will make indexlevel=basic possible, and along with --sequential-shard indexing support for slow storage. Making oidmap a separate table seems unnecessary, too, so fold it into the xref3 table since it's unlikely a git blob will be responsible for multiple xref3 rows.
2020-11-07  overidx: introduce changes for external index
Since external indices won't have msgmap.sqlite3, we'll need to store last_commit-* metadata in over.sqlite3 instead. This has longer limits to account for path names or newsgroup names stored in keys. We'll also rely on built-in counters for Xapian document IDs, since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
2020-09-03  overidx: document column uses
This may be useful for keeping our heads on straight dealing with IMAP, NNTP, JMAP, etc.
2020-08-27  overidx: inline create_ghost sub
There's no need for this to be a separate sub since there's only a single caller. This saves a few kilobytes at least in short-lived processes.
2020-08-27  over*: use v5.10.1, drop warnings
v5.10.1 lets us use the lighter parent.pm instead of base.pm, and we'll rely on the shebang to enable warnings (or not). While we're in the area, drop a no-longer-necessary import for PublicInbox::Search, since OverIdx doesn't require search.
2020-08-27  over: rename ->disconnect to ->dbh_close
Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-27  over: rename ->connect method to ->dbh
`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and DBD::SQLite(3pm) manpages.
2020-08-26  over+msgmap: respect WAL journal_mode if set
WAL actually seems to have ideal locking characteristics given concurrency problems I'm experiencing with --reindex running in parallel with expensive read-only SQLite queries: <https://public-inbox.org/meta/20200825001204.GA840@dcvr/> Unfortunately, we cannot blindly use WAL while preserving compatibility with existing setups nor our guarantees that read-only daemons are indeed "read-only". However, respect a user's choice to set WAL on their own if they're comfortable with giving -nntpd/-httpd/-imapd processes write permission to the directory storing SQLite DBs.
2020-08-23  searchidx: index THREADID in Xapian
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-07  index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even if it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. None of our code nor dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-02  remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-07-26  overidx: fix compatibility with current versions
We still need to use SQL_BLOB to ensure existing versions of public-inbox can read over.sqlite3 because they're still using {sqlite_unicode}. This partially reverts commit e9fc1290ead44e06d20ff58e0a6acb5306d4fbe2. Fixes: e9fc1290ead44e06 ("over: unset sqlite_unicode attribute")
2020-07-25  index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
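On the SQLite side, one common way to skip syncing to disk is the `synchronous` pragma. The Python sketch below shows that pragma only; whether the flag maps to exactly this setting in the project's Perl code is an assumption, and the helper name `open_db` is hypothetical.

```python
import os
import sqlite3
import tempfile

def open_db(path, no_fsync=False):
    """Open a SQLite DB, optionally turning off syncing for faster writes."""
    conn = sqlite3.connect(path)
    if no_fsync:
        # Assumed mapping: trade durability on power loss for speed.
        conn.execute("PRAGMA synchronous = OFF")
    return conn

path = os.path.join(tempfile.mkdtemp(), "example.sqlite3")
conn = open_db(path, no_fsync=True)

# PRAGMA synchronous reports 0 when syncing is off.
level = conn.execute("PRAGMA synchronous").fetchone()[0]
print("synchronous level:", level)
```

With syncing off, a crash of the operating system or a power failure during indexing can corrupt the database, so this only suits data that can be rebuilt by reindexing.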
2020-07-25  index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17  search: simplify unindexing
Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17  overidx: favor non-OO sub dispatch for internal subs
OO method dispatch was 10-15% slower when I was implementing the NNTP server. It also serves as a helpful reminder to the reader at the callsite as to whether a sub is likely in the same package as the caller or not.
2020-07-17  overidx: each_by_mid: pass self and args to callbacks
This saves runtime allocations and reduces the likelihood of memory leaks either from cycles or buggy old Perl versions.
2020-07-14  over+msgmap: do not store filename after DBI->connect
SQLite already knows the filename internally, so avoid having it as a long-lived Perl SV to save some bytes when there's many inboxes and open DBs.
2020-07-14  over: unset sqlite_unicode attribute
None of the human-readable strings stored in over.sqlite3 require UTF-8. Message-IDs do not, nor do the compressed Subject IDs (sid) we use for Subject-based threading. And the `ddd' (doc-data-deflated) column is of course binary data. This frees us of having to use SQL_BLOB for the `ddd' column, and will open the door for us to use dbh_new for Msgmap, too.
2020-07-02  overidx: document why we don't use SQLite WAL
I was wondering about this myself the other day and had to read up on it. So make a note of it for future readers.
2020-06-03  smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-06-03  smsg: get rid of remaining {mime} users
We'll let $smsg->populate take care of everything all at once without hanging onto the header object for too long.
2020-05-12  overidx: document the SQLite PRAGMA we use
This ought to prevent cargo-culting the cache_size PRAGMA into smaller SQLite DBs we might use.