path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2020-11-07 searchidx: put {ibx} into $sync state
This will allow reusability with ExtSearchIdx
2020-11-07 searchidxshard: special init for eidx
Having a special init path for external indices is probably easier than further overloading SearchIdx->new initialization to work without an Inbox object.
2020-11-07 searchidx: xref3 delete support
Not yet tested, but Perl compiles it!
2020-11-07 searchidx: index eidx_key as a boolean term
Using `O' (owner) here (according to Xapian Omega's termprefixes.rst) since we could say the newsgroup or inbox is the owner of the given message.
2020-11-07 inboxwritable: eidx_key for external index
This is preferable to open-coding "newsgroup // inboxdir" everywhere.
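The defined-or fallback named above can be illustrated with a minimal sketch. It is written in Python rather than the project's Perl, and the dict-based inbox and the function body are assumptions made for illustration, not the actual InboxWritable code. Perl's `//` tests definedness rather than truthiness, so an empty newsgroup string would still win over the directory:

```python
def eidx_key(ibx: dict) -> str:
    """Return the newsgroup if it is defined, else the inbox directory.

    Mirrors Perl's defined-or (`//`): only a missing/None newsgroup
    falls through to inboxdir; an empty string does not.
    """
    newsgroup = ibx.get("newsgroup")
    return newsgroup if newsgroup is not None else ibx["inboxdir"]
```

Centralizing the fallback in one helper means callers cannot disagree about which field takes precedence.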
2020-11-07 searchidx: introduce "xref3" concept
This will be used to track cross-posted messages in the external/detached index.
2020-11-07 searchidx: expose INDEXLEVELS as `our'
This will be used by external/detached indices, too.
2020-10-17 git: introduce async_wait_all
->cat_async and ->check_async may trigger each other (in future callers) while waiting, so we need a unified method to ensure both complete. This doesn't affect current code, but allows us to slightly simplify existing callers.
2020-09-29 searchidx: index lower-case List-Id value
We don't want a List-Id value being confused with a Xapian term prefix, here. Followup-to: 8b06cda3a3af3f0e ("mda: match List-Id insensitively")
2020-09-24 searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-09-03 disambiguate OverIdx and Over by field name
We'll use {oidx} as the common field name for the read-write OverIdx, here, to disambiguate it from the read-only {over} field. This hopefully makes it clearer which code paths are read-only and which are read-write.
2020-08-25 searchidx: croak for Xapian DB open failure
croak() can give more context on the failure, and setting `PERL5OPT=-MCarp=verbose' can force a stacktrace.
2020-08-23 mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order, without worrying about users eating up bandwidth and CPU cycles.
2020-08-23 searchidx: index THREADID in Xapian
This is the `tid' column from over.sqlite3, and will be used for IMAP and JMAP search (among other things).
2020-08-23 searchidx: put all shard-related stuff in SearchIdxShard.pm
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-20 init+index: support --skip-docdata for Xapian
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-10 searchidx: use singular `$opt' for consistency with v2
The rest of our indexing code uses `$opt' instead of `$opts'.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-08 support setting No_COW on Perl <5.22
fileno(DIRHANDLE) only works on Perl 5.22+, so we need to use dirfd(3) ourselves from Inline::C (or rely on chattr(1) being installed). While we're at it, rename `set_nodatacow' to `nodatacow_fd' for consistency with `nodatacow_dir'.
2020-08-07 searchidx: use Perl truthiness to detect XAPIAN_FLUSH_THRESHOLD
XAPIAN_FLUSH_THRESHOLD is a C string in the environment, so users may be tempted to assign an empty string in their shell, e.g. `XAPIAN_FLUSH_THRESHOLD= <command>' instead of using the `unset' POSIX shell built-in. With either a value of "0" or "" (empty string), Xapian will fall back to its default (10000 documents), which causes grief for memory-starved users.
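A minimal sketch of the truthiness check described above, written in Python rather than the project's Perl. The function name and the choice to return None for "not usefully set" are assumptions for illustration. Python treats the string "0" as true, so the two values Perl considers false have to be listed explicitly:

```python
def flush_threshold(env):
    """Return XAPIAN_FLUSH_THRESHOLD if it is set to a Perl-true value.

    Perl treats undef, "" and "0" as false, so all three are reported
    as None here and the caller can apply its own batching instead.
    """
    raw = env.get("XAPIAN_FLUSH_THRESHOLD")
    if raw is None or raw in ("", "0"):
        return None
    return raw
```

Passing `os.environ` as `env` would check the real environment; a plain dict works for testing.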
2020-08-07 index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even though it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. None of our code nor dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-02 searchidx: remove v1-only msg_mime sub
We can rely on the newer mids() sub directly and use faster numeric comparisons for Msgmap unindexing in v1.
2020-07-29 xapcmd: -xcpdb and -compact disable CoW, too
This gives an opportunity for users already suffering from CoW fragmentation to at least get the Xapian DBs off CoW. Aside from over.sqlite3 in v1, the SQLite DBs remain untouched; though VACUUM support may come in the future.
2020-07-29 searchidx: disable CoW for SQLite and Xapian under btrfs
SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
2020-07-29 v2writable: support async git blob retrievals
This seems to speed up --reindex on smallish v2 inboxes by about 30% on both HDD and SSD. lore/git (~1GB) on an SSD even gives a 30% improvement with 3 shards. I'm only seeing a ~4% speedup on LKML with a SATA SSD (which is difficult to repeat because it takes around 4 hours). Testing LKML on an HDD will take much more time...
2020-07-25 searchidx: $batch_cb => v1_checkpoint
Another closure gone, and we may be able to share more code with v2 in upcoming commits.
2020-07-25 searchidx: support async git check
This allows v1 indexing to run while the `cat-file --batch-check' process is waiting on high-latency storage.
2020-07-25 v2writable: share log2stack code with v1
Another step in making v1 and v2 more similar.
2020-07-25 index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
2020-07-25 searchidx: make v1 indexing closer to v2
We'll switch to using IdxStack here to ensure we get repeatable results and ascending THREADIDs according to git chronology. This means we'll need a two-pass reindex to index existing messages before indexing new messages. Since we no longer have a long-lived git-log process, we don't have to worry about old Xapian referencing the git-log pipe w/o FD_CLOEXEC, either.
2020-07-25 searchidx: rename _xdb_{acquire,release} => idx_
The "xdb" prefix was inaccurate since it's used by indexlevel=basic, which is Xapian-free. The '_' (underscore) prefix was also wrong for a method which is called across package boundaries.
2020-07-25 search: avoid copying {inboxdir}
Instead, storing {xdir} will allow us to avoid string concatenation in the read-only path and save us a little hash entry space.
2020-07-25 use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-25 index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17 search: simplify unindexing
Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17 searchidx: use v5.10.1, parent.pm, drop warnings
Prefer "parent" to "base" since the former is lighter and part of Perl 5.10+. We'll also rely on warnings from "-w" globally (or not) instead of via "use".
2020-07-17 with_umask: pass args to callback
While it makes the code flow slightly less well in some places, it saves us runtime allocations and indentation.
2020-06-25 lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-13 index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize the RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
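The size adjustment can be sketched as follows. This is an illustrative Python version, not the project's Perl, and the function name is hypothetical. It assumes the on-wire form turns every bare LF into CRLF and leaves existing CRLF pairs alone, so each bare LF adds exactly one byte:

```python
def crlf_adjusted_size(raw: bytes) -> int:
    """Size of `raw` after converting bare LF line endings to CRLF."""
    # Every b"\n" not already preceded by b"\r" gains one byte on the wire.
    bare_lf = raw.count(b"\n") - raw.count(b"\r\n")
    return len(raw) + bare_lf
```

Counting line endings avoids building a converted copy of the message just to measure it.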
2020-06-13 searchidx: v1 (re)-index uses git asynchronously
We can cleanup some of our v1 code slightly and let git do I/O+decoding in parallel. This gives a slight 2-4% re-indexing performance boost even on an SSD.
2020-06-13 search: index UID for IMAP search, too
We'll need to support searching UID ranges for IMAP, so make sure it's indexed, too.
2020-06-13 search: index byte size of a message for IMAP search
Searching for messages smaller than a certain size is allowed by offlineimap(1), mbsync(1), and possibly other tools. Maybe public-inbox-watch will support it, too. I don't see a reason to expose searching by size via WWW search right now (but maybe in the future, I could be convinced to). Note: we only store the byte-size of the message in git; this is typically LF-only and we won't have the correct size after CRLF conversion for NNTP or IMAP.
2020-06-05 searchidx: v1: fix retries when Xapian and Msgmap are out-of-sync
We forcibly stop git-log here, so erroring out on git-log close failures is wrong since it sees SIGPIPE. Noticed while reindexing a large v1 inbox for IMAP changes. Fixes: b32b47fb12a3043d ("index: "git log" failures are fatal")
2020-06-03 smsg: get rid of ->wrap initializer, too
We'll just use `bless' like most current PublicInbox::Smsg callers.
2020-06-03 smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-05-18 index: add --batch-size=SIZE option
On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
2020-05-17 descend into message/(rfc822|news|global) parts
Email::MIME never supported this properly, but there are real instances of forwarded messages as message/rfc822 attachments. message/news is a legacy thing which we'll see in archives, and message/global appears to be the new thing. gmime also supports message/rfc2822, so we'll support it anyways despite lacking other evidence of its existence. Existing attachments remain downloadable as a whole message, but individual attachments of subparts are now downloadable and can be displayed in HTML, too. Furthermore, ensure Xapian can now search for common headers inside those messages as well as the message bodies.
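To illustrate what descending into an embedded message means, here is a minimal sketch using Python's standard `email` package rather than PublicInbox::Eml. The function name and the choice to collect only Subject headers are assumptions for illustration; the project's own traversal and indexing are not shown here.

```python
from email import message_from_bytes, policy

# Container types named in the commit subject above.
EMBEDDED_TYPES = ("message/rfc822", "message/news", "message/global")

def embedded_subjects(raw: bytes) -> list:
    """Collect Subject headers of messages attached inside `raw`."""
    msg = message_from_bytes(raw, policy=policy.default)
    subjects = []
    for part in msg.walk():
        if part.get_content_type() not in EMBEDDED_TYPES:
            continue
        payload = part.get_payload()
        # The parser stores the embedded message as a one-element list.
        if isinstance(payload, list) and payload:
            subject = payload[0].get("Subject")
            if subject is not None:
                subjects.append(str(subject))
    return subjects
```

An indexer that stops at the attachment boundary would never see these headers, which is the gap the commit describes closing.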
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.