path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2020-08-07 index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even if it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. None of our code nor dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-02 searchidx: remove v1-only msg_mime sub
We can rely on the newer mids() sub directly and use faster numeric comparisons for Msgmap unindexing in v1.
2020-07-29 xapcmd: -xcpdb and -compact disable CoW, too
This gives an opportunity for users already suffering from CoW fragmentation to at least get the Xapian DBs off CoW. Aside from over.sqlite3 in v1, the SQLite DBs remain untouched; though VACUUM support may come in the future.
2020-07-29 searchidx: disable CoW for SQLite and Xapian under btrfs
SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
2020-07-29 v2writable: support async git blob retrievals
This seems to speed up --reindex on smallish v2 inboxes by about 30% on both HDD and SSD. lore/git (~1GB) on an SSD even gives a 30% improvement with 3 shards. I'm only seeing a ~4% speedup on LKML with a SATA SSD (which is difficult to repeat because it takes around 4 hours). Testing LKML on an HDD will take much more time...
2020-07-25 searchidx: $batch_cb => v1_checkpoint
Another closure gone, and we may be able to share more code with v2 in upcoming commits.
2020-07-25 searchidx: support async git check
This allows v1 indexing to run while the `cat-file --batch-check' process is waiting on high-latency storage.
2020-07-25 v2writable: share log2stack code with v1
Another step in making v1 and v2 more similar.
2020-07-25 index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
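As a rough illustration of what skipping fsync can look like on the SQLite side, here is a Python sketch (not public-inbox's Perl code). That `--no-sync' maps to SQLite's `synchronous' pragma is an assumption, and `open_index_db' and its `fsync' parameter are hypothetical names.

```python
import sqlite3


def open_index_db(path, fsync=True):
    """Open an SQLite DB, optionally skipping fsync on commit.

    Hypothetical helper: mapping --no-sync to this pragma is an
    assumption, not something the commit message states.
    """
    conn = sqlite3.connect(path)
    if not fsync:
        # synchronous=OFF hands writes to the OS without waiting for fsync(2)
        conn.execute("PRAGMA synchronous = OFF")
    return conn
```

Data written this way can be lost on power failure, which is tolerable only because the indices can be regenerated from git.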
2020-07-25 searchidx: make v1 indexing closer to v2
We'll switch to using IdxStack here to ensure we get repeatable results and ascending THREADIDs according to git chronology. This means we'll need a two-pass reindex to index existing messages before indexing new messages. Since we no longer have a long-lived git-log process, we don't have to worry about old Xapian referencing the git-log pipe w/o FD_CLOEXEC, either.
2020-07-25 searchidx: rename _xdb_{acquire,release} => idx_
The "xdb" prefix was inaccurate since it's used by indexlevel=basic, which is Xapian-free. The '_' (underscore) prefix was also wrong for a method which is called across package boundaries.
2020-07-25 search: avoid copying {inboxdir}
Instead, storing {xdir} will allow us to avoid string concatenation in the read-only path and save us a little hash entry space.
2020-07-25 use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-25 index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17 search: simplify unindexing
Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17 searchidx: use v5.10.1, parent.pm, drop warnings
Prefer "parent" to "base" since the former is lighter and part of Perl 5.10+. We'll also rely on warnings from "-w" globally (or not) instead of via "use".
2020-07-17 with_umask: pass args to callback
While it makes the code flow slightly less well in some places, it saves us runtime allocations and indentation.
2020-06-25 lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-13 index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize the RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
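The CRLF adjustment described above can be sketched in a few lines of Python (an illustration only, not the project's Perl code). `crlf_adjusted_size' is a hypothetical name; it computes how large a message would be once every bare LF is sent as CRLF.

```python
def crlf_adjusted_size(raw: bytes) -> int:
    """Size of `raw` once every bare LF is transmitted as CRLF.

    Hypothetical helper for illustration; existing CRLF pairs are
    left alone and only bare LFs gain one extra byte each.
    """
    bare_lf = raw.count(b"\n") - raw.count(b"\r\n")
    return len(raw) + bare_lf
```

For an LF-only message the result equals `len(raw.replace(b"\n", b"\r\n"))`, without building the converted copy.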
2020-06-13 searchidx: v1 (re)-index uses git asynchronously
We can cleanup some of our v1 code slightly and let git do I/O+decoding in parallel. This gives a slight 2-4% re-indexing performance boost even on an SSD.
2020-06-13 search: index UID for IMAP search, too
We'll need to support searching UID ranges for IMAP, so make sure it's indexed, too.
2020-06-13 search: index byte size of a message for IMAP search
Searching for messages smaller than a certain size is allowed by offlineimap(1), mbsync(1), and possibly other tools. Maybe public-inbox-watch will support it, too. I don't see a reason to expose searching by size via WWW search right now (but maybe in the future, I could be convinced to). Note: we only store the byte-size of the message in git; this is typically LF-only and we won't have the correct size after CRLF conversion for NNTP or IMAP.
2020-06-05 searchidx: v1: fix retries when Xapian and Msgmap are out-of-sync
We forcibly stop git-log here, so erroring out on git-log close failures is wrong since it sees SIGPIPE. Noticed while reindexing a large v1 inbox for IMAP changes. Fixes: b32b47fb12a3043d ("index: "git log" failures are fatal")
2020-06-03 smsg: get rid of ->wrap initializer, too
We'll just use `bless' like most current PublicInbox::Smsg callers.
2020-06-03 smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-05-18 index: add --batch-size=SIZE option
On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
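To show why a byte-based batch size gives finer lock granularity than waiting for a whole epoch, here is a Python sketch (not the project's Perl code). It assumes SIZE is a byte count and that the lock is released at each batch boundary; `batches' is a hypothetical name.

```python
def batches(sizes, batch_size):
    """Group message indices into batches of roughly batch_size bytes.

    Hypothetical sketch: `sizes` is a sequence of message sizes in
    bytes, and a batch is closed as soon as its total reaches the limit.
    """
    batch, total = [], 0
    for i, nbytes in enumerate(sizes):
        batch.append(i)
        total += nbytes
        if total >= batch_size:
            # a real indexer would commit and release its lock here,
            # giving other writers a chance to run
            yield batch
            batch, total = [], 0
    if batch:
        yield batch  # final partial batch
```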
2020-05-17 descend into message/(rfc822|news|global) parts
Email::MIME never supported this properly, but there are real instances of forwarded messages as message/rfc822 attachments. message/news is a legacy thing which we'll see in archives, and message/global appears to be the new thing. gmime also supports message/rfc2822, so we'll support it anyways despite lacking other evidence of its existence. Existing attachments remain downloadable as a whole message, but individual attachments of subparts are now downloadable and can be displayed in HTML, too. Furthermore, ensure Xapian can now search for common headers inside those messages as well as the message bodies.
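For readers unfamiliar with nested messages, the following Python sketch uses the standard `email' package (not PublicInbox::Eml) to descend into message/* parts and collect the headers an indexer might want to search. `nested_subjects' is a hypothetical name, and picking the Subject header is only an example.

```python
from email import message_from_string


def nested_subjects(msg):
    """Return Subject headers of messages attached as message/* parts.

    Hypothetical helper: Python's parser exposes a message/rfc822 part
    as a container whose payload is a list holding the inner message.
    """
    found = []
    for part in msg.walk():
        if part.get_content_maintype() == "message" and part.is_multipart():
            for inner in part.get_payload():
                found.append(inner.get("Subject"))
    return found
```

Calling `nested_subjects(message_from_string(raw))` on a forwarded mail returns the subjects of the attached messages, while the outer message's own Subject is left out.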
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-05-09 msg_iter: pass $idx as a scalar, not array
This doesn't make any difference for most multipart messages (or any single part messages). However, this starts having space savings when parts start nesting. It also slightly simplifies callers.
2020-05-09 search: support searching on List-Id
We'll support both probabilistic matches via `l:' and boolean matches via `lid:' for exact matches, similar to how both `m:' and `mid:' are supported. Only text inside angle braces (`<' and `>') is supported, since I'm not sure if there's value in searching on the optional phrases (which would require decoding with ->header_str instead of ->header_raw).
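Extracting only the bracketed part of a List-Id header could look like the Python sketch below (an illustration, not the project's Perl code). `list_ids' and `LIST_ID_RE' are hypothetical names, and the regular expression is an assumption about what counts as the ID.

```python
import re

# Hypothetical pattern: capture whatever sits between "<" and ">"
LIST_ID_RE = re.compile(r"<([^<>]+)>")


def list_ids(header_values):
    """Return the bracketed IDs from raw List-Id header values.

    The optional descriptive phrase before "<" is ignored, so no
    RFC 2047 decoding is needed.
    """
    ids = []
    for raw in header_values:
        ids.extend(LIST_ID_RE.findall(raw))
    return ids
```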
2020-04-21 index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, since the MTA is bypassed in a git-only delivery path, a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
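A size cap of this kind can be sketched in Python as follows (not the project's Perl code). The k/m/g suffix handling mirrors git-config integer conventions and is an assumption here; `parse_size' and `should_index' are hypothetical names.

```python
# Assumed suffixes, following git-config integer conventions
_SUFFIX = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Parse a size such as "512", "64k" or "1m" into bytes."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        return int(text[:-1]) * _SUFFIX[text[-1]]
    return int(text)


def should_index(nbytes, max_size=None):
    """Skip blobs larger than max_size; no limit when max_size is None."""
    return max_size is None or nbytes <= max_size
```

Checking the blob size before parsing means an oversized message never reaches the MIME parser at all.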
2020-04-19 reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remove From_ lines in the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19 searchidx: die on cat-file failures
We always use the object ID from "git <log|rev-list>" for retrieving blobs, so fail loudly if the git repository is corrupt instead of silently continuing.
2020-04-09 triewyde: ficks soem speling errrors
Dikshunarees R gude!
2020-04-05 release large (non ref) scalars using `undef $sv'
Using `undef EXPR' like a function call actually frees the heap memory associated with the scalar, whereas `$sv = undef' or `$sv = ""' will hold the buffer around until $sv goes out of scope. The `sv_set_undef' documentation in the perlapi(1) manpage explicitly states this: The perl equivalent is "$sv = undef;". Note that it doesn't free any string buffer, unlike "undef $sv". And I've confirmed by reading Dump() output from Devel::Peek. We'll also inline the old index_body sub in SearchIdx.pm to make the scope of the scalar more obvious. This change saves several hundred kB RSS on both -index and -httpd when hitting large emails with thousands of lines.
2020-04-03 quiet "Complex regular subexpression recursion limit" warnings
These seem mostly harmless since Perl will just truncate the match and start a new one on a newline boundary in our case. The only downside is we'd end up with redundant <span> tags in HTML. Limiting the number of lines matched ourselves with `{1,$NUM}' doesn't seem prudent since lines vary in length, so we continue to defer the job of limiting matches to the Perl regexp engine. I've noticed this warning in practice on 100K+ line patches to locale data.
2020-04-02 searchidx: v1: skip mid_clean on mid_mime results
We do not need to run mid_clean() since mid_mime() uses mids() to extract the msgid from inside the angle brackets.
2020-03-29 searchidxshard: ensure we set indexlevel on shard[0]
For sharded v2 repositories with few-enough messages, it is possible for shard[0] to go unused and never trigger the ->commit_txn_lazy to set the indexlevel field in Xapian metadata. So set it immediately at initialization and avoid this case. While we're at it, avoid triggering needless pwrite syscalls from ->set_metadata by checking with ->get_metadata, first.
2020-03-22 *idx: pass smsg in even more places
We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22 *idx: pass $smsg in more places instead of many args
We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22 overidx: parse_references: less error-prone args
Favor `$smsg->{mid}' instead of `$mid0' to reduce parameters down-the-line, but favor passing the Email::MIME::Header object around instead of relying on the bloat-prone `$smsg->{mime}' and calling ->header_obj on it.
2020-03-22 smsg: to_doc_data: use existing fields
No need to pass extra parameters to this method, since smsg has universal meanings for {blob} and {mid}.
2020-03-22 rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-22 index: use git commit times on missing Date/Received
When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
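The fallback can be illustrated with Python's standard `email' package (a sketch, not the project's Perl code). Only the Date: header is handled here; `datestamp' and its `git_time' parameter are hypothetical names, with `git_time' standing in for a Unix timestamp taken from the git commit object.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def datestamp(msg, git_time):
    """Return the Date: header as a Unix timestamp, else git_time.

    Hypothetical helper: a missing or unparsable Date: header falls
    back to the time git recorded, never to the current time.
    """
    raw = msg.get("Date")
    if raw:
        try:
            return int(parsedate_to_datetime(raw).timestamp())
        except (TypeError, ValueError):
            pass  # unparsable header: use the git time below
    return git_time
```

Because the fallback comes from the commit object, re-indexing a mirror produces the same timestamps on every run.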
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-27 searchidx: don't assume "a/" and "b/" as prefixes
Some people use "--{src,dst}-prefix=", try to deal with those since git-apply can handle them when called by solver.
2020-01-27 searchidx: skip filenames on "diff --git ..."
We already capture filenames on the lines beginning with "---" and "+++", so it's redundant work to capture filenames from "diff --git ..." lines.
2020-01-27 search: {version} => {ibx_ver}
We don't want to confuse human readers with the Xapian schema version. We also want to make it obvious this is the version of the inbox we're indexing, since these are Search or SearchIdx objects, not Inbox objects.