path: root/lib
Date Commit message
2020-08-02 nntp: fix STAT command
The return value of art_lookup changed but this command wasn't updated since it wasn't tested. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
2020-08-01 improve error handling on import fork / lock failures
v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
v2 changes:
- spawn: show failing command
- ensure waitpid is synchronous for inotify events
- teardown all fast-import processes on exception, not just the failing one
- beef up lock_release error handling
- release lock on fast-import spawn failure
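The rule that lock release must not be skipped on failure can be illustrated with a minimal Python sketch. This is an analog only: public-inbox is written in Perl, and `with_lock` and its parameters are hypothetical names, not part of the project.

```python
import fcntl
import os


def with_lock(lock_path, fn, *args):
    """Run fn(*args) while holding an exclusive flock on lock_path."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        return fn(*args)
    finally:
        # Closing the descriptor drops the flock, so the lock is
        # released even when fn raises (e.g. on a spawn failure).
        os.close(fd)
```

Putting the release in `finally` means every exit path, including an exception raised while spawning a child process, gives the lock back.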
2020-08-01 www: rework async_* to use method table
Although the ->async_next method does not take $self as a receiver, but rather a PublicInbox::HTTP object, we may still retrieve it to be called with the HTTP object via UNIVERSAL->can.
2020-07-31 lock: show failure path
This ought to be useful for diagnosing bugs in -watch.
2020-07-30 msgmap: disable CoW for tmp_clone, too
The temporary clone starts as large as the full msgmap and deletes will write to it randomly. So ensure it doesn't get fragmented and slower as time goes on.
2020-07-30 wwwlisting: fix grep call for match=domain filtering
The grep call in list_match_domain_i returns true for all inboxes, even ones without a URL that matches the regular expression, because the qr value passed to grep is not surrounded by slashes. Add them. Fixes: 1988d730c0088e8b (config: support multi-value inbox.*.*url)
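In Perl, `grep EXPR, LIST` evaluates EXPR once per element, and a bare compiled regexp is simply a true value, so nothing gets filtered. A rough Python analog of the same mistake follows; the URLs and variable names are made up for illustration.

```python
import re

urls = ["http://example.com/a", "http://example.org/b"]
pattern = re.compile(r"example\.com")

# Buggy analog: the predicate only tests the pattern object itself,
# which is always truthy, so every URL is kept.
buggy = [u for u in urls if pattern]

# Fixed analog: the pattern is actually applied to each URL.
fixed = [u for u in urls if pattern.search(u)]
```

Here `buggy` keeps both URLs while `fixed` keeps only the one on example.com.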
2020-07-29 emergency: create full path to PI_EMERGENCY
It's possible for ~/.public-inbox/ to not exist if PI_CONFIG points to an alternate location. Only noticed from the previous patch fixing t/init.t behavior.
2020-07-29 xapcmd: -xcpdb and -compact disable CoW, too
This gives an opportunity for users already suffering from CoW fragmentation to at least get the Xapian DBs off CoW. Aside from over.sqlite3 in v1, the SQLite DBs remain untouched; though VACUUM support may come in the future.
2020-07-29 searchidx: disable CoW for SQLite and Xapian under btrfs
SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
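For readers unfamiliar with the No_COW attribute, the following is a best-effort Python sketch of setting it on a directory, equivalent in intent to `chattr +C`. It is not the project's Perl code. The ioctl request numbers are an assumption (the values commonly seen on 64-bit Linux), and `try_set_nocow` is a hypothetical name. Under btrfs, files created in such a directory afterwards inherit the attribute.

```python
import fcntl
import os
import struct

# Assumed values: Linux ioctl numbers as commonly defined on 64-bit platforms
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the attribute `chattr +C` sets


def try_set_nocow(dirpath):
    """Best-effort: set No_COW on dirpath; return True on success."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("I", 0))
        (flags,) = struct.unpack("I", buf)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("I", flags | FS_NOCOW_FL))
        return True
    except OSError:
        # Filesystem does not support the ioctl or the flag; carry on.
        return False
    finally:
        os.close(fd)
```

Failure is treated as non-fatal because the attribute is only an optimization for regeneratable data.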
2020-07-29 v2writable: use {inboxdir} for msgmap->tmp_clone
Otherwise, a user is more likely to remove the msgmap-XXXXXXXX SQLite file from $TMPDIR and cause SQLite to error out.
2020-07-29 v2writable: support async git blob retrievals
This seems to speed up --reindex on smallish v2 inboxes by about 30% on both HDD and SSD. lore/git (~1GB) on an SSD even gives a 30% improvement with 3 shards. I'm only seeing a ~4% speedup on LKML with a SATA SSD (which is difficult to repeat because it takes around 4 hours). Testing LKML on an HDD will take much more time...
2020-07-26 imap: introduce and use Git->async_prefetch
We can keep the git process more active by sending another request to it while fetch_run_ops() is running. This parallelization speeds up mutt's initial FETCH for headers by around 35%(!).
2020-07-26 index: --compact respects --jobs
And -compact supports --jobs=0 like -index to disable parallel execution. Running three xapian-compact processes in parallel on a USB 2.0 HDD is pretty painful.
2020-07-26 overidx: fix compatibility with current versions
We still need to use SQL_BLOB to ensure existing versions of public-inbox can read over.sqlite3 because they're still using {sqlite_unicode}. This partially reverts commit e9fc1290ead44e06d20ff58e0a6acb5306d4fbe2. Fixes: e9fc1290ead44e06 ("over: unset sqlite_unicode attribute")
2020-07-25 v2writable: {unindexed} belongs in $sync state
There's no reason for {unindexed} to persist beyond an ->index_sync call.
2020-07-25 searchidx: $batch_cb => v1_checkpoint
Another closure gone, and we may be able to share more code with v2 in upcoming commits.
2020-07-25 searchidx: support async git check
This allows v1 indexing to run while the `cat-file --batch-check' process is waiting on high-latency storage.
2020-07-25 v2writable: share log2stack code with v1
Another step in making v1 and v2 more similar.
2020-07-25 index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
2020-07-25 searchidx: make v1 indexing closer to v2
We'll switch to using IdxStack here to ensure we get repeatable results and ascending THREADIDs according to git chronology. This means we'll need a two-pass reindex to index existing messages before indexing new messages. Since we no longer have a long-lived git-log process, we don't have to worry about old Xapian referencing the git-log pipe w/o FD_CLOEXEC, either.
2020-07-25 searchidx: rename _xdb_{acquire,release} => idx_
The "xdb" prefix was inaccurate since it's used by indexlevel=basic, which is Xapian-free. The '_' (underscore) prefix was also wrong for a method which is called across package boundaries.
2020-07-25 xapcmd: set {from} properly for v1 inboxes
This was a bug. I'm not sure where it matters yet, but it may matter in the future.
2020-07-25 v2writable: clarify "epoch" comment
2020-07-25 v2writable: get rid of {reindex_pipe} field
Since normal per-epoch indexing no longer holds a "git log" process open, we don't need to worry about not sharing the pipe with forked shards when we restart the indexer. While we're in the area, better describe what `unindex' does, since it's a rarely-used but necessary code path.
2020-07-25 v2writable: use read-only PublicInbox::Git for cat_file
We can reduce the number of parameters we pass around on stack and make our read-write and read-only code paths more uniform.
2020-07-25 search: avoid copying {inboxdir}
Instead, storing {xdir} will allow us to avoid string concatenation in the read-only path and save us a little hash entry space.
2020-07-25 use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-25 v2writable: drop "EPOCH.git indexing $RANGE" progress
It'll be one continuous range with IdxStack.
2020-07-25 v2writable: allow >= 40 byte git object IDs
Another step in slowly updating our code to support SHA-256 or whatever other hash algorithms git may support in the future.
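As an illustration of what accepting ">= 40" means in practice, a pattern can match 40 or more hex digits instead of exactly 40 (40 is the SHA-1 length, 64 the SHA-256 length). This Python sketch is an assumption about the general approach, not a copy of the project's Perl regexp; `OID_RE` and `is_oid` are hypothetical names.

```python
import re

# Accept 40 or more lowercase hex digits rather than exactly 40.
OID_RE = re.compile(r"\A[a-f0-9]{40,}\Z")


def is_oid(s):
    """Return True if s looks like a full hex git object ID."""
    return OID_RE.match(s) is not None
```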
2020-07-25 v2writable: move {autime} and {cotime} into $sync state
The V2Writable object may be long-lived, so it makes more sense to put the {autime} and {cotime} fields into the shorter-lived index_sync state.
2020-07-25 v2writable: index_sync: reduce fill_alternates calls
Instead of doing fill_alternates for every epoch we're indexing, just do it once at the start of index_sync invocation. This will set us up for using a single "git cat-file" process for indexing multiple epochs.
2020-07-25 v2writable: introduce idx_stack
This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.
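The idea of a temporary file of fixed-size records that is read back in reverse can be sketched in Python as follows. The record layout (two timestamps and a raw 20-byte object ID) and the class name are hypothetical, chosen only to show why fixed-size records make reverse traversal cheap; they are not taken from the Perl source.

```python
import struct
import tempfile

# Hypothetical layout: author time, commit time, raw 20-byte object ID
REC = struct.Struct(">QQ20s")


class IdxStackSketch:
    def __init__(self):
        self.fh = tempfile.TemporaryFile()
        self.count = 0

    def push(self, autime, cotime, oid_hex):
        """Append one fixed-size record to the end of the file."""
        self.fh.seek(0, 2)
        self.fh.write(REC.pack(autime, cotime, bytes.fromhex(oid_hex)))
        self.count += 1

    def pop_all(self):
        """Yield records last-pushed first by seeking to fixed offsets."""
        for i in range(self.count - 1, -1, -1):
            self.fh.seek(i * REC.size)
            autime, cotime, raw = REC.unpack(self.fh.read(REC.size))
            yield autime, cotime, raw.hex()
```

Because every record has the same size, record `i` always starts at offset `i * REC.size`, so newest-first input can be replayed oldest-first without holding it all in memory.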
2020-07-25 v2: index forwards (via `git log --reverse')
Since we'll need to expose THREADID to JMAP and IMAP users, index all messages in the order they were committed to ensure our `tid' (thread ID) column ascends in mirrors the same way it does in the source inbox. This drastically simplifies our code but increases memory usage of `git-log'. The next commit will bring memory use back down at the expense of $TMPDIR usage.
2020-07-25 index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-18 msgmap: fix atfork_* callbacks
Noticed while reindexing a largish v2 inbox in parallel on an SSD which required checkpointing and respawning shard workers. Fixes: f06e84220e5566e7 ("over+msgmap: do not store filename after DBI->connect")
2020-07-17 v2writable: git_hash_raw: avoid $TMPDIR write
We can rely on FD_CLOEXEC being set by default (since Perl 5.6+) on pipes to avoid FS/page-cache traffic, here. We also know "git hash-object" won't output anything until it's consumed all of its standard input; so there's no danger of a deadlock even in the unlikely case git uses a hash that can't fit into PIPE_BUF :P
2020-07-17 search: simplify unindexing
Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17 searchidx: use v5.10.1, parent.pm, drop warnings
Prefer "parent" to "base" since the former is lighter and part of Perl 5.10+. We'll also rely on warnings from "-w" globally (or not) instead of via "use".
2020-07-17 overidx: favor non-OO sub dispatch for internal subs
OO method dispatch was 10-15% slower when I was implementing the NNTP server. It also serves as a helpful reminder to the reader at the callsite as to whether a sub is likely in the same package as the caller or not.
2020-07-17 overidx: each_by_mid: pass self and args to callbacks
This saves runtime allocations and reduces the likelihood of memory leaks either from cycles or buggy old Perl versions.
2020-07-17 with_umask: pass args to callback
While it makes the code flow slightly less well in some places, it saves us runtime allocations and indentation.
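The calling convention described here, passing arguments through to the callback instead of capturing them in a closure, looks roughly like this in Python. This is an analog of the pattern only; the real `with_umask` is a Perl sub and its exact signature is not shown in this log.

```python
import os


def with_umask(mask, cb, *args):
    """Call cb(*args) under a temporary umask, then restore the old one."""
    old = os.umask(mask)
    try:
        return cb(*args)
    finally:
        os.umask(old)
```

A caller can then write `with_umask(0o027, os.mkdir, path)` with a named function rather than allocating an anonymous wrapper just to carry `path`.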
2020-07-17 import: use common capitalization for filtering headers
In case this ends up in the same process as Mbox::msg_hdr, it can reduce memory use by sharing the cache key in PublicInbox::Eml::re_memo
2020-07-17 drop binmode usage
We only support Unix-like platforms where binmode (":raw") is the default anyways, and v5.10 semantics means it won't do unicode_strings (unlike v5.12). So save some lines of code.
2020-07-17 v2: use v5.10.1, parent.pm, drop warnings
The "5.010_001" form was for Perl 5.6, which I doubt anybody would attempt; so favor "v5.10.1" as it is more readable to humans. Prefer "parent" to "base" since the former is lighter. We'll also rely on warnings from "-w" globally (or not) instead of via "use". We'll also update "use" statements to reflect what's actually used by V2Writable.
2020-07-17 config: reject `\n' in `inboxdir'
"\n" and other characters requiring quoting and/or escaping in $GIT_DIR/objects/info/alternates were not supported in git 2.11 and earlier; nor do they seem supported at all in libgit2. This will allow us to support sharing git-cat-file or similar endpoints across multiple inboxes via alternates. This breaks an existing use case for anybody wacky enough to put `\n' in the `inboxdir' pathname; but I doubt this affects anybody.
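A validation of this kind amounts to a one-line check at config load time. The Python sketch below shows the shape of such a check; `check_inboxdir` and the error wording are hypothetical and are not taken from the project.

```python
def check_inboxdir(path):
    """Reject pathnames that cannot be written to an alternates file as-is."""
    if "\n" in path:
        # One alternates entry per line, so a newline would split the entry.
        raise ValueError("inboxdir must not contain a newline: %r" % path)
    return path
```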
2020-07-14 over+msgmap: do not store filename after DBI->connect
SQLite already knows the filename internally, so avoid having it as a long-lived Perl SV to save some bytes when there's many inboxes and open DBs.
2020-07-14 nntpd+imapd: detect unlinked msgmap
While it's even less common to experience a replaced msgmap.sqlite3 file, BOFHs may do the darndest things. This is another step towards reducing the number of needless wakeups we need to do in long-lived read-only daemons.
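One common way to detect that an open file has been unlinked or replaced is to compare the device and inode of the open handle with those of the current pathname. The Python sketch below illustrates that technique on a plain file; it is an assumption about the general approach, not the project's Perl code, and `was_replaced` is a hypothetical name.

```python
import os


def was_replaced(fh, path):
    """True if path no longer refers to the file open on fh."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return True  # unlinked with no replacement
    cur = os.fstat(fh.fileno())
    return (st.st_dev, st.st_ino) != (cur.st_dev, cur.st_ino)
```

A long-lived daemon can run such a check when it is notified of a change and reopen only when the comparison fails, instead of reopening on a timer.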
2020-07-14 over: unset sqlite_unicode attribute
None of the human-readable strings stored in over.sqlite3 require UTF-8. Message-IDs do not, nor do the compressed Subject IDs (sid) we use for Subject-based threading. And the `ddd' (doc-data-deflated) column is of course binary data. This frees us of having to use SQL_BLOB for the `ddd' column, and will open the door for us to use dbh_new for Msgmap, too.
2020-07-14 xapcmd: delay over->check_inodes trigger
We must not trigger wakeups on InboxIdle users until after we've renamed all files into place. Otherwise, the InboxIdle caller may just reopen the old (soon-to-be-unlinked) file. This fixes occasional test failures in t/nntpd.t. Fixes: f977826a17f8735e ("lock: reduce inotify wakeups")
2020-07-13 imap: SEARCH fails more gracefully in non-slice mailbox
Instead of returning "BAD program fault", just give the standard "BAD search not available"... message we show for mailbox slices.