path: root/lib
Date Commit message
2021-01-01 sharedkv: split out index_values
In most cases, we won't need to index by value, so don't waste cycles or space on it.
2021-01-01 sharedkv: fork()-friendly key-value store
This is intended for maintaining Maildir states and mbox message deduplication, but may be useful for other purposes...
2021-01-01 lei_to_mail: initial implementation for writing mbox formats
No Maildir support yet, but it'll come.
2021-01-01 revert "lei_store: use per-machine refname as git HEAD"
In retrospect, per-machine HEADs were a bad idea because users of removable storage would be thrown off when moving storage between different machines. This is only a partial revert; the Import::init_bare change to support alternate head names still exists because we may use it for other reasons.
2021-01-01 lei_store: use per-machine refname as git HEAD
It may be helpful to identify the source of messages and perhaps avoid conflicting history. On the other hand, this may be a terrible idea for users who move portable storage (e.g. USB sticks) across computers...
2021-01-01 import: respect init.defaultBranch
This matches git v2.28.0+ behavior in case users prefer a different name.
2020-12-31 Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-31 lei: rename proposed "query" command to "q", add JSON output
Using "query" as a verb may be confusing when we'll also refer to queries as nouns with the "<ls|rm|mv>-query" sub-commands. "query" is also many characters to type without tab-completion on what I expect to be one of the most commonly used sub-commands. Furthermore, "q" is also the common query parameter name used by our PSGI interface, as is the case with several major web search engines; so there's an element of familiarity there. The name "search" was disregarded because "show" could be a commonly used lei sub-command, too, and typing "se" for tab-completion may be slow since two-handed typists on QWERTY keyboards won't be able to use alternating hands. "f" or "find" could be a possibility here, too; but we're currently using the term "forget" as a weaker version of "remove" or "rm", though "ignore" could be substituted for "forget", perhaps... Kyle Meyer noted the lack of (proposed) JSON output support so that's been added to the proposed UI.
2020-12-31 lei_xsearch: cross-(inbox|extindex) search
While a single extindex combines multiple inboxes into a single search index, extindex still requires up-front indexing on items which can be searched. XSearch has no on-disk footprint itself and uses Xapian DBs of existing publicinbox and extindex ("extinbox") exclusively. XSearch still suffers from the multi-shard Xapian scalability problems which led to the creation of extindex, but I expect the number of shards to remain relatively low. I envision users hosting public-inbox instances on their workstations will only have two extindex combined by this, one read-only extindex for serving public archives, and one read-write extindex managed by LeiStore for private mail.
2020-12-28 ds: flatten + reuse @events, epoll_wait style fixes
Consistently returning the equivalent of pollfd.revents in a portable manner was never worth the effort for us, as we use the same ->event_step callback regardless of POLLIN/POLLOUT/POLLHUP. Being a Perl array, @events knows its size and we don't have to return a maximum index for the caller to iterate on. We can also avoid redundant integer coercion ("+0") since we ensure everything is an IV in other places. Finally, vec() is preferable to ("\0" x $size) for resizing buffers because it only needs to write the extended portion and not overwrite the entire buffer.
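The buffer-resizing point can be illustrated with a minimal sketch. The buffer contents and $size here are made up for illustration and are not taken from the ds code.

```perl
use strict;
use warnings;

my $size = 8; # hypothetical target size in bytes

# Rebuilding from scratch writes all $size bytes.
my $fresh = "\0" x $size;

# Growing an existing buffer: storing into the last wanted byte makes Perl
# extend the string with zero bytes, leaving the bytes already there alone.
my $buf = "\1\2\3";
vec($buf, $size - 1, 8) = 0;

printf "fresh=%d bytes, buf=%d bytes\n", length($fresh), length($buf);
```

Both strings end up $size bytes long, but only the vec() variant keeps the existing prefix.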
2020-12-28 ds: simplify EventLoop implementation
More importantly, make it easier to find the sub by avoiding runtime manipulation of subroutine names. There's no point in avoiding a potential call to _InitPoller in EventLoop since entering EventLoop is rare. On the contrary, PublicInbox::DS->new is called often and this change to avoid entering _InitPoller there may have more benefits (which may still be unmeasurable).
2020-12-28 check defined return value for localized slurp errors
Reading from regular files (even on STDIN) can fail when dealing with flakey storage.
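A minimal sketch of checking a slurp for errors, assuming "localized slurp" refers to reading with a localized $/. The slurp helper and the temporary file are hypothetical and only serve to make the example runnable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file, dying if the read itself fails.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    # With $/ undefined, one readline returns the whole file. An empty file
    # yields '' (defined), so undef here signals a read error.
    my $data = do { local $/; <$fh> };
    defined($data) or die "read $path: $!";
    $data;
}

my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print $tmp_fh "hello\n";
close($tmp_fh) or die "close: $!";
my $content = slurp($tmp_path);
print $content;
```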
2020-12-28 import: check for git->qx errors, clearer return values
Those git commands can fail and git->qx will set $? when it fails. There's no need for the extra indirection of the @ret array, either. Improve git->qx coverage to check for $? while we're at it.
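The $? check can be sketched with Perl's builtin qx operator. This is only an analogy: it assumes PublicInbox::Git->qx reports failure through $? the same way, and the child command is a stand-in, not a real git invocation.

```perl
use strict;
use warnings;

my $perl = $^X; # path of the running perl, used as a stand-in child command

# The child prints to stdout and then exits non-zero.
my $out = qx($perl -e "print 42; exit 3");
my $status = $?; # save immediately, before anything else can change it

if ($status) {
    my $code = $status >> 8; # exit code lives in the high byte
    warn "command failed with exit code $code\n";
}
```

Output may still be returned when the command fails, so $? has to be checked separately from the return value.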
2020-12-28 git: qx: avoid extra "local" for scalar context case
We can use the ternary operator to avoid an early return here.
2020-12-28 search: remove {mset} option for ->mset method
The ->mset method always returns a Xapian mset nowadays, so naming a parameter {mset} is too confusing. As it does with MiscSearch, setting the {relevance} parameter to -1 now sorts by ascending docid order. -2 is now supported for descending docid order, too, since it may be useful for lei users.
2020-12-28 search: remove pointless {relevance} setting
SearchView will set it to `undef', others will set the 'mset' option (for the ->mset method :P) to 2 which causes {relevance} to be ignored. And the 'mset' option is poorly named now that the method is named ->mset...
2020-12-28 miscsearch: take reopen from Search and use it
As with ExtSearch, MiscSearch lacks a janky cleanup timer of PublicInbox::Inbox objects, leading to info about inboxes/newsgroups going stale. Fortunately, we don't use MiscSearch very heavily, yet. In the future, we may be able to detect new inboxes without having to SIGHUP or restart daemons using MiscSearch.
2020-12-28 extsearch: unconditionally reopen on access
Since ExtSearch lacks the janky cleanup timer of PublicInbox::Inbox objects, its search results get stale. Reopen the Xapian DB on every ->search call for now, as reducing reopen calls doesn't seem worth the complexity. The Xapian::Database::reopen operation itself takes only ~50us on my old workstation with 3 shards totaling <200GB. Other parts of Xapian dominate the search time, so the reopen seems inconsequential with single-digit shard counts.
2020-12-27 extindex: add undocumented --no-scan switch
This makes diagnosing --watch problems easier when there are 50K inboxes by avoiding the lengthy scan (which is the reason --watch exists in the first place).
2020-12-27 extindex: various --watch signal handling fixes
We need to clobber the SIGUSR1 resync queue on SIGHUP to invalidate old inbox objects. Furthermore, the lengthy initial scan needs to ignore signals intended for the event loop to avoid unexpected behavior. Finally, add some progress output to inform users on the terminal.
2020-12-27 extindex: --watch for inotify-based updates
This reuses existing InboxIdle infrastructure to update external indices based on per-inbox updates. This is an alternative to auto-updating external indices via the -index command and also works with existing uses of -mda and public-inbox-watch. Using inotify (or EVFILT_VNODE) allows watching thousands of inboxes without having to scan every single one at every invocation. This is especially beneficial in cases where an external index is not writable to the users writing to per-inbox indices.
2020-12-26 eml: fix undefined vars on <Perl 5.28
Encode::MIME::Header::_decode_octets did not correctly default to Encode::FB_DEFAULT until Encode 2.93 (perl5.git commit 0c541dc5633a341cf44b818014b58e7f8be532e9). Provide the default again to work with older Perls.
Reported-by: Ali Alnubani <alialnu@nvidia.com>
Link: https://public-inbox.org/meta/DM6PR12MB49106F8E3BD697B63B943A22DADB0@DM6PR12MB4910.namprd12.prod.outlook.com/
Tested-by: Ali Alnubani <alialnu@nvidia.com>
2020-12-26 default to CORE::warn in $SIG{__WARN__} handlers
As with CORE::die and $SIG{__DIE__}, it turns out CORE::warn is safe to use inside $SIG{__WARN__} handlers without triggering infinite recursion. So fall back to reusing CORE::warn instead of creating a new sub.
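A minimal sketch of the fallback pattern, assuming Perl 5.16 or later so that \&CORE::warn can be taken as a reference. The @log array and the "example: " prefix are hypothetical.

```perl
use strict;
use warnings;

my @log;

# Reuse an existing handler if there is one, otherwise the builtin itself,
# rather than wrapping warn in a new anonymous sub.
my $warn_cb = $SIG{__WARN__} // \&CORE::warn;

local $SIG{__WARN__} = sub {
    push @log, "@_"; # hypothetical bookkeeping
    # The __WARN__ hook is disabled while a handler runs, so this prints to
    # STDERR instead of recursing back into the handler.
    $warn_cb->('example: ', @_);
};

warn "something happened\n";
```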
2020-12-26 inbox: name variable for values loop iterator
->on_inbox_unlock callbacks could clobber $_, and this seems to fix a problem with -extindex --watch failing to index some inboxes after SIGHUP reload.
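A minimal sketch of how an unnamed iterator can be clobbered. The hash contents and noisy_callback are hypothetical stand-ins for inbox objects and an ->on_inbox_unlock callback.

```perl
use strict;
use warnings;

my %inboxes = (a => 'inbox-a', b => 'inbox-b');

# Stand-in for a callback that carelessly assigns to the global $_.
sub noisy_callback { $_ = 'clobbered' }

# Unnamed iterator: $_ is aliased to each hash value, so the callback
# overwrites the values themselves.
my %unnamed = %inboxes;
for (values %unnamed) { noisy_callback() }

# Named iterator: the callback still writes to the global $_, but the loop
# variable and the hash values are unaffected.
my %named = %inboxes;
for my $ibx (values %named) { noisy_callback() }

print "unnamed: $unnamed{a}, named: $named{a}\n";
```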
2020-12-26 inboxidle: avoid needless syscalls on refresh
We don't have to replace a bunch of existing watches with identical new ones. On Linux with Linux::Inotify2 installed, this avoids a storm of inotify_add_watch(2) and inotify_rm_watch(2) syscalls on SIGHUP with -imapd and "-extindex --watch".
2020-12-26 inboxidle: clue users into resolving ENOSPC from inotify
It may not be obvious to users that an ENOSPC error is from hitting a (tunable) kernel-imposed limit on inotify watches, and not some storage device running out of space. Give them a hint here to reduce our own support burden.
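A minimal sketch of such a hint. The inotify_hint helper and its wording are hypothetical. The only assumption about the kernel is that the Linux limit on watches is the fs.inotify.max_user_watches sysctl.

```perl
use strict;
use warnings;
use Errno qw(ENOSPC);

# Hypothetical helper: map an errno from a failed inotify watch to a hint.
sub inotify_hint {
    my ($errno) = @_;
    return '' unless $errno == ENOSPC;
    "hint: ENOSPC from inotify usually means the watch limit was hit;\n" .
    "hint: try raising fs.inotify.max_user_watches\n";
}

print inotify_hint(ENOSPC);
```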
2020-12-26 v2writable: don't verify tip if reindexing
We only rely on git-rev-parse to resolve symbolic names ("HEAD") to a SHA-* git commit ID. We'll assume any git commit IDs we get from SQLite DBs are valid and let "git-log" fail if they aren't.
2020-12-26 index: fix --no-fsync flag propagation to extindex
Negation in flag names is confusing, but trying to deviate from the DB_NO_SYNC name used by Xapian is also confusing.
2020-12-26 index: do not attach inbox to extindex unless updated
We'll count the number of log changes (regardless of index or unindex) and only attach inboxes to ExtSearchIdx objects when they get new work. We'll also reduce lock bouncing and only update external indices after all per-inbox indexing is done. This also updates existing v2 indexing/unindexing callers to be more consistent and ensures unindex log entries update per-inbox last commit information.
2020-12-26 extsearchidx: close DB handles after use if FD constrained
Most distros ship with low RLIMIT_NOFILE limits and surprises may lurk for admins who configure many inboxes. Keep FD usage under control to avoid EMFILE errors at inopportune times during reindex. From what I can tell, this is the only place where extindex can have unpredictable FD growth when there are thousands of inboxes, and it's in an extremely rare code path.
2020-12-26 extsearchidx: delay SQLite availability checks
This will make attach_inbox faster for no-op calls. It also helps us avoid races in case msgmap or over.sqlite3 gets unlinked while -extindex is running.
2020-12-25 index: support --fast-noop / -F switch
Note: I'm not sure if it's worth documenting and supporting this long-term. We can avoid taking locks for invocations of "index --all" and rely on high-resolution ctime (struct timespec st_ctim) comparisons of msgmap.sqlite3 and the packed-refs + refs/heads directory of the newest epoch. This cuts public-inbox-index invocations with "--all --no-update-extindex -L basic" down from 0.92s to 0.31s. The change with "-L medium" or "-L full" and (default) non-zero jobs is even more drastic, reducing a 12-13s no-op invocation down to the same 0.31s.
2020-12-25 inboxwritable: delay umask_prepare calls
This simplifies all ->with_umask callers and opens the door for further optimizations to delay/elide process spawning.
2020-12-23 config: config_fh_parse: micro-optimize harder
Instead of relying on split() and a regexp, we'll drop split() entirely and rely on index() + two substr() calls to operate on fixed strings. This brings PublicInbox::Config->new time from 0.98s down to 0.84s.
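A minimal sketch of the fixed-string approach, assuming each record has the "key\nvalue" layout emitted by `git config -z -l`. The sample record and variable names are hypothetical, and no timing claim is made for this snippet.

```perl
use strict;
use warnings;

# Assumed record layout: "key\nvalue".
my $record = "publicinbox.foo.address\nfoo\@example.com";

# Regexp-based approach: split() on a pattern.
my ($k_split, $v_split) = split(/\n/, $record, 2);

# Fixed-string approach: one index() and two substr() calls.
my ($k, $v);
my $i = index($record, "\n");
if ($i >= 0) {
    $k = substr($record, 0, $i);
    $v = substr($record, $i + 1);
} else {
    $k = $record; # key without a value
}

print "$k = $v\n" if defined $v;
```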
2020-12-23 config: config_fh_parse: micro-optimize
We can avoid a slow regexp capture and instead rely on rindex + substr to extract the section from the config file. Then we use the defined-or-assignment (//=) operator combined with the documented return value of `push' to ensure @section_order is unique without repeating a hash lookup. Finally, we avoid short-lived variables inside the loop and declare them subroutine-wide to knock off a teeny bit of allocation time. Combined, these optimizations bring the ~1.22s PublicInbox::Config->new time down to ~0.98s with 50K inboxes.
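A minimal sketch of the rindex + substr extraction and the //= with push idiom. The keys and the %seen hash are hypothetical; the real parser reads from a file handle and may be structured differently.

```perl
use strict;
use warnings;

# Hypothetical config keys.
my @keys = qw(
    publicinbox.a.address
    publicinbox.b.address
    publicinbox.a.inboxdir
    core.bare
);

my (%seen, @section_order);
for my $key (@keys) {
    # Everything before the last "." is the section; no regexp capture needed.
    my $section = substr($key, 0, rindex($key, '.'));

    # push returns the new element count (always >= 1, hence defined), so
    # //= only evaluates the push the first time a section is seen.
    $seen{$section} //= push(@section_order, $section);
}

print "@section_order\n";
```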
2020-12-23 config: git_config_dump: pre-compile RE for split
It appears the Perl split() operator is not optimized for fixed strings at all. With this change, PublicInbox::Config->new (w/o ->fill_all) time is reduced from 1.81s to 1.22s on a config file with 50K inboxes.
2020-12-23 config: _fill: inbox name extraction optimization
Using substr() instead of a string copy + s// substitution here reduces ->fill_all from 4.00s to 3.88s with 50K inboxes on my workstation.
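A minimal sketch of the two approaches, assuming the inbox name follows a fixed "publicinbox." prefix. The $pfx value is hypothetical.

```perl
use strict;
use warnings;

my $pfx = 'publicinbox.example'; # hypothetical section name

# String copy + substitution: copies $pfx, then runs a regexp on the copy.
(my $name_subst = $pfx) =~ s/\Apublicinbox\.//;

# substr() with a fixed offset: no regexp and no intermediate copy to edit.
my $name_substr = substr($pfx, length('publicinbox.'));

print "$name_subst $name_substr\n";
```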
2020-12-23 extsearchidx: close SQLite handles after attaching
This is needed to prevent us from running out of FDs when indexing many inboxes. Perhaps checking these on attach_inbox is unnecessary and may be removed entirely down the line.
2020-12-23 miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-12-23 inboxwritable: _init_v1: set created_at ASAP
This ensures we have UIDVALIDITY to index earlier rather than later for v1 inboxes, matching v2 behavior.
2020-12-23 inbox: git_epoch: correct false comment
The original comment hasn't been true since PublicInbox::Git->modified was changed to use cat_async blob responses. In any case, manifest.js.gz generation already cleans up per-epoch git processes used for ->modified.
2020-12-23 miscsearch: load Xapian at initialization
We need Xapian bindings loaded before calling (Search::)Xapian::Database->new.
2020-12-22 wwwstream: show relative coderepo URLs correctly
Trying to link "foo.git" relative to the current URL usually does not provide correct results, so prefix it by going into the parent directory if an absolute (or protocol-relative) URL is not supplied.
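A minimal sketch of the rule described above. The coderepo_href helper and its regexp are hypothetical and not the wwwstream implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: leave absolute ("scheme://...") and protocol-relative
# ("//host/...") URLs alone; otherwise go up to the parent directory first.
sub coderepo_href {
    my ($url) = @_;
    return $url if $url =~ m!\A(?:[a-z][a-z0-9+.-]*:)?//!i;
    "../$url";
}

print coderepo_href($_), "\n"
    for ('foo.git', '//example.com/foo.git', 'https://example.com/foo.git');
```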
2020-12-22 admin: resolve inboxes to absolute paths for index
Some of my ancient v1-only scripts called public-inbox-index to operate on GIT_DIR:

	GIT_DIR=/path/to/foo.git public-inbox-index

This change ensures they keep working, otherwise "." will be passed to the --git-dir= switch of git(1) because that's the default directory if no inboxes are specified on the command-line.
Fixes: 9fcce78e40b0a7c6 ("script/public-inbox-*: favor caller-provided pathnames")
2020-12-22 support multiple CODE_URLs
public-inbox.org will expire in a few years, so ensure Tor .onions can be known before then.
2020-12-21 extsearch*: drop unnecessary path canonicalization
Unlike inboxdir, the canonical-ness of -extindex paths is not relevant at the moment, and may never be relevant at all. So don't mislead others into thinking these paths being canonicalized matters.
2020-12-21 searchidx: rename get_val to int_val and return IV
Values can be strings in Xapian, although we currently use integer values exclusively. Give the wrapper a more appropriate name in case we start using string columns. For future-proofing, we'll now return `undef' on missing columns and coerce the return value to an IV (integer value) to save memory, as sortable_unserialise returns a PV (pointer value) scalar despite it existing to support numeric values.
2020-12-21 use rel2abs_collapsed when loading Inbox objects
We need to canonicalize paths for inboxes which do not have a newsgroup defined, otherwise ->eidx_key matches can fail in unexpected ways.
2020-12-21 isearch: use numeric sort for article numbers
Perl sort is alphabetical by default and Xapian uses numeric document IDs, so sort must be told explicitly to use numeric comparisons even if the scalars are integer values (IV) internally. And eliminate extra hash marks ("#") since they're probably too noisy if there are many IDs. Note: I haven't seen this warning message in syslog, yet :>
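A minimal sketch of the sorting pitfall. The ID values are made up for illustration and are not taken from isearch.

```perl
use strict;
use warnings;

# Hypothetical article numbers, stored as integers.
my @ids = (10, 9, 100, 2);

# Default sort compares as strings, even though the scalars are integers.
my @by_string = sort @ids;

# An explicit <=> comparator is needed for numeric order.
my @by_number = sort { $a <=> $b } @ids;

print "string:  @by_string\n";  # 10 100 2 9
print "numeric: @by_number\n";  # 2 9 10 100
```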
2020-12-21 inbox: delay ->version detection
Our read-only code won't need to know the version until an inbox is accessed. This is a small step towards eliminating many stat() calls on read-only daemon startup.