path: root/lib/PublicInbox/Inbox.pm
Date Commit message
2021-10-16 inbox + search: use 5.10.1 and do some golfing
Some yak-shaving while I try to track down other bugs...
2021-10-12 msgmap: ->new_file to supports $ibx arg, drop ->new
The original Msgmap->new API was v1-specific and not necessary. The ->new_file API now supports an $ibx object being passed to it, which simplifies -no_fsync use. It will also make an upcoming change easier...
2021-10-12 daemon: unconditionally close Xapian shards on cleanup
The cost of opening a Xapian DB (even with shards) isn't high, so save some FDs and just close it. We hit Xapian far less than over.sqlite3 and we discard the MSet ASAP even when streaming large responses. This simplifies our code a bit and hopefully helps reduce fragmentation by increasing mortality of late allocations.
2021-10-01 inbox: keep DB handles if git processes are live
Having git processes outlive DB handles is likely to hurt from a fragmentation perspective if the DB handle needs to be recreated immediately due to a git->cat_async callback. So only unref DB handles when we're sure there's no live git users left, otherwise check the inodes. We'll also avoid needless localization checks in git->cleanup and make the return value more obvious since the pid fields are unconditionally deleted nowadays.
2021-10-01 inbox: inline and eliminate git_cleanup
It was probably incorrect to use from max_git_epoch, and it's small enough to inline into do_cleanup. We'll also eliminate the unnecessary deletion of {-altid_map} while we're in the area, since we no longer cache/memoize that.
Fixes: 7e5cea05f061e757 ("inbox: rewrite cleanup to be more aggressive")
2021-09-29 inbox: do not vivify {-repo_objs} during cleanup
This caused config->repo_objs to not fill in {-repo_objs} properly before starting solver.
Reported-by: Kyle Meyer <kyle@kyleam.com>
Link: https://public-inbox.org/meta/87o88cqobd.fsf@kyleam.com/
Fixes: 63d7b8ceee55a34 ("daemons: revamp periodic cleanup task")
2021-09-29 inbox: drop memoization/preload, cleanup expires caches
cloneurl, description, and base_url are no longer memoized. The non-$env form of base_url is rare in WWW, and is fast enough to not require memoization. cloneurl and description are now expired during cleanup, allowing admins to change these files without restarting (or SIGHUP). -altid_map is no longer cached nor memoized at all, since the endpoint(s) which hit it seem rarely accessed. nntp_url and imap_url are now cached (instead of memoized) in case an inbox is unvisited for a long time. They remain cached since the truthiness check gets called in every per-inbox HTML page, which can potentially be expensive.
2021-09-29 inbox: rewrite cleanup to be more aggressive
Avoid relying on a giant cleanup hash and instead use the new DS->add_uniq_timer API to amortize the pause times associated with having to clean up many inboxes. We can also use smaller intervals for this. We now discard SQLite DB handles at cleanup. Each of these can use several megabytes of memory, which adds up with hundreds/thousands of inboxes. Since per-inbox access intervals are unpredictable and opening an SQLite handle is relatively inexpensive, release memory more aggressively to avoid the heap having to hit swap.
2021-09-26 inbox: cloneurl: avoid undef to hash table value
This saves us some memory for the hash slot in the common case the `cloneurl' file doesn't exist.
2021-09-26 search: avoid setting undef hashtable entries
`undef' entries still take up a slot in the hash table, and cause the `exists' check to false-positive in ->cleanup_shards. This should fully fix the (innocuous) messages introduced in commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
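The `exists' behavior described above can be reproduced in isolation. The following is a minimal sketch using a hypothetical %cache hash and key names; it is not the actual PublicInbox::Search code, only an illustration of why an undef value still counts as an existing key.

```perl
use strict;
use warnings;

# Hypothetical cache, not the actual PublicInbox::Search fields.
my %cache;
my $shard; # stays undef, e.g. a lookup which found nothing

# Storing undef still creates the key, so `exists' reports true.
$cache{stored} = $shard;

# Only assign when there is a defined value worth keeping.
$cache{skipped} = $shard if defined $shard;

printf "stored:  exists=%d\n", exists $cache{stored} ? 1 : 0;
printf "skipped: exists=%d\n", exists $cache{skipped} ? 1 : 0;
```

Guarding the assignment avoids both the wasted slot and the misleading `exists' result.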
2021-09-23 daemons: revamp periodic cleanup task
Neither Inboxes nor ExtSearch objects were retrying correctly when there are live git processes, but the inboxes were getting rescanned for search or other reasons. Ensure the scan retries eventually if there are live processes. We also need to update the cleanup task to detect Xapian shard count changes, since Xapian ->reopen is enough to detect any other Xapian changes. Otherwise, we just issue an inexpensive ->reopen call and let Xapian check whether there's anything worth reopening. This also lets us eliminate the Devel::Peek dependency.
2021-09-23 gcf2 + extsearch: check for unlinked files on Linux
Check for unlinked mmap-ed files via /proc/$PID/maps every 60s or so. ExtSearch (extindex) is compatible-enough with Inbox objects to be wired into the old per-inbox code, but the startup cost is projected to be much higher down the line when there's >30K inboxes, so we scan /proc/$PID/maps for deleted files before unlinking. With old Inbox objects, it was (and is) simpler to just kill processes w/o checking due to the low startup cost (and non-portability of checking).
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210921144754.gulkneuulzo27qbw@meerkat.local/
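As a rough illustration of the /proc/$PID/maps check, the sketch below parses the text of a maps file and reports file-backed mappings which Linux has marked as " (deleted)". The helper name deleted_maps is hypothetical and this is not the code from the commit; it only assumes the documented maps layout (address, perms, offset, dev, inode, pathname).

```perl
use strict;
use warnings;

# Hypothetical helper: given the text of /proc/$PID/maps, return the
# paths of mapped files which have since been unlinked.
sub deleted_maps {
    my ($maps) = @_;
    my %seen;
    for my $line (split(/\n/, $maps)) {
        # fields: address perms offset dev inode pathname
        my (undef, undef, undef, undef, $ino, $path) = split(' ', $line, 6);
        # anonymous mappings have inode 0 and are skipped
        next unless defined($path) && $ino;
        $seen{$1} = 1 if $path =~ m!\A(/.*) \(deleted\)\z!;
    }
    return sort keys %seen;
}
```

On Linux, the input could be obtained by reading "/proc/$$/maps" for the current process; elsewhere the file does not exist, which matches the non-portability noted in the commit message.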
2021-09-22 inbox: do not waste hash slot on httpbackend_limiter
A few dozen bytes saved here can add up when we have thousands of inboxes. It also makes Data::Dumper debug output a bit cleaner.
2021-09-16 www: support publicinbox.imapserver
This allows PublicInbox::WWW hosts to advertise the existence of IMAP servers in addition to NNTP servers.
2021-09-16 inbox: streamline ->nntp_url
We no longer waste a precious hash slot for a per-Inbox {nntpserver} if it's only configured globally for all inboxes.
2021-08-28 www: avoid potential auto-vivification on ibx->{url}
This may fix problems with the "all" link disappearing.
Link: https://public-inbox.org/meta/CAMwyc-Tw=v5yT1U1U66GSwwTK8OJXv8_YDu-=oXbZO3tHSnYWw@mail.gmail.com/
2021-08-25 imap+nntp: die loudly if ->mm or ->over disappear
While the WWW front-end can gracefully handle ->mm and ->over disappearing (in most cases), IMAP+NNTP front-ends are completely dependent on these and fail mysteriously when they go missing after startup. These will hopefully make issues like what Konstantin encountered more obvious:
Link: https://public-inbox.org/meta/20210824204855.ejspej4z7r2rpu63@nitro.local/
2021-05-05 lei rediff: regenerate diffs from stdin
Sometimes a mailed patch is generated with non-ideal output (lacking context, noisy whitespace changes, etc.), or a user wants to use the same external diff viewer they've configured git to use. Since we have SolverGit to regenerate arbitrary blobs from patches, this new command allows us to regenerate a diff with different options using the blobs SolverGit gives us. The number of git-diff(1) options is mind-numbing, so it's likely I missed some favorites or botched the getopt spec translation. This also fixes Inbox::base_url to check psgi.url_scheme before attempting to generate URLs and avoid uninitialized variable warnings. Oddly, the "lei blob" tests did not trigger these uninitialized warnings. Note: this will automatically import+index the message(s) it's regenerating, because solver relies on being able to look up pre/postimage OIDs and read blobs.
2021-03-02 inbox: ->mailboxid accessor
This will be necessary for "mailboxIds" as described in RFCs 8620 and 8621 (for JMAP). We may implement "MAILBOXID" in RFC 8474 for IMAP, as well.
2021-02-04 tests: guard against missing DBD::SQLite
The features we use for SharedKV could probably be implemented with GDBM_File or SDBM_File, but that doesn't seem worth it at the moment since we depend on SQLite elsewhere.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-28 check defined return value for localized slurp errors
Reading from regular files (even on STDIN) can fail when dealing with flakey storage.
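A minimal sketch of the pattern the subject line names, assuming "localized slurp" means reading with `local $/' in effect. The slurp helper here is hypothetical and only shows the defined() check; it is not the project's code.

```perl
use strict;
use warnings;

# Hypothetical helper: read all of $path, dying if the read itself fails.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    local $/; # undef record separator: read everything at once
    my $buf = <$fh>;
    # an empty file yields '' here; undef is treated as a read failure
    defined($buf) or die "read $path: $!";
    return $buf;
}
```

Testing with defined() instead of truthiness matters because an empty file or a file containing only "0" is a valid, false-looking result.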
2020-12-26 inbox: name variable for values loop iterator
->on_inbox_unlock callbacks could clobber $_, and this seems to fix a problem with -extindex --watch failing to index some inboxes after SIGHUP reload.
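The hazard of an implicit $_ iterator can be shown with a small, self-contained sketch. The hash contents and the careless_cb callback are hypothetical stand-ins, and this is only one way such clobbering can happen, not necessarily the exact failure seen with -extindex --watch.

```perl
use strict;
use warnings;

my %ibx = (a => 'inbox-a', b => 'inbox-b'); # stand-in values

# A callback which assigns to the global $_ without localizing it,
# as a bare `while (<$fh>)' loop would.
sub careless_cb { $_ = 'clobbered' }

# The implicit iterator aliases $_ to each hash value, so the
# callback overwrites the values themselves.
my %implicit = %ibx;
for (values %implicit) { careless_cb() }

# A named lexical iterator is not reachable through $_.
my %named = %ibx;
for my $v (values %named) { careless_cb() }

print "implicit: $implicit{a}\n";
print "named: $named{a}\n";
```

With the named iterator, the callback can still modify the global $_, but the values being iterated over are left alone.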
2020-12-23 inbox: git_epoch: correct false comment
The original comment hasn't been true since PublicInbox::Git->modified was changed to use cat_async blob responses. In any case, manifest.js.gz generation already cleans up per-epoch git processes used for ->modified.
2020-12-21 inbox: delay ->version detection
Our read-only code won't need to know the version until an inbox is accessed. This is a small step towards eliminating many stat() calls on read-only daemon startup.
2020-12-17 inbox: simplify v2 epoch counting
Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
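The readdir/grep/substr/max chain described above might look roughly like the following. This is a sketch under assumptions: the max_epoch name and the "N.git" directory layout are illustrative, and the real method in Inbox.pm may differ in detail.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: return the highest N among "N.git" entries in
# $dir, or undef if the directory is unreadable or has no epochs.
sub max_epoch {
    my ($dir) = @_;
    opendir(my $dh, $dir) or return;
    # readdir in list context returns every entry at once, grep keeps
    # the epoch names, substr drops ".git", and "+ 0" integerizes
    return max(map { substr($_, 0, -4) + 0 }
               grep(/\A[0-9]+\.git\z/, readdir($dh)));
}
```

List::Util::max returns undef for an empty list, so a caller can apply defined-or (//) to the result to pick a fallback.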
2020-12-17 inbox: ->uidvalidity returns undef w/o ->mm
While totally unindexed inboxes are rare, we still support them for v1 and may hit code which calls this method. Just return `undef' when ->mm access fails.
2020-12-14 PublicInbox::Feed owns `feedmax' default value
There's no need to have extra code in the Inbox package for this or to waste dozens of bytes for every Inbox object which uses the default value. This makes our code more flexible w.r.t Inbox-like ExtSearch objects and fixes uninitialized value warnings with ->ALL.
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05 inbox: simplify ->search and callers
Stop leaking WWW/PSGI-specific logic into classes like PublicInbox::Inbox, which is used universally. We'll also decouple $ibx->over from $ibx->search and just duplicate the code inside ->over to reduce argument complexity in ->search. This is also a step in moving away from using {psgi.errors} to ease code sharing between IMAP, NNTP, and command-line interfaces. Perl's built-in `warn' and `local $SIG{__WARN__}' provide all the flexibility we need to control warning output and should be universally understood by Perl hackers who may be unfamiliar with PSGI.
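To illustrate the `warn' plus `local $SIG{__WARN__}' approach mentioned above, here is a minimal sketch. The capture_warnings wrapper is a hypothetical name for illustration, not a public-inbox API; it only shows that a caller can redirect warnings for the duration of a call without any PSGI machinery.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run $code with warnings collected instead of
# being printed to STDERR.
sub capture_warnings {
    my ($code) = @_;
    my @msgs;
    # `local' restores the previous handler when this sub returns
    local $SIG{__WARN__} = sub { push @msgs, @_ };
    $code->();
    return @msgs;
}

my @w = capture_warnings(sub { warn "search failed\n" });
print "captured: $_" for @w;
```

A WWW caller could forward the collected messages to its own error stream, while IMAP, NNTP, or command-line callers could leave the default handler in place.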
2020-11-24 manifest: support faster generation via [extindex "all"]
For a mirror of lore.kernel.org with >140 inboxes, this speeds up manifest.js.gz generation from ~1s to 40ms on my HW. This is still unacceptable when dealing with thousands of inboxes, but gets us closer to where we need to be.
2020-11-24 inbox: git_epoch: remove ->version check
If $epoch is supplied to this method, there are already epochs and an extra method call for ->version is a pointless waste of CPU cycles.
2020-11-24 manifest: use ibx->git_epoch method for v2
We can slightly reduce the amount of version-specific logic, here.
2020-11-07 extsearch: wire up remaining Inbox-like methods for WWW
This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-11-07 extsearch: wire up smsg_eml
We'll probably still need synchronous message retrieval in a few places (tests, at least).
2020-10-16 inbox: add uidvalidity method
This will make it easier to deal with ExtSearchIdx, which won't have msgmap.
2020-09-10 use "\&" where possible when referring to subroutines
"*foo" is ambiguous in that it may refer to a bareword file handle; so we'll use it where we can without triggering warnings. PublicInbox::TestCommon::run_script_exit required dropping the prototype, however. We'll also future-proof by dropping "use warnings" in Cgit.pm and use the less-ambiguous "//=" in Inbox.pm while we're in the area.
2020-09-03 search: remove {over_ro} field
Only inbox accesses the read-only {over}, now, instead of going through ->search. This simplifies our object graph and avoids potentially redundant FDs and DB handles pointing to the same over.sqlite3 file.
2020-08-27 over: rename ->connect method to ->dbh
`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and DBD::SQLite(3pm) manpages.
2020-08-20 www: reduce long-lived PublicInbox::Search references
While this is unlikely to be a problem in current practice, keeping Xapian DBs open for long responses can interfere with free space recovery after -compact. In the future, it will interfere with inbox search grouping and lead to unexpected results.
2020-07-14 over+msgmap: do not store filename after DBI->connect
SQLite already knows the filename internally, so avoid having it as a long-lived Perl SV to save some bytes when there are many inboxes and open DBs.
2020-07-14 nntpd+imapd: detect unlinked msgmap
While it's even less common to experience a replaced msgmap.sqlite3 file, BOFHs may do the darndest things. This is another step towards reducing the number of needless wakeups we need to do in long-lived read-only daemons.
2020-06-28 inbox: warn on ->on_inbox_unlock exception
Otherwise, we may never know what went wrong.
2020-06-13 nntpd+imapd: detect replaced over.sqlite3
For v1 inboxes (and possibly v2 in the future, for VACUUM), public-inbox-compact replaces over.sqlite3 with a new file. This doesn't need an extra inotify watch descriptor (or FD for kevent) at the moment, so it can coexist nicely for systems w/o IO::KQueue or Linux::Inotify2.
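One way to notice a replaced file without any watch descriptor is to compare the device and inode numbers of an already-open handle against a fresh stat of the path. The sketch below assumes that approach and uses a hypothetical file_replaced helper; the commit message does not spell out the mechanism, so this should not be read as the daemons' actual implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $path no longer refers to the file
# behind the already-open handle $fh (replaced or unlinked).
sub file_replaced {
    my ($fh, $path) = @_;
    my @cur = stat($path) or return 1; # path is gone
    my @old = stat($fh) or return 1;
    # element 0 is the device number, element 1 is the inode number
    return ($cur[0] != $old[0] || $cur[1] != $old[1]) ? 1 : 0;
}
```

A long-lived reader could call such a check periodically and reopen its handle when it returns true. This relies on POSIX rename semantics, where the old inode stays alive as long as a handle to it remains open.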
2020-06-13 git: move async_cat reference to PublicInbox::Git
Trying to avoid a circular reference by relying on $ibx object here makes no sense, since skipping GitAsyncCat::close will result in an FD leak, anyways. So keep GitAsyncCat contained to git-only operations, since we'll be using it for Solver in the distant future.
2020-06-13 imap: use git-cat-file asynchronously
This ought to improve overall performance with multiple clients. Single client performance suffers a tiny bit due to extra syscall overhead from epoll. This also makes the existing async interface easier-to-use, since calling cat_async_begin is no longer required.
2020-06-13 inboxidle: new class to detect inbox changes
This will be used to implement IMAP IDLE, first. Eventually, it may be used to trigger other things:
* incremental internal updates for manifest.js.gz
* restart `git cat-file' processes on pack index unlink
* IMAP IDLE-like long-polling HTTP endpoint
And maybe more things we haven't thought of, yet. It uses Linux::Inotify2 or IO::KQueue depending on what packages are installed and what the kernel supports. It falls back to nanosecond-aware Time::HiRes::stat() (available with Perl 5.10.0+) on systems lacking Linux::Inotify2 and IO::KQueue. In the future, a pure Perl alternative to Linux::Inotify2 may be supplied for users of architectures we already support signalfd and epoll on.
v2 changes:
- avoid O_TRUNC on lock file
- change ctime on Linux systems w/o inotify
- fix naming of comments and fields
2020-06-03 www: remove smsg_mime API and adjust callers
To further simplify callers and avoid embarrassing memory explosions[1], we can finally eliminate this method in favor of smsg_eml.
[1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-06-03 inbox: msg_by_*: remove $(size)ref args
None of our current callers care about the size of the blob we're retrieving, so stop wasting stack space and code for it.