path: root/lib/PublicInbox
Date  Commit message
2019-06-04  feed: only accept ASCII digits for ref~$N
We don't want to waste cycles passing non-ASCII characters to git.
2019-06-04  mid: id_compress requires ASCII-clean words
Its result is used for HTML anchors and such.
2019-06-04  nntp: ensure we only handle ASCII whitespace
RFC 3977 does not have provisions for whitespace beyond ASCII TAB, SP, CR and LF. I doubt there are any NNTP clients broken enough to be sending non-ASCII whitespace delimiters. We're probably excessively liberal regarding TAB acceptance, even; but it's probably too late to change at this point...
2019-06-04  nntp: be explicit about ASCII digit matches
We aren't able to make sense of non-ASCII digits; cf. the "Digits" section of perlrecharclass(1).
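The digit-matching pitfall can be illustrated with a rough Python analogue (the project itself is Perl; this is not its code). Like Perl's `\d`, Python's `\d` on str patterns matches any Unicode decimal digit, so an explicit `[0-9]` class is the safer way to accept only ASCII digits. The helper name `parse_article_number` is hypothetical.

```python
import re

# Python analogue of the Perl behaviour: \d matches any Unicode decimal
# digit on str patterns, while [0-9] matches ASCII digits only.
UNICODE_DIGITS = re.compile(r"\d+")
ASCII_DIGITS = re.compile(r"[0-9]+")


def parse_article_number(text):
    """Hypothetical helper: return int(text) only for ASCII digits, else None."""
    if ASCII_DIGITS.fullmatch(text):
        return int(text)
    return None


# ARABIC-INDIC DIGITS ONE, TWO, THREE: decimal digits, but not ASCII.
arabic_indic = "\u0661\u0662\u0663"
```

With these definitions, `UNICODE_DIGITS.fullmatch(arabic_indic)` succeeds, whereas `parse_article_number(arabic_indic)` rejects the input.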
2019-06-04  linkify: support Internationalized Domain Names in URLs
The "\w" character class in Perl matches any word characters in the Unicode database, not just ASCII characters. So we must be prepared for that and generate links to IDNs.
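As a rough Python analogue (not the project's linkify regexp), `\w` on str patterns is also Unicode-aware, so a host pattern built on it picks up IDN labels unless ASCII-only matching is requested. The pattern and sample text below are made up for illustration.

```python
import re

# Simplified, hypothetical host pattern; the real linkify regexp is more involved.
HOST_UNICODE = re.compile(r"https?://[\w.\-]+")
HOST_ASCII = re.compile(r"https?://[\w.\-]+", re.ASCII)

# "bücher.example" written with an escape to keep the source ASCII-clean.
text = "see https://b\u00fccher.example/ for details"

unicode_hit = HOST_UNICODE.search(text).group(0)
ascii_hit = HOST_ASCII.search(text).group(0)
```

Here `unicode_hit` covers the whole host name, while `ascii_hit` stops at the first non-ASCII letter.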
2019-06-03  ds: remove PLCMap and per-socket PostLoopCallback
We don't need and won't be needing per-socket PostLoopCallbacks.
2019-06-02  ds: drop write_set_watch field
We never enable write watches ourselves for HTTP and NNTP, and only enable the write watch with EvCleanup because it's an "always on" watch.
2019-06-02  ds: drop unused EVENT: label in epoll code path
This was never used in Danga::Socket 1.61, either.
2019-06-02  ds: drop checks for invalid descriptors
I've used Danga::Socket for well over a decade in various projects at this point and have never seen the need for it. If such a bug ever happens, the process should fall over so it gets fixed ASAP.
2019-06-02  ds: drop set_writer_func support
This is not used by perlbal for OpenSSL support, either; and it does not appear to be the right layer for doing write translations anyways (IO::Socket::SSL uses `tie').
2019-06-02  ds: add a note about planned future changes
Sometimes I get bored with the email part of this project and need a distraction :P
2019-06-02  ds: drop more unused subs
ToClose and HaveEpoll are of no use to us and I see no future use for them, either.
2019-06-01  ds: fix and test for FD leaks with kqueue on ->Reset
Even though we currently don't use it repeatedly, ->Reset should close() kqueue FDs and not cause the process to run out of descriptors. Add a close-on-exec test while we're at it.
2019-06-01  ds: set close-on-exec flag on epoll descriptors
We should not be leaking these FDs to git(1) processes, in case git has a bug that causes it to access the wrong FD.
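A minimal sketch of the close-on-exec idea, in Python rather than the project's Perl: setting `FD_CLOEXEC` on a descriptor keeps it from being inherited by exec'd children such as git(1). A pipe stands in for the epoll descriptor here, and the helper names are hypothetical. On Linux, `epoll_create1(EPOLL_CLOEXEC)` can set the flag atomically at creation time instead.

```python
import fcntl
import os


def set_cloexec(fd):
    """Set FD_CLOEXEC so fd is closed automatically across exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_cloexec(fd):
    """Report whether FD_CLOEXEC is currently set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


# Stand-in descriptor: Python creates pipes non-inheritable by default,
# so make it inheritable first to mimic a leaky descriptor.
r, w = os.pipe()
os.set_inheritable(r, True)
leaky_before = not is_cloexec(r)
set_cloexec(r)
safe_after = is_cloexec(r)
os.close(r)
os.close(w)
```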
2019-06-01  git: drop the deleted err_c file
No reason to leave that (usually) empty file open after killing off "cat-file --batch-check". This wasn't an unbounded leak, though, as respawning the --batch-check process would've clobbered the old err_c file.
2019-06-01  git: unconditional expiry
A constant stream of traffic to either httpd/nntpd would mean git-cat-file processes never expire. Things can go bad after a full repack, as a full repack will unlink old pack indices and git-cat-file does not currently detect unlinked files. We could do something complicated by recursively stat-ing objects/pack of every git directory and alternate; but that's probably not worth the trouble compared to occasionally restarting the cat-file process. So simplify the code and let httpd/nntpd expire them periodically, since spawning a "git-cat-file --batch" process isn't too expensive. We already spawn for every request which hits git-http-backend, cgit, and git-apply. In the future, we may optionally support the Git::Raw module to avoid IPC; but we must remain careful to not leave lingering FDs open to unlinked files after repack.
2019-05-31  viewdiff: avoid repeat variable expansion
This is worth a 1-2% speedup in t/perf-msgview.t when rendering the 2620 messages currently in https://public-inbox.org/meta/
2019-05-30  v2writable: short-circuit is_ancestor check on equality
We don't need to use git to check ancestry if object IDs match on a string comparison. This saves 100ms or so and brings the ~0.5s no-op time on lore.kernel.org/lkml down to ~0.4s.
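The short-circuit can be sketched as follows. This is a Python approximation under assumptions, not the project's Perl: the function name and arguments are hypothetical, and only the `git merge-base --is-ancestor` invocation is a real git command (exit status 0 means "is an ancestor").

```python
import subprocess


def is_ancestor(git_dir, cur, tip):
    """Return True if commit `cur` is an ancestor of (or equal to) `tip`."""
    # Identical object IDs: answer without spawning git at all.
    if cur == tip:
        return True
    cmd = ["git", "--git-dir=" + git_dir,
           "merge-base", "--is-ancestor", cur, tip]
    # Exit status 0 means ancestor; 1 means not an ancestor. This sketch
    # also treats any other (error) status as "not an ancestor".
    return subprocess.run(cmd).returncode == 0
```

Because the equality check comes first, the no-op case never pays for a process spawn.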
2019-05-30  v2writable: avoid mm_tmp creation without regen
Creating mm_tmp is an expensive operation with large inboxes and can be avoided if there are no new messages to process. Since git-fetch(1) currently lacks an --exit-code option(*), mirrors will run `public-inbox-index' unconditionally after fetch, which is an expensive op if it needs to duplicate a large SQLite DB.
This speeds up the mirror case of:
    git --git-dir=git/$EPOCH.git fetch && public-inbox-index
This reduces the no-op `public-inbox-index' time from over 8s to ~0.5s on a (currently) 7-epoch clone of https://lore.kernel.org/lkml/ on my system.
(*) WIP --exit-code for git-fetch: https://public-inbox.org/git/87ftphw7mv.fsf@evledraar.gmail.com/
2019-05-30  v2writable: hoist out index_epoch sub
This will make future changes easier-to-follow.
2019-05-30  v2writable: split off unindex_range mapping
It'll make it easier to detect if we have anything to unindex and run git-log on, at all.
2019-05-29  searchidx: store indexlevel=medium as metadata
And use it from Admin. It's easy to tell what indexlevel=basic is from unconfigured inboxes, but distinguishing between 'medium' and 'full' would require stat()-ing position.* files which is fragile and Xapian-implementation-dependent. So use the metadata facility of Xapian and store it in the main partition so Admin tools can deal better with unconfigured inboxes copied using generic tools like cp(1) or rsync(1).
2019-05-29  v2writable: show progress updates for index_sync
We can show progress whenever we commit changes to the FS.
2019-05-29  index: support --verbose option
It doesn't implement progress of batches, yet, but it wires up the parsing of the command-line while preserving output compatibility. This output is NOT meant to be stable.
2019-05-29  v2writable: move index_sync options to sync state
And use singular `opt' to be consistent with the common name of 'getopt'.
2019-05-29  v2writable: use prototypes for internal subs
Hopefully this improves maintainability by allowing Perl to do some arg checking for us.
2019-05-29  v2writable: localize unindex-range.$EPOCH to $sync state
We don't need to stuff that into $self (V2Writable) which can be longer-lived than a ->index_sync invocation.
2019-05-29  v2writable: move {ranges} into $sync state
Yet another temporary variable with no use outside of index_sync.
2019-05-29  v2writable: move {regen} into $sync state
regen is always enabled for index_sync nowadays (and has been for a while). Rename `index_prepare' to `sync_prepare' to show it's for ->index_sync; and not the online indexing we do for ->add.
2019-05-29  v2writable: move {reindex} field to $sync state
reindexing info is not used outside of the index_sync code path.
2019-05-29  v2writable: sync: move delete markers into $sync state
Another small step to reduce parameters passed to reindex_oid.
2019-05-29  v2writable: introduce $sync state and put mm_tmp in it
A first step towards making the v2 index_sync code easier-to-follow. More fields to follow...
2019-05-27  v2: fix reindex skipping NNTP article numbers
`public-inbox-index --reindex' could cause NNTP article number gaps to form when it also has to deal with new, never-before-seen commits in mirrors running off `git fetch'. Fix this by running two distinct invocations of ->index_sync; once to only reindex old commits, and a second time to index new commits. This does not appear to be a problem on v1 at the moment, but I'll need more time to analyze this.
2019-05-27  searchidx: fix obvious typo
We can't pass an empty string to `git merge-base --is-ancestor'. AFAIK, this did NOT present issues in the current test suite.
2019-05-26  viewvcs: keep temporary Solver dir for large streams
Streaming large blobs can take multiple iterations of the event loop in our -httpd; so we must not let the File::Temp::Dir result go out-of-scope when streaming large blobs created from patches.
2019-05-25  v2writable: fix assertions around reindexing
Fix a misspelling and ensure line context is printed by `die' by leaving out the final '\n'. Also, `delete' was pointless.
2019-05-25  contrib/css: mark as CC0 (public domain)
No reason to copyright colour schemes :P
2019-05-25  v2writable: drop unused $last_commits var
Apparently it's never been used and we write to msgmap directly.
2019-05-25  msgmap: remove double negative
I have never not found double negatives to be confusing...
2019-05-24  search: don't log all warnings on retry_reopen
Some users (or bots :P) can trigger horrible queries which the caller can choose to either log or ignore. This prevents horrible queries from ExtMsg from logging confusing "ref: " messages when $@ is not a Perl reference.
2019-05-23  doc: various updates to reflect current state
-index documentation avoids redundant v1 information and refers readers to appropriate v1/v2 manpages. Search::Xapian can also be optional, now, as only the PSGI search interface uses it. Favor "INBOX_DIR" where appropriate, since "REPO_DIR" can be confused for code repos which we also support. XAPIAN_FLUSH_THRESHOLD is documented for all relevant bulk commands.
2019-05-23  xapcmd: do not reset %SIG until last Xtmpdir is done
To properly handle compact tmpdir cleanup in single process situations, we need to carefully account for Xtmpdir not being a singleton and ensure we don't clobber signal handlers which belong to other Xtmpdirs.
2019-05-23  xcpdb|compact: support --jobs/-j flag like gmake(1)
We don't have to be tied to the number of partitions in case we made a bad choice at initialization. This doesn't affect reindexing, but the copying phase is already intensive. And optimize away the extra process when we only have a single job which won't parallelize. The wording for the (v2) reindexing phase could be improved, later. I also plan to allow repartitioning of existing Xapian DBs.
2019-05-23  xapcmd: cleanup on interrupted xcpdb "--compact"
We should not have leftover junk on interrupted invocations.
2019-05-23  xcpdb|compact: support some xapian-compact switches
Allow users to specify the --blocksize <B>, --no-full, --fuller options for xapian-compact(1) for fine-tuning compact behavior for low-traffic/inactive inboxes. We won't support --multipass, since it doesn't seem compatible with our requirement to use --no-renumber. We also won't support --single-file, since it only seems intended for totally dead inboxes; and it doesn't seem worth the support overhead when "totally dead" turns out to be a misdiagnosis.
2019-05-23  compact: reuse infrastructure from xcpdb
Since -xcpdb is a superset of -compact, we can reuse much of that code used for driving compact. For compact (only), this is slightly less memory efficient since it requires an extra process per-partition, but we get to prefix the output with the partition name for more readable output.
2019-05-23  xcpdb: remove temporary directories on aborts
Clean up temporary directories on common termination signals (INT, HUP, PIPE, TERM), but only if it's not in the process of being committed via the rename() sequence.
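One way to picture this cleanup rule is the Python sketch below. It is not the project's Perl implementation; the `TmpDirs` class, its `committing` flag, and `install_handlers` are all hypothetical names chosen for illustration.

```python
import shutil
import signal
import sys
import tempfile


class TmpDirs:
    """Track temporary directories and remove them unless a commit is underway."""

    def __init__(self):
        self.dirs = []
        self.committing = False  # set True while the rename() sequence runs

    def new(self):
        path = tempfile.mkdtemp()
        self.dirs.append(path)
        return path

    def cleanup(self):
        # Never delete directories that are being renamed into place.
        if self.committing:
            return
        for path in self.dirs:
            shutil.rmtree(path, ignore_errors=True)
        self.dirs.clear()


def install_handlers(tmp):
    """Remove tracked directories on INT, HUP, PIPE and TERM, then exit."""
    def on_signal(signum, frame):
        tmp.cleanup()
        sys.exit(1)

    for sig in (signal.SIGINT, signal.SIGHUP, signal.SIGPIPE, signal.SIGTERM):
        signal.signal(sig, on_signal)
```

A caller would create one `TmpDirs`, pass it to `install_handlers`, and set `committing = True` just before renaming the results into place.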
2019-05-23  xcpdb: show re-indexing progress
Emit information about reindexing git revision ranges when used with xcpdb. Additionally, distinguish Xapian copy output from v2 git epoch counting by increasing directory context info. For now, v1 batches are emitted. v2 indexing is still missing progress reporting for batches, as the data structures for reindexing would benefit from a refactoring, first. This does not currently affect the use of public-inbox-index, but may in the future.
2019-05-23  xapcmd: use "print STDERR" for progress reporting
`warn' is reserved for actual warnings, as it respects $SIG{__WARN__} and we rely on that override to print message context information when we are indexing.
2019-05-23  xapcmd: avoid EXDEV when finalizing changes
By creating temporary directories as deep as possible, we can allow v2 repositories to have `xap$SCHEMA_VERSION' (e.g. `xap15') reside on a separate FS. We also check st_dev ahead-of-time to avoid doing work which will fail with EXDEV. Of course, another process may still move/change things around.
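The st_dev check and the "as deep as possible" placement can be sketched in Python as follows. This is an illustration under assumptions rather than the project's code; both helper names are hypothetical.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """True if both paths report the same device, i.e. a rename() between
    them should not fail with EXDEV in the common case."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def make_tmp_beside(target_dir):
    """Create the temporary directory next to its final destination, so the
    later rename() stays on one filesystem."""
    parent = os.path.dirname(os.path.abspath(target_dir))
    return tempfile.mkdtemp(prefix="tmp-", dir=parent)
```

As the commit message notes, the ahead-of-time check only avoids wasted work; another process may still move things around before the rename happens.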