path: root/lib
Date Commit message
2020-11-28 nntp: NEWNEWS: speed up filtering
With 50K newsgroups, the filtering phase goes from ~2000 seconds to ~90 MILLISECONDS by relying on the grep perlop. This moves ->over checking out of the main dispatch and amortizes the cost via long_response. (Fairly scheduled) long_response time in newnews_i now takes ~360 seconds as opposed to ~30 seconds before this change; however, the initial filtering speedup eliminating 2000s is more than worth it.
2020-11-28 nntp: use grep operation for wildmat matching
Based on experiences with the IMAP server, this ought to be significantly faster (as to be demonstrated in the next commit).
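The idea of compiling a wildmat once and filtering the whole group list in a single pass can be illustrated with a minimal sketch. This is Python rather than the project's Perl, it handles only `*` and `?` (no comma-separated patterns, negation, or bracket sets), and the names `wildmat_to_regex` and `filter_groups` are hypothetical, not taken from the commits.

```python
import re

def wildmat_to_regex(wildmat):
    """Translate a simple wildmat ('*' and '?' only) into a compiled regex."""
    parts = []
    for ch in wildmat:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def filter_groups(groups, wildmat):
    """Compile the pattern once, then filter the entire list in one pass."""
    rx = wildmat_to_regex(wildmat)
    return [g for g in groups if rx.fullmatch(g)]

# Hypothetical newsgroup names, for illustration only
groups = ["inbox.comp.test", "inbox.comp.mail", "org.example.list"]
matched = filter_groups(groups, "inbox.comp.*")
```

The list comprehension plays the role that a single grep over the group names plays in Perl: the pattern is built once, and no per-group dispatch happens during matching.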
2020-11-28 mm: min/max: return 0 instead of undef
This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-28 nntpd: share {groups} hash with {-by_newsgroup} in Config
There's no need to duplicate a potentially large hash, but we can keep the inexpensive shortcut to it. We may eventually drop the {groups} shortcut if it's no longer useful.
2020-11-28 nntp: use Inbox->uidvalidity instead of ->mm->created_at
This is memoized, and may allow us some future flexibility w.r.t PublicInbox::Inbox-like objects. While we're at it, use defined-or ("//") in case somebody really set a public-inbox creation time to the Unix epoch.
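Perl's defined-or (`//`) falls back only when the left side is undefined, whereas `||` also falls back on any false value such as 0. A rough Python analogue of the difference, using hypothetical function names, might look like this:

```python
def uidvalidity(created_at, fallback):
    """Analogue of Perl's '//': fall back only when the value is missing."""
    return created_at if created_at is not None else fallback

def uidvalidity_truthy(created_at, fallback):
    """Analogue of Perl's '||': wrongly discards a legitimate value of 0."""
    return created_at or fallback
```

With a creation time of 0 (the Unix epoch), the first function keeps the 0 while the second silently substitutes the fallback.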
2020-11-24 extsearchidx: deduplicate alternates based on st_dev + st_ino
This allows us to filter out duplicate alternates entries in case there's symlinks or bind mounts in play, as I (and perhaps some other users) tend to use symlinks and/or bind mounts heavily.
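A minimal sketch of deduplicating paths by device and inode number, assuming a POSIX filesystem. It is written in Python for illustration and is not the project's Perl code; `dedupe_paths` is a hypothetical name.

```python
import os

def dedupe_paths(paths):
    """Keep the first path for each (st_dev, st_ino) pair; skip missing paths."""
    seen = set()
    unique = []
    for path in paths:
        try:
            st = os.stat(path)  # follows symlinks, so a link maps to its target
        except OSError:
            continue
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

Because `os.stat` resolves symlinks and reports the underlying inode, two different path strings that reach the same directory collapse into one entry.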
2020-11-24 wwwattach: prevent deep-linking via Referer match
This prevents `<img src=' tags from being used to deep-link image attachments from HTML outside of the current host and reduces potential for abuse. Some browsers (e.g. Firefox) favor content detection and will display images irrespective of the Content-Type header being "application/octet-stream", and "Content-Disposition: attachment" doesn't stop them, either. Tested with dillo and Firefox. Reported-by: Leah Neukirchen <leah@vuxu.org>
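The general shape of a Referer check can be sketched as follows. This is an illustrative Python sketch only: the function name `referer_allowed`, the `allow_missing` parameter, and the policy for requests without a Referer header are all assumptions, and the actual check in the commit may differ.

```python
from urllib.parse import urlsplit

def referer_allowed(referer, host, allow_missing=False):
    """Return True when the Referer names the same host as the request.

    Assumption: a missing Referer is rejected unless allow_missing is set.
    """
    if not referer:
        return allow_missing
    return urlsplit(referer).netloc.lower() == host.lower()
```

A caller would pass the request's Referer and Host header values, and serve the attachment inline only when the function returns True.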
2020-11-24 gcf2: workaround libgit2 alternates bug for extindex
While libgit2 handles alternates with relative paths properly for v2 epochs, nesting them another layer with extindex uses the wrong relative path expansion (and is inconsistent with git(1) behavior). Fortunately, it's possible to work around this libgit2 bug entirely within Gcf2 and avoid further special cases throughout the rest of our code to support extindex. Link: https://bugs.debian.org/975607
2020-11-24 *search: simplify retry_reopen users
Every callback uses `$self', and creating short-lived array references is not necessary when it's just as easy to copy the array in Perl (unlike C).
2020-11-24 manifest: support faster generation via [extindex "all"]
For a mirror of lore.kernel.org with >140 inboxes, this speeds up manifest.js.gz generation from ~1s to 40ms on my HW. This is still unacceptable when dealing with thousands of inboxes, but gets us closer to where we need to be.
2020-11-24 extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
This was intended to make development easier, but it also allows description, URL, and address changes to be picked up independently of message history.
2020-11-24 miscidx: store absolute git_dir of each epoch in docdata
This will make it possible to map reference repos in case somebody uses the feature.
2020-11-24 miscidx: cleanup git processes after manifest indexing
We shouldn't leave "cat-file --batch" processes around when we're done with an epoch or inbox, since there could be many thousands.
2020-11-24 extsearch: fix remaining "eindex" references
We'll replace "$EINDEX" => "$EXTINDEX" in a user-visible line and also some hacker-only tests. "eindex" is no longer used because it rhymes with "reindex", so remove the last instance of it. Fixes: 6b0fed3b03263ba2 ("extsearch: rename -eindex to -extindex")
2020-11-24 miscidx: put grokmirror manifest entries in Xapian docdata
This should make it possible for us to quickly generate manifest.js.gz files with less random I/O and process spawning in the WWW code.
2020-11-24 inbox: git_epoch: remove ->version check
If $epoch is supplied to this method, there's already epochs and an extra method call for ->version is a pointless waste of CPU cycles.
2020-11-24 manifest: use ibx->git_epoch method for v2
We can slightly reduce the amount of version-specific logic, here.
2020-11-24 git: add manifest_entry method
We'll be using this for MiscIdx and pre-generating the necessary JSON for manifest.js.gz, so make it easier to share code for generating per-repo JSON entries for grokmirror.
2020-11-24 move JSON module portability into PublicInbox::Config
We'll be using JSON in MiscIdx and MiscSearch, and PublicInbox::Config seems like an appropriate place to put it.
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-17 v2writable: avoid initiating leftover unindex if interrupted
We can also avoid a needless progress message on log2stack interruptions, too.
2020-11-15 searchidx: check for graceful shutdown in log2stack
The initial "git log" invocation for a git epoch can be time consuming, so check for graceful shutdown at each line to ensure timely shutdowns and avoid SSD/HDD wear.
2020-11-15 extindex: support graceful shutdown via QUIT/INT/TERM
Just like the daemon processes, -extindex now supports graceful shutdown via the same signals. This lets users avoid having to repeat indexing messages when a power outage strikes during a long (multi-hour/day) indexing run. Per-inbox (v1/v2) -index graceful shutdowns are not supported yet, but are planned for later.
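The pattern of setting a flag from a signal handler and checking it at each unit of work can be sketched as follows. This is a Python illustration for POSIX systems, not the project's Perl implementation; `ShutdownFlag`, `index_lines`, and `index_one` are hypothetical names.

```python
import signal

class ShutdownFlag:
    """Records that a graceful shutdown was requested."""

    def __init__(self):
        self.requested = False

    def install(self, signums=(signal.SIGQUIT, signal.SIGINT, signal.SIGTERM)):
        # Must be called from the main thread; SIGQUIT assumes POSIX.
        for signum in signums:
            signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Do no real work here; the indexing loop notices the flag.
        self.requested = True

def index_lines(lines, flag, index_one):
    """Index lines until finished or shutdown is requested; return count done."""
    done = 0
    for line in lines:
        if flag.requested:
            break  # stop between items so stored state stays consistent
        index_one(line)
        done += 1
    return done
```

Checking the flag once per line keeps shutdown latency bounded by the cost of a single item, and work finished before the signal does not need to be repeated if progress is recorded at that point.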
2020-11-15 *index: discard sync->{todo} on iteration
There's no need to continuously append to {todo} when indexing multiple inboxes. They're not redundantly indexed (because the IdxStack is discarded, making it a noop), but it's still a waste of memory keeping the $unit hashrefs around.
2020-11-15 *index: avoid per-epoch --batch-check processes
Since all.git (v2) and ALL.git (extindex) encompass every single epoch or indexed inbox; and is_ancestor() only uses hexadecimal OIDs; there is no good reason to use $unit->{git} for an epoch-local $git->check. This prevents dozens/hundreds of --batch-check processes from being left running after indexing and can improve locality if size checks are being done (since that uses --batch-check, too). Theoretically several epochs may have conflicting OIDs, but we're screwed in those cases, anyways, so we might as well detect it earlier (though I'm not sure what the behavior would be :x).
2020-11-15 *index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-10 searchidx: fix fallback on unindex miss
In case of other bugs or intentional corruption of over.sqlite3, we don't want to attempt dereferencing a non-ref scalar when calling ->mid_delete in the fallback code path. Noticed while chasing another bug in extindex development...
2020-11-08 extindex: SIGUSR1 supports checkpoint
Matching the behavior of git-fast-import(1), we'll allow a user to send SIGUSR1 to checkpoint over.sqlite3 and Xapian.
2020-11-08 v2writable: more accurate {current_info} warnings/progress
With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08 extsearch: canonicalize topdir
This makes `ps' output look a bit nicer if there's trailing slashes involved from the command-line.
2020-11-08 extsearchidx: quiet warning for unindexed `d' messages
"deleted" messages (via -learn <spam|rm>) in the source inboxes are likely to already be unindexed, so avoid triggering needless warnings about the spam message being missing.
2020-11-08 v2writable: less expensive checkpoint for extindex
Since extindex holds no locks on parallel inbox writers, we can simply use "barrier" IPC shard commands to checkpoint and avoid respawning shard or git processes.
2020-11-08 searchidxshard: further improve {current_info} readability
Add a space after \0 to visually disambiguate it from the {bytes} field.
2020-11-08 searchidxshard: reduce syscalls when writing ->eidx_key
We use ->autoflush(1) on this pipe to ensure the shard workers see data immediately on print; so this means we have to do our own buffering for optional data.
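When a pipe is unbuffered, every print becomes its own write syscall, so optional fields are best assembled in memory first. A minimal Python sketch of that idea follows; the record layout (space-separated fields ending in a newline) and the name `send_record` are hypothetical and do not reflect the actual shard IPC format.

```python
import os

def send_record(fd, fields, eidx_key=None):
    """Join required fields and the optional key, then issue a single write."""
    parts = list(fields)
    if eidx_key is not None:
        parts.append(eidx_key)
    payload = (" ".join(parts) + "\n").encode()
    # One syscall for the whole record instead of one per field.
    # Assumes the payload is small enough for the pipe to accept it whole.
    os.write(fd, payload)
    return payload
```

The reader still sees the record immediately, as it would with autoflush, but the writer pays for one syscall whether or not the optional key is present.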
2020-11-08 extsearchidx: avoid needless alternates rewrite in ALL.git
As with fill_alternates in V2Writable, we do not need to update $GIT_DIR/objects/info/alternates if nothing is changed.
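Skipping a rewrite when the content is unchanged can be sketched like this. It is a Python illustration under the assumption of a POSIX filesystem, not the project's code; `write_if_changed` and the `.tmp` suffix are hypothetical.

```python
import os

def write_if_changed(path, new_content):
    """Rewrite path only when its content differs; return True if written."""
    try:
        with open(path, "r") as fh:
            if fh.read() == new_content:
                return False  # nothing changed, leave the file untouched
    except FileNotFoundError:
        pass  # no existing file, so it must be created
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(new_content)
    os.replace(tmp, path)  # atomic rename on POSIX
    return True
```

Leaving an unchanged file alone avoids needless disk writes and keeps its modification time stable for anything that watches it.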
2020-11-08 extsearch: rename -eindex to -extindex
Since "eindex" rhymes with "reindex", which could be confusing, name the command and config prefix "extindex", which is hopefully less confusing.
2020-11-07 searchidxshard: make warnings with eidx_key less confusing
Seeing "Xorg.foo.bar" can be confusing in warnings if the eidx_key is only "org.foo.bar" with no relation to "Xorg" at all. Furthermore, printing "\0" to log or terminal output isn't very nice and could throw off some users/tools.
2020-11-07 extsearchidx: support --batch-size checkpoints
This is needed to limit the RSS of processes and ensure the stored data in over.sqlite3 and Xapian DBs are consistent if interrupted. Without checkpoints, indexing lore causes shard workers to take several GB of memory and thrash/OOM smaller systems.
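Size-based checkpointing can be sketched as a loop that commits whenever a byte budget is reached. This Python sketch is illustrative only; `index_with_checkpoints`, `index_one`, and `checkpoint` are hypothetical names, and the real --batch-size accounting may differ.

```python
def index_with_checkpoints(messages, batch_size, index_one, checkpoint):
    """Index raw messages, committing whenever batch_size bytes accumulate."""
    pending = 0
    for raw in messages:
        index_one(raw)
        pending += len(raw)
        if pending >= batch_size:
            checkpoint()  # commit the stores together and release memory
            pending = 0
    if pending:
        checkpoint()  # flush the final partial batch
```

Bounding the uncommitted bytes bounds the memory held between commits, and an interruption loses at most one batch of work.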
2020-11-07 extsearchidx: set current_info in warning callbacks
This bit is duplicated with per-Inbox indexing in Admin, undecided if it's the right place for it.
2020-11-07 searchidx: ignore exceptions from ->remove_term
This seems necessary for some cross-posted messages (and we did it historically before we used over.sqlite3).
2020-11-07 extsearch: wire up remaining Inbox-like methods for WWW
This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07 extsearch: wire up smsg_eml
We'll probably still need synchronous message retrieval in a few places (tests, at least).
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 over: store xref3 data in over.sqlite3
We may not end up storing xref3 data in Xapian, actually. This will make indexlevel=basic possible, and along with --sequential-shard indexing support for slow storage. Making oidmap a separate table seems unnecessary, too, so fold it into the xref3 table since it's unlikely a git blob will be responsible for multiple xref3 rows.
2020-11-07 searchidx: favor $sync->{ibx} (over $self->{ibx})
In case we want to reuse code with ExtSearchIdx or V2Writable.
2020-11-07 searchidx: reduce inbox-dependency, wrap ->with_umask
This will let us work consistently with both existing inboxes and external indices.
2020-11-07 extsearchidx: sync updates
A couple of more things to prepare us to run syncs on both v1 and v2 inboxes.
2020-11-07 searchidx: export prepare_stack
We'll be needing it in ExtSearchIdx for the next commit.
2020-11-07 extsearchidx: sync unit updates
Now that the V2Writable code is more generic, we can sync with it to use `units' which represent either a v2 epoch or an entire v1 inbox.