path: root/script
Date       Commit message
2024-04-03 treewide: avoid getpid() for OnDestroy checks
getpid() isn't cached by glibc nowadays and system calls are more expensive due to CPU vulnerability mitigations. To ensure we switch to the new semantics properly, introduce a new `on_destroy' function to simplify callers. Furthermore, OnDestroy correctness is often tied to the process which creates it, so make the new API default to being guarded against running in subprocesses. For cases which require running in all children, a new PublicInbox::OnDestroy::all call is provided.
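The subprocess guard described above might be sketched as follows. This is only an illustration under stated assumptions: the `Guard' package, the `$fork_gen' counter and `after_fork_in_child' are hypothetical names (not the PublicInbox::OnDestroy API), and a generation counter bumped in each child is merely one possible way to detect subprocesses without calling getpid().

```perl
use strict;
use warnings;

package Guard; # hypothetical stand-in, not PublicInbox::OnDestroy

our $fork_gen = 0; # assumed to be bumped in every child after fork()

# Run $cb when the guard goes out of scope, but only in the
# process generation which created the guard.
sub new {
	my ($class, $cb, @args) = @_;
	bless { cb => $cb, args => \@args, gen => $fork_gen }, $class;
}

# Variant which also fires in children (no generation recorded).
sub all {
	my ($class, $cb, @args) = @_;
	bless { cb => $cb, args => \@args }, $class;
}

sub DESTROY {
	my ($self) = @_;
	return if defined($self->{gen}) && $self->{gen} != $fork_gen;
	$self->{cb}->(@{$self->{args}});
}

package main;

# hypothetical hook which a fork wrapper would call in the child
sub after_fork_in_child { ++$Guard::fork_gen }

my @fired;
{
	my $g = Guard->new(sub { push @fired, 'parent-only' });
} # same generation: the callback runs here
{
	my $g = Guard->new(sub { push @fired, 'skipped' });
	my $every = Guard->all(sub { push @fired, 'all' });
	after_fork_in_child(); # pretend we are now in a child
} # only the `all' guard fires here
print join(',', @fired), "\n";
```

No real fork() is performed; the counter bump only simulates what a child would see.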
2023-11-29 doc: fix a few typos and wording issues
2023-11-29 doc: -cindex: correct and unify -g GIT_DIR usage string and man page
Fixes: c76a20d75200 ("cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'")
2023-11-29 cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'
Accepting @ARGV without switches ends up being ambiguous with optional parameters for --join and --show. Requiring users to specify `--join=' or `--show=' is a bit awkward (as it is with -clone --objstore= and the like, but that is historical baggage we need to carry at this point...)
2023-11-21 cindex: rename --associate to --join, test w/ real repos
The association data is just stored as deflated JSON in Xapian metadata keys of shard[0] for now. It should be reasonably compact and fit in memory, since we'll assume sane, non-malicious git coderepo history. The new cindex-join.t test requires TEST_REMOTE_JOIN=1 to be set in the environment and tests the joins against the inboxes and coderepos of two small projects with a common history. Internally, we'll use `ibx_off', `root_off' instead of `ibx_id' and `root_id' since `_id' may be mistaken for columns in an SQL database, which they are not.
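The "deflated JSON" encoding can be illustrated with core modules alone. This is a rough sketch: the data layout below is invented for the example, Compress::Zlib's compress/uncompress are assumed stand-ins for whatever deflate call the tool actually uses, and writing the blob into a Xapian metadata key is left out.

```perl
use strict;
use warnings;
use JSON::PP ();
use Compress::Zlib qw(compress uncompress);

# Hypothetical association data: offsets into inbox and root arrays
my $assoc = { ibx_off => [ 0, 2 ], root_off => [ 1 ] };

# Encode with sorted keys, then deflate into a compact byte string
my $json = JSON::PP->new->canonical->encode($assoc);
my $blob = compress($json);
defined($blob) or die 'compress failed';

# Reading it back reverses both steps
my $round = JSON::PP->new->decode(uncompress($blob));
print "ibx_off: @{$round->{ibx_off}}\n";
```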
2023-11-13 cindex: support --associate-aggressive shortcut
This is shorthand for enabling --associate with the most aggressive (and time-consuming) options available, starting from the Unix epoch and having an unlimited window to join on.
2023-11-13 cindex: rename associate-max => window
"window" is probably a better term since it's an inexact thing to match on.
2023-11-13 treewide: update read_all to avoid eof|close checks
read_all can be expanded to support FIFOs/pipes/sockets where read-until-EOF behavior is desired. We can also rely on wantarray to support splitting on EOL markers, but it's hard-coded to support only `$/ eq "\n"' since (AFAIK) it's the only way we use the wantarray form of `readline'.
2023-11-11 mda: fix and test some usage problems
-mda now honors `--help' properly and invocations missing ORIGINAL_RECIPIENT now fail with EX_NOUSER.
Helped-by: Leah Neukirchen <leah@vuxu.org>
Link: https://public-inbox.org/meta/87msvlguqu.fsf@vuxu.org/
2023-11-11 mda|learn|watch: support dropUniqueUnsubscribe config
List-Unsubscribe headers with unique identifiers (such as those generated by our examples/unsubscribe.milter) should not end up in public archives. Add a new config knob to strip List-Unsubscribe headers if they have the `List-Unsubscribe-Post: List-Unsubscribe=One-Click' header. Unfortunately, this breaks DKIM signatures if the signature covers either of these List-Unsubscribe* headers. However, breaking DKIM is the lesser evil compared to any archive reader being able to stop archival by an independent archivist. As much as I would like this to be the default, it probably affects few users at the moment since very few mailing lists use unique identifiers in List-Unsubscribe (but that number has grown, recently).
2023-11-11 learn: fix redundant ham import on dual matches
When learning and injecting new messages as ham, we want to avoid wasting cycles importing the same message into an inbox twice (once for the To/Cc match and once for the List-Id match). Our existing %seen hash turned out to be ineffective since PublicInbox::Inbox refs get re-blessed to PublicInbox::InboxWritable. So we stop letting the class name influence the hash key for tracking by using the reference address instead. We can get the reference address by performing an arithmetic operation (+ 0) instead of having to pay the cost of importing Scalar::Util::refaddr.
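The effect of re-blessing on hash keys can be demonstrated with a short sketch. The class names below are hypothetical placeholders, and the `+ 0' trick assumes the classes do not overload numeric conversion.

```perl
use strict;
use warnings;

my $ibx = bless { name => 'test' }, 'My::Inbox'; # hypothetical classes
my $str_before = "$ibx";    # stringifies as My::Inbox=HASH(0x...)
my $addr_before = $ibx + 0; # numeric reference address

bless $ibx, 'My::InboxWritable'; # re-bless changes only the class name
my $str_after = "$ibx";
my $addr_after = $ibx + 0;

my %seen;
$seen{$addr_before} = 1; # key on the address, not the string form
print "string key changed\n" if $str_before ne $str_after;
print "address key stable\n" if $seen{$addr_after};
```

A %seen hash keyed on the stringified reference would treat the re-blessed object as new, while the numeric key still matches.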
2023-11-03 move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-11-03 treewide: use ->close to call ProcessIO->CLOSE
This will open the door for us to drop `tie' usage from ProcessIO completely in favor of OO method dispatch. While OO method dispatches (e.g. `$fh->close') are slower than normal subroutine calls, it hardly matters in this case since process teardown is a fairly rare operation and we continue to use `close($fh)' for Maildir writes.
2023-10-18 convert: use read_all to simplify error checks
There wasn't a need to loop anyways with Perl `read' since the default PerlIO layer will retry.
2023-10-18 init: use autodie to reduce distractions
This hurts startup time a bit, but our tests use run_script by default and I don't think normal users call -init enough to care.
2023-10-18 init: drop extraneous `+'
It's actually valid Perl syntax, but still confusing to look at.
Fixes: add90b9504f4 ("support -C (chdir) for most non-daemon commands")
2023-10-18 use read_all in more places to improve safety
`readline' ops may not detect errors on partial reads. This saves us some code to reduce cognitive overhead for readers. We'll also support reusing a destination buffer so it can work more nicely with existing code.
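A read-until-EOF helper with explicit error checks and an optional reusable destination buffer might look like the sketch below. The name `read_all_sketch' and its signature are hypothetical; the real read_all in PublicInbox::IO may differ in arguments and behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh until EOF, appending to $$bref (or to
# a fresh buffer), and die on read errors instead of silently
# returning short data.
sub read_all_sketch {
	my ($fh, $bref) = @_;
	$bref //= \(my $buf = '');
	while (1) {
		my $r = read($fh, $$bref, 65536, length($$bref));
		die "read: $!" unless defined $r; # an error, not EOF
		last if $r == 0; # EOF
	}
	$$bref;
}

# An in-memory filehandle stands in for a file, pipe or socket
open(my $fh, '<', \"line 1\nline 2\n") or die "open: $!";
my $dst = 'prefix:'; # reused destination buffer
read_all_sketch($fh, \$dst);
print $dst;
```

Passing a buffer reference lets callers append to existing data; omitting it returns a fresh string.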
2023-10-15 learn: respect indexlevel for v1 inboxes
v2 never suffered from this bug, apparently, but -learn didn't seem able to handle indexlevel=basic (nor respect `medium') for v1 inboxes. I only noticed this bug because I converted some ancient v1 inboxes to `basic' to save space.
2023-10-11 import: switch to Unix stream socket for fast-import
We use fewer file descriptors and fewer lines of code this way. I'm not aware of any place we rely on POSIX pipe semantics with `git fast-import', and sockets have bigger buffers by default in most cases (even if Linux allows larger pipe buffers).
2023-10-11 treewide: consolidate "From " line removal
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a (mda: drop leading "From " lines again, 2016-06-26)), we'll always have to deal with mutt piping messages to us and `git format-patch' output. So just share the regexp so we can use it everywhere. It may even be desirable to allow importing messages with a leading "From " line for FUSE. Additionally, some instances of this regexp needlessly added optional `\r?' (CR) checks ahead of the `\n' (LF) element; but they're pointless anyways since [^\n]* is enough to match all non-LF bytes, including CR.
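The point about `\r?' can be checked with a small sketch. The pattern below is a hypothetical example of such a shared regexp (not necessarily the one in the tree), and the sample messages are made up.

```perl
use strict;
use warnings;

# Hypothetical shared pattern: one leading mbox "From " line up to LF.
# [^\n]* also consumes a trailing CR, so no explicit `\r?' is needed.
my $from_line = qr/\AFrom [^\n]*\n/;

my @samples = (
	"From mutt\@example Mon Jan  1 00:00:00 2001\nSubject: lf\n\nbody\n",
	"From git\@example Mon Jan  1 00:00:00 2001\r\nSubject: crlf\r\n\r\nbody\r\n",
	"Subject: no From line\n\nbody\n",
);
my @firsts; # first remaining line of each message after stripping
for my $msg (@samples) {
	(my $copy = $msg) =~ s/$from_line//;
	my ($first) = split(/\r?\n/, $copy, 2);
	push @firsts, $first;
	print "$first\n";
}
```

Both the LF and CRLF samples lose their "From " line, and the message without one is left untouched.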
2023-10-06 ipc: lower-level send_cmd/recv_cmd handle EINTR directly
This ensures script/lei $send_cmd usage is EINTR-safe (since I prefer to avoid loading PublicInbox::IPC for startup time). Overall, it saves us some code, too.
2023-10-01 treewide: enable warnings in all exec-ed processes
While forked processes inherit from the parent, exec-ed processes need the `-w' flag passed to them. To determine whether or not we should pass them, we must check the `$^W' global perlvar, first. We'll also favor `perl -e' over `perl -E' in places where we don't rely on the latest features, since `-E' incurs slightly more startup time overhead from loading feature.pm (while `perl -Mv5.12' does not).
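Propagating `-w' to an exec-ed Perl child could be done along these lines. This is a minimal sketch; `perl_cmd_sketch' is a hypothetical helper name and the real code may build its command lines differently.

```perl
use strict;
use warnings;

# Build an argv for a child perl, adding `-w' only if the parent has
# global warnings enabled ($^W is set by running `perl -w').
sub perl_cmd_sketch { # hypothetical helper name
	my (@args) = @_;
	($^X, ($^W ? ('-w') : ()), @args);
}

# `-e' rather than `-E', as the one-liner needs no newer features
my @cmd = perl_cmd_sketch('-e', 'print "hello from child\n"');
print "would exec: @cmd\n";
```

The resulting list can be handed to exec or system in list form, avoiding shell quoting issues.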
2023-09-28 convert: use ProcessPipe with popen_rd
ProcessPipe->CLOSE will already run waitpid for us and exit on errors, so we can do less, here.
2023-09-11 treewide: favor Xapian (SWIG binding) over Search::Xapian
The Xapian SWIG bindings are favored by Xapian upstream for ease-of-maintenance compared to the XS version. While Debian lags on this front, the SWIG bindings are widely available on all *BSDs.
2023-09-08 watch: reset HUP + USR1 signal handlers in children
Child processes handling IMAP/NNTP aren't going to want to handle config reloads nor forced rescans; those are exclusively for the parent. We'll leave a note that QUIT/TERM/INT can safely use the same callback for both parent and children, as I nearly made the mistake of resetting those to their default values in the child.
2023-09-08 watch: set %SIG for non-signalfd/kqueue
We need to ensure there isn't a window where we lose $SIG{CHLD} handling. This is the second part in getting t/imapd.t to pass the reload-after-setting-imap.pollInterval test. That said, I'm not entirely happy with the way -watch jumps in and out of the event loop. It's historical baggage from the pre-event_loop days.
2023-09-05 watch: ensure children can use signal handlers
Blindly using the signal set inherited from the parent process is wrong, since the parent (or grandparent) could've blocked all signals. Ensure children can process signals in the event loop when sig handlers have to use standard Perl facilities.
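Clearing an inherited signal mask can be sketched with the core POSIX module. This is an illustration only: the blocking step merely simulates a parent which blocked signals, and the real -watch code may manage its mask differently.

```perl
use strict;
use warnings;
use POSIX qw(sigprocmask SIG_BLOCK SIG_SETMASK SIGUSR1);

# Simulate a parent (or grandparent) which blocked a signal
my $blocked = POSIX::SigSet->new(SIGUSR1);
sigprocmask(SIG_BLOCK, $blocked) or die "sigprocmask: $!";

# What a child could do before entering its event loop: install an
# empty mask, capturing the inherited one in $old for inspection
my $old = POSIX::SigSet->new;
sigprocmask(SIG_SETMASK, POSIX::SigSet->new, $old)
	or die "sigprocmask: $!";
print "USR1 was blocked on entry\n" if $old->ismember(SIGUSR1);

# Query the current mask by blocking nothing
my $cur = POSIX::SigSet->new;
sigprocmask(SIG_BLOCK, POSIX::SigSet->new, $cur)
	or die "sigprocmask: $!";
print "USR1 is deliverable now\n" unless $cur->ismember(SIGUSR1);
```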
2023-08-30 treewide: drop MSG_EOR with AF_UNIX+SOCK_SEQPACKET
It's apparently not needed for AF_UNIX + SOCK_SEQPACKET as our receivers never check for MSG_EOR in "struct msghdr".msg_flags anyways. I don't believe POSIX is clear on the exact semantics of MSG_EOR on this socket type. This works around truncation problems on OpenBSD recvmsg when MSG_EOR is used by the sender.
Link: https://marc.info/?i=20230826020759.M335788@dcvr
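That record boundaries survive without MSG_EOR can be observed with a short sketch. It assumes a platform which supports SOCK_SEQPACKET for AF_UNIX (e.g. Linux and the BSDs) and uses Perl's plain send/recv rather than the sendmsg/recvmsg wrappers the project relies on.

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_SEQPACKET);

socketpair(my $s1, my $s2, AF_UNIX, SOCK_SEQPACKET, 0)
	or die "socketpair: $!";

# flags = 0, i.e. no MSG_EOR; each send is still a separate record
defined(send($s1, 'first', 0)) or die "send: $!";
defined(send($s1, 'second', 0)) or die "send: $!";

my @got;
for (1..2) {
	# each recv returns at most one record, even with a large buffer
	defined(recv($s2, my $buf, 4096, 0)) or die "recv: $!";
	push @got, $buf;
}
print join('|', @got), "\n";
```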
2023-08-28 public-inbox-init: honor umask when creating config file
Creating config 0600 disregarding umask breaks scenarios where daemons run with credentials different from config owner (but need to read the config). File::Temp defaults to 0600, which is unsuitable for the recommended/typical scenario of daemons running unprivileged and with UID different from $PI_CONFIG owner, as the daemons need to read $PI_CONFIG. Respecting umask might end up creating world-unreadable config, too, but for people who use such umask that's expected behavior.
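One way to make a File::Temp file respect the umask is sketched below. The template name is made up for the example, and the actual -init code may take a different route to the same permissions.

```perl
use strict;
use warnings;
use File::Temp ();

my $tmp = File::Temp->new(TEMPLATE => 'config-XXXXXX', TMPDIR => 1);
my $path = $tmp->filename;
my $before = (stat($path))[2] & 07777; # File::Temp creates 0600

# Apply the permissions a plain open with mode 0666 would have given
my $want = 0666 & ~umask();
chmod($want, $path) or die "chmod $path: $!";
my $after = (stat($path))[2] & 07777;
printf "before=%04o after=%04o umask=%04o\n", $before, $after, umask();
```

With a typical umask of 0022 this yields 0644, so an unprivileged daemon under another UID can read the file; a umask of 0077 keeps it at 0600.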
2023-08-28 Fix some typos/grammar/errors in docs and comments
2023-08-24 cindex: add --show-roots switch
This aids in development, but I'm not sure it's going to stay or be moved into another interface.
2023-08-24 cindex: read-only association dump
This will eventually allow associating coderepos with inboxes and vice-versa; avoiding the need for manual configuration via tedious publicinbox.*.coderepo directives. I'm not sure how this should be stored for WWW, yet, but it's required since it takes about 8 hours to do this fully across lore and git.kernel.org.
2023-05-04 xcpdb: support cindex upgrades and resharding
xcpdb is necessary for upgrading Xapian backends (e.g. glass to honey), thus codesearch indices (cindex) must be supported. Resharding is also useful if CPU count is altered on system upgrades or downgrades. cindex Xapian sharding is completely different than anything else we do, so the resharding operation must be a special case based on existing cindex sharding rules.
2023-05-03 compact: support codesearch indices
This is much easier to support than xcpdb since it's 1:1 and doesn't follow a different sharding scheme than the inboxes and extindices.
2023-04-22 cindex: rewrite prune (again) for speed
With my partial git.kernel.org mirror, this brings a full prune down from ~75 minutes to under 5 minutes using git 2.19+. This speedup even applies to users on slow storage (rotational HDD).
First off, xapian-delve(1) is nearly 10x faster for dumping boolean terms by prefix than the equivalent Perl code with Xapian bindings. This performance difference is critical since we need to check over 5 million commits for pruning a partial git.kernel.org mirror.
We can use sed(1) and sort(1) to massage delve output into something suitable for the first comm(1) input. For the second comm(1) input, the output of `git cat-file --batch-check --batch-all-objects' against all indexed git repos with awk(1) filtering provides the necessary output for generating a list of indexed-but-no-longer-accessible commits. sed(1) and awk(1) are POSIX standard tools which can be roughly 2x faster than equivalent Perl for simple filters, while sort(1) is designed to handle larger-than-memory datasets efficiently (unlike the `sort' perlop).
With slow storage and git <2.19, the switch to --batch-all-objects actually results in a performance regression since having git perform sorting results in worse disk locality than the previous sequential iteration by Xapian docid. git 2.19+ users with `--unordered' support benefit from improved storage locality; and speedups from storage locality dwarf the overhead of an extra external sort(1) invocation. Even with consumer-grade SATA-II SSDs, the combo of --unordered and sort(1) provides a noticeable speedup since SSD latency remains a factor for --batch-all-objects.
git <2.19 users must upgrade git to get acceptable performance on slow storage and giant indexes, but git 2.19 was released nearly 5 years ago so it's probably a reasonable requirement for performance. The only remaining downside of this change for all users is the extra temporary disk space for sort(1) and comm(1); but the speedup provided with git 2.19+ is well worth it.
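The comm(1) step boils down to finding entries present in the first sorted input but absent from the second. The sketch below shows that comparison in Perl on tiny in-memory lists purely to illustrate the logic; the commit deliberately uses external tools for speed, and the object IDs and the `only_in_first' name are made up.

```perl
use strict;
use warnings;

# Return the entries of @$x which do not appear in @$y. Both inputs
# must be sorted with the same string ordering, as comm(1) requires.
sub only_in_first { # hypothetical helper name
	my ($x, $y) = @_;
	my ($i, $j, @out) = (0, 0);
	while ($i < @$x) {
		if ($j >= @$y || $x->[$i] lt $y->[$j]) {
			push @out, $x->[$i++]; # indexed, no longer accessible
		} elsif ($x->[$i] eq $y->[$j]) {
			$i++; $j++; # present in both: keep in the index
		} else {
			$j++; # accessible but not indexed: ignore
		}
	}
	@out;
}

my @indexed = sort qw(aaa111 bbb222 ccc333 ddd444); # from the index
my @in_git = sort qw(bbb222 ddd444 eee555); # from the git repos
my @prune = only_in_first(\@indexed, \@in_git);
print "prune: @prune\n";
```

A single pass over both lists suffices because they are pre-sorted, which is why the pipeline runs sort(1) first.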
2023-04-07 umask: rely on the OnDestroy-based call where applicable
This lets us get rid of some awkwardness around the old API and single-use subroutines while saving us some LoC.
2023-04-06 watch: use detect_indexlevel for unconfigured inboxes
I favor leaving the publicinbox.<name>.indexlevel parameter out of config files to make it easier to alter and reduce sources of truth. It worked well in most cases, but public-inbox-watch also needs to detect the indexlevel. Moving the sub to InboxWritable (from Admin) probably makes sense since it's a per-inbox attribute and allows -watch to reuse it.
2023-03-29 cindex: simplify some internal data structures
We'll rely more on local-ized `our' globals rather than hashref fields. The former is more resistant to typos and can be checked at compile-time earlier via `perl -c'. The {-internal} field is also renamed to {-cidx_internal} to reduce confusion within a large code base.
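The typo-resistance argument can be seen in a small sketch. The variable and key names are hypothetical, and a string eval stands in for the compile step which `perl -c' would perform.

```perl
use strict;
use warnings;

# A typo in a package variable fails to compile under strict...
my $ok = eval q{
	use strict;
	our $NPROC = 4; # hypothetical global
	$NPORC = 8; # typo: never declared
	1;
};
my $err = $@;
print "global typo caught: ", ($ok ? 'no' : 'yes'), "\n";

# ...while a typo in a hash key compiles and silently yields undef
my $self = { nproc => 4 };
my $n = $self->{nporc}; # typo goes unnoticed
print "hash typo caught: ", (defined($n) ? 'n/a' : 'no, got undef'), "\n";
```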
2023-03-26 watch: do not recreate signalfd on SIGHUP
The normal method by which PublicInbox::DS::event_loop sets up signals once needs some coercing to work with -watch. Otherwise, we'll end up wasting FDs every time somebody reloads -watch via SIGHUP.
2023-03-25 cindex: ignore SIGPIPE
We check for all socket write errors anyways, and I don't expect stderr output to be significant enough to matter.
2023-03-25 cindex: squelch incompatible options
Some options don't make sense when used together.
2023-03-25 cindex: implement --max-size=SIZE
This matches existing behavior of -index and -extindex, and will hopefully allow me to avoid OOM problems by skipping problematic commits.
2023-03-25 cindex: implement --exclude= like -clone
This is to ensure we can exclude certain repos which are expensive-to-index (e.g. `**/deps.git', `**/transparency-logs/**').
2023-03-25 codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to deduplicate and handle repositories with common history. I initially wanted to shoehorn this into extindex, but decided a separate Xapian index layout capable of being EITHER external to handle many forks or internal (in $GIT_DIR/public-inbox-cindex) for small projects is the right way to go. Unlike most existing parts of public-inbox, this relies on absolute paths of $GIT_DIR stored in the Xapian DB and does not rely on the config file. We'll be relying on the config file to map absolute paths to public URL paths for WWW.
2023-03-25 admin: ensure resolved GIT_DIR is absolute
We'll also support the $base arg of File::Spec->rel2abs since it should make codesearch indexing easier.
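For reference, the `$base' argument of File::Spec->rel2abs behaves as in the sketch below on Unix-like systems. The paths are made-up examples.

```perl
use strict;
use warnings;
use File::Spec;

# With a $base, a relative path resolves against it, not the cwd
my $abs = File::Spec->rel2abs('linux.git', '/srv/git');

# Already-absolute paths ignore $base
my $same = File::Spec->rel2abs('/srv/git/linux.git', '/elsewhere');

print "$abs\n$same\n";
```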
2023-03-18 clone: support --purge to delete remotely-deleted repos
This lets us clean up disk space when repos are removed on the remote side.
2023-03-07 doc: update public-inbox-clone examples and help
Basically, public-inbox-clone has become grok-pull without config files nor absolute paths.
2023-02-21 lei_mirror: support --remote-manifest=URL
Since PublicInbox::WWW already generates manifest.js.gz, I'm using an alternate path with PublicInbox::WwwStatic to host the manifest.js.gz for coderepos at an alternate location. The following snippet lets me host https://yhbt.net/lore/pub/manifest.js.gz for mirrored git repositories, while https://yhbt.net/lore/manifest.js.gz (no `pub') remains for inbox mirroring.
==> sample.psgi <==
	use Plack::Builder; # provides builder/enable/mount
	use PublicInbox::WWW;
	use PublicInbox::WwwStatic;
	my $www = PublicInbox::WWW->new; # use default PI_CONFIG
	my $st = PublicInbox::WwwStatic->new(docroot => '/path/to/code');
	my $www_cb = sub {
		my ($env) = @_;
		# serve the coderepo manifest from the static docroot
		if ($env->{PATH_INFO} eq '/pub/manifest.js.gz') {
			local $env->{PATH_INFO} = '/manifest.js.gz';
			my $res = $st->call($env);
			return $res if $res->[0] != 404;
		}
		$www->call($env); # everything else goes to the inbox WWW
	};
	builder {
		enable 'ReverseProxy';
		enable 'Head';
		mount '/lore' => $www_cb;
	}
2023-01-18 ipc+lei: switch to awaitpid
This avoids awkwardly stuffing an arrayref into callbacks which expect multiple arguments. IPC->awaitpid_init now allows pre-registering callbacks before spawning workers.
2023-01-06 clone: implement --exit-code
Since public-inbox-clone is now useful for incremental updates with manifest, --exit-code belongs here, too.