Date | Commit message (Collapse) |
|
getpid() isn't cached by glibc nowadays and system calls are
more expensive due to CPU vulnerability mitigations. To
ensure we switch to the new semantics properly, introduce
a new `on_destroy' function to simplify callers.
Furthermore, most OnDestroy correctness is often tied to the
process which creates it, so make the new API default to
guarded against running in subprocesses.
For cases which require running in all children, a new
PublicInbox::OnDestroy::all call is provided.
|
|
|
|
Fixes: c76a20d75200 ("cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'")
|
|
Accepting @ARGV without switches ends up being ambiguous with
optional parameters for --join and --show. Requiring users to
specify `--join=' or `--show=' is a bit awkward (as it with
-clone --objstore= and the like, but that is historical baggage
we need to carry at this point...)
|
|
The association data is just stored as deflated JSON in Xapian
metadata keys of shard[0] for now. It should be reasonably
compact and fit in memory for now since we'll assume sane,
non-malicious git coderepo history, for now.
The new cindex-join.t test requires TEST_REMOTE_JOIN=1 to be
set in the environment and tests the joins against the inboxes
and coderepos of two small projects with a common history.
Internally, we'll use `ibx_off', `root_off' instead of `ibx_id'
and `root_id' since `_id' may be mistaken for columns in an SQL
database which they are not.
|
|
This is shorthand to enabling --associate with the most
aggressive (and time-consuming) options available, starting from
the Unix epoch and having an unlimited window to join on.
|
|
"window" is probably a better term since it's an inexact thing
to match on.
|
|
read_all can be expanded to support FIFOs/pipes/sockets where
read-until-EOF behavior is desired. We can also rely on
wantarray to support splitting on EOL markers, but it's
hard-coded to support only `$/ eq "\n"' since (AFAIK)
it's the only way we use the wantarray form `readline'.
|
|
-mda now honors `--help' properly and invocations missing
ORIGINAL_RECIPIENT now fail with EX_NOUSER.
Helped-by: Leah Neukirchen <leah@vuxu.org>
Link: https://public-inbox.org/meta/87msvlguqu.fsf@vuxu.org/
|
|
List-Unsubscribe headers with unique identifiers (such as those
generated by our examples/unsubscribe.milter) should not
end up in public archives. Add a new config knob to strip
List-Unsubscribe headers if they have the
`List-Unsubscribe-Post: List-Unsubscribe=One-Click'
header.
Unfortunately, this breaks DKIM signatures if the signature
covers either of these List-Unsubscribe* headers. However,
breaking DKIM is the lesser evil compared to any archive reader
being able to stop archival by an independent archivist.
As much as I would like this to be the default, it probably
affects few users at the moment since very few mailing lists
use unique identifiers in List-Unsubscribe (but that number
has grown, recently).
|
|
When learning and injecting new messages ham, we want to avoid
wasting cycles importing the same message into an inbox twice
(once for the To/Cc match and once for the List-Id match). Our
existing %seen hash turned out to be ineffective since
PublicInbox::Inbox refs get re-blessed to PublicInbox::InboxWritable.
So we stop letting class name influence the hash key for tracking by
using the reference address instead. We can get the reference address
by performing an arithmetic operation (+ 0) instead of having to
pay the cost of importing Scalar::Util::refaddr.
|
|
The IO package seems like a better home for I/O subs than the
Git package. We lose the 60 second read timeout for `git
cat-file --batch-*' processes since it's probably not necessary
given how reliable the code has proven and things would fall
over hard in other ways if the storage device were completely
hosed.
|
|
This will open the door for us to drop `tie' usage from
ProcessIO completely in favor of OO method dispatch. While
OO method dispatches (e.g. `$fh->close') are slower than normal
subroutine calls, it hardly matters in this case since process
teardown is a fairly rare operation and we continue to use
`close($fh)' for Maildir writes.
|
|
There wasn't a need to loop anyways with Perl `read' since
the default PerlIO layer will retry.
|
|
This hurts startup time a bit, but our tests use run_script
by default and I don't think normal users call -init enough
to care.
|
|
It's actually valid Perl syntax, but still confusing to look at.
Fixes: add90b9504f4 ("support -C (chdir) for most non-daemon commands")
|
|
`readline' ops may not detect errors on partial reads.
This saves us some code to reduce cognitive overhead for
readers. We'll also support reusing a destination buffers so it
can work more nicely with existing code.
|
|
v2 never suffered from this bug, apparently, but -learn didn't
seem able to handle indexlevel=basic (nor respect `medium')
for v1 inboxes. I only noticed this bug because I converted
some ancient v1 inboxes to `basic' to save space.
|
|
We use fewer file descriptors and fewer lines of code this way.
I'm not aware of any place we rely on POSIX pipe semantics with
`git fast-import', and sockets have bigger buffers by default
in most cases (even if Linux allows larger pipe buffers).
|
|
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a
(mda: drop leading "From " lines again, 2016-06-26)), we'll
always have to be dealing with mutt piping messages to us and
`git format-patch' output. So just share the regexp so we
can use it everywhere.
In may be desirable to allow importing messages with a leading
"From " line for FUSE, even.
Additionally, some instances of this regexp needlessly added
optional `\r?' (CR) checks ahead of the `\n' (LF) element; but
they're pointless anyways since [^\n]* is enough to exclude all
non-LF bytes.
|
|
This ensures script/lei $send_cmd usage is EINTR-safe (since
I prefer to avoid loading PublicInbox::IPC for startup time).
Overall, it saves us some code, too.
|
|
While forked processes inherit from the parent, exec-ed
processes need the `-w' flag passed to them. To determine
whether or not we should pass them, we must check the `$^W'
global perlvar, first.
We'll also favor `perl -e' over `perl -E' in places where
we don't rely on the latest features, since `-E' incurs
slightly more startup time overhead from loading feature.pm
(while `perl -Mv5.12' does not).
|
|
ProcessPipe->CLOSE will already run waitpid for us and
exit on errors, so we can do less, here.
|
|
The Xapian SWIG bindings are favored by Xapian upstream for
ease-of-maintenance compared to the XS version. While Debian
lags on this front, the SWIG bindings are widely available
on all *BSDs.
|
|
Child processes handling IMAP/NNTP aren't going to want
to handle config reloads nor forced rescans, those are
exclusively for the parent. We'll leave a note that
QUIT/TERM/INT can safely use the same callback for both
parent and children, as I nearly made the mistake of
resetting those to their default values in the child.
|
|
We need to ensure there isn't a window where we lose $SIG{CHLD}
handling. This is the second part in getting t/imapd.t to pass
the reload-after-setting-imap.pollInterval test
That said, I'm not entirely happy with the way -watch jumps
in and out of the event loop. It's historical baggage from
the pre-event_loop days.
|
|
Blindly using the signal set inherited from the parent process
is wrong, since the parent (or grandparent) could've blocked all
signals. Ensure children can process signals in the event loop
when sig handlers have to use standard Perl facilities.
|
|
It's apparently not needed for AF_UNIX + SOCK_SEQPACKET as our
receivers never check for MSG_EOR in "struct msghdr".msg_flags
anyways. I don't believe POSIX is clear on the exact semantics
of MSG_EOR on this socket type. This works around truncation
problems on OpenBSD recvmsg when MSG_EOR is used by the sender.
Link: https://marc.info/?i=20230826020759.M335788@dcvr
|
|
Creating config 0600 disregarding umask breaks scenarios where daemons
run with credentials different from config owner (but need to read the
config).
File::Temp defaults to 0600, which is unsuitable for the
recommended/typical scenario of daemons running unprivileged and with
UID different from $PI_CONFIG owner, as the deamons need to read
$PI_CONFIG.
Respecting umask might end up creating world-unreadable config, too,
but for people who use such umask that's expected behavior.
|
|
|
|
This aids in development, but I'm not sure it's going to stay
or be moved into another interface.
|
|
This will eventually allow associating coderepos with inboxes
and vice-versa; avoiding the need for manual configuration via
tedious publicinbox.*.coderepo directives.
I'm not sure how this should be stored for WWW, yet, but it's
required since it takes about 8 hours to do this fully across
lore and git.kernel.org.
|
|
xcpdb is necessary for upgrading Xapian backends (e.g. glass to
honey), thus codesearch indices (cindex) must be supported.
Resharding is also useful if CPU count is altered on system
upgrades or downgrades.
cindex Xapian sharding is completely different than anything
else we do, so the resharding operation must be a special case
based on existing cindex sharding rules.
|
|
This is much easier to support than xcpdb since it's 1:1 and
doesn't follow a different sharding scheme than the inboxes and
extindices.
|
|
With my partial git.kernel.org mirror, this brings a full prune
down from ~75 minutes to under 5 minutes using git 2.19+. This
speedup even applies to users on slow storage (rotational HDD).
First off, xapian-delve(1) is nearly 10x faster for dumping
boolean terms by prefix than the equivalent Perl code with
Xapian bindings. This performance difference is critical since
we need to check over 5 million commits for pruning a partial
git.kernel.org mirror.
We can use sed(1) and sort(1) to massage delve output into
something suitable for the first comm(1) input.
For the second comm(1) input, the output of `git cat-file
--batch-check --batch-all-objects' against all indexed git repos
with awk(1) filtering provides the necessary output for
generating a list of indexed-but-no-longer accessible commits.
sed(1) and awk(1) are POSIX standard tools which can be roughly
2x faster than equivalent Perl for simple filters, while
sort(1) is designed to handle larger-than-memory datasets
efficiently (unlike the `sort' perlop).
With slow storage and git <2.19, the switch to --batch-all-objects
actually results in a performance regression since having git
perform sorting results in worse disk locality than the previous
sequential iteration by Xapian docid. git 2.19+ users with
`--unordered' support benefits from improved storage locality;
and speedups from storage locality dwarfs the extra overhead of
an extra external sort(1) invocation.
Even with consumer-grade SATA-II SSDs, the combo of --unordered
and sort(1) provides a noticeable speedup since SSD latency
remains a factor for --batch-all-objects.
git <2.19 users must upgrade git to get acceptable performance
on slow storage and giant indexes, but git 2.19 was released
nearly 5 years ago so it's probably a reasonable requirement for
performance.
The only remaining downside of this change for all users
the extra temporary disk space for sort(1) and comm(1);
but the speedup provided with git 2.19+ is well worth it.
|
|
This lets us get rid of some awkwardness around the old API
and single-use subroutines while saving us some LoC.
|
|
I favor leaving the publicinbox.<name>.indexlevel parameter
out of config files to make it easier to alter and reduce
sources of truth. It worked well in most cases, but
public-inbox-watch also needs to detect the indexlevel.
Moving the sub to InboxWritable (from Admin) probably makes
sense since it's a per-inbox attribute and allows -watch
to reuse it.
|
|
We'll rely more on local-ized `our' globals rather than
hashref fields. The former is more resistant to typos
and can be checked at compile-time earlier via `perl -c'.
The {-internal} field is also renamed to {-cidx_internal}
in case to reduce confusion within a large code base.
|
|
The normal method by which PublicInbox::DS::event_loop sets up
signals once needs some coercing to work with -watch.
Otherwise, we'll end up wasting FDs every time somebody reloads
-watch via SIGHUP.
|
|
We check for all socket write errors anyways, and I don't expect
stderr output to be significant enough to matter.
|
|
Some options don't make sense when used together.
|
|
This matches existing behavior of -index and -extindex, and
will hopefully allow me to avoid OOM problems by skipping
problematic commits.
|
|
This is to ensure we can exclude certain repos which are
expensive-to-index (e.g. `**/deps.git', `**/transparency-logs/**').
|
|
It seems relying on root commits is a reasonable way to
deduplicate and handle repositories with common history.
I initially wanted to shoehorn this into extindex, but decided a
separate Xapian index layout capable of being EITHER external to
handle many forks or internal (in $GIT_DIR/public-inbox-cindex)
for small projects is the right way to go.
Unlike most existing parts of public-inbox, this relies on
absolute paths of $GIT_DIR stored in the Xapian DB and does not
rely on the config file. We'll be relying on the config file to
map absolute paths to public URL paths for WWW.
|
|
We'll also support the $base arg of File::Spec->rel2abs
since it should make codesearch indexing easier.
|
|
This lets us clean up disk space when repos are removed
on the remote side.
|
|
Basically, public-inbox-clone has become grok-pull without
config files nor absolute paths.
|
|
Since PublicInbox::WWW already generates manifest.js.gz, I'm
using an alternate path with PublicInbox::WwwStatic to host the
manifest.js.gz for coderepos at an alternate location. The
following snippet lets me host
https://yhbt.net/lore/pub/manifest.js.gz for mirrored git
repositories, while https://yhbt.net/lore/manifest.js.gz
(no `pub') remains for inbox mirroring.
==> sample.psgi <==
use PublicInbox::WWW;
use PublicInbox::WwwStatic;
my $www = PublicInbox::WWW->new; # use default PI_CONFIG
my $st = PublicInbox::WwwStatic->new(docroot => '/path/to/code');
my $www_cb = sub {
my ($env) = @_;
if ($env->{PATH_INFO} eq '/pub/manifest.js.gz') {
local $env->{PATH_INFO} = '/manifest.js.gz';
my $res = $st->call($env);
return $res if $res->[0] != 404;
}
$www->call($env);
};
builder {
enable 'ReverseProxy';
enable 'Head';
mount '/lore' => $www_cb;
}
|
|
This avoids awkwardly stuffing an arrayref into callbacks
which expect multiple arguments. IPC->awaitpid_init now
allows pre-registering callbacks before spawning workers.
|
|
Since public-inbox-clone is now useful for incremental updates
with manifest, --exit-code belongs here, too.
|