Date | Commit message |
|
For archivists with only newer mail archives, this option allows
reserving NNTP article numbers for yet-to-be-archived
old messages. Indexers will need to be updated to support this
feature in future commits.
-V1 inboxes will now be initialized with SQLite and Xapian
support if this option is used, or if --indexlevel= is
specified.
|
|
Since V2 uses multiple git repositories, stop using
the word "repo" when referring to inboxes.
|
|
On a powerful (by my standards) machine with 16GB RAM and a
7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.
Indexing started off fast, but got progressively slower as the
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel. Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.
With a single shard, it still took around 3.5 days to index on
the HDD. That's not good, but it's far better than not
finishing after 7 days. So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.
For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size. In the past,
a higher-end consumer-grade MLC SSD on similar hardware indexed
a similarly-sized data set in ~4 hours.
|
|
This will be used to prevent reloading a giant config with
tens/hundreds of thousands of inboxes from blocking the event
loop.
|
|
It shares a bit of code with NNTP. It's copy+pasted for now
since this provides new ground to experiment with APIs for
dealing with slow storage and many inboxes.
|
|
InboxWritable should only set $v2w->{parallel} if the $parallel
flag is defined to 0 or 1. We want indexing a new inbox to
utilize SMP, just like --reindex.
-index once again allows -j0/--jobs=0 to force single-process
use, and we'll be ensuring that works in tests to maintain
performance on small systems.
Fixes: 61a2fff5b34a3e32 ("admin: move index_inbox over")
|
|
I found myself wanting to remove a message from all inboxes
while working on a test case in another branch. I figure this
could also be useful for globally removing messages which are in
the grey area or too big for spamc.
|
|
There is obviously a typo here, so fix it and add a test
case to guard against future regressions.
Fixes: 74a3206babe0572a ("mda: support multiple List-ID matches")
|
|
Upon rereading the code, it wasn't immediately obvious to
me why we didn't check for errors with `close($w)' instead
of relying on `undef'. So add a comment for the benefit of
future readers.
|
|
In our inbox-writing code paths, ->getline as an OO method may
be confused with the various definitions of `getline' used by
the PSGI interface. It's also easier to do: "perldoc -f readline"
than to figure out which class "->getline" belongs to (IO::Handle)
and look up documentation for that.
->print is less confusing than the "readline" vs "getline"
mismatch, but we can still make it clear we're using a real
file handle and not a mock interface.
Finally, functions are a bit faster than their OO counterparts.
|
|
On powerful systems, having this option is preferable to
XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
with other processes (-learn, -mda, -watch).
Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
-watch to get stuck until an epoch is completely processed.
|
|
The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
|
|
PublicInbox::Eml has enough functionality to replace the
Email::MIME-based PublicInbox::MIME.
|
|
This allows us to simplify some of our existing code and make
future changes easier.
I doubt anybody goes through the trouble to have a Perl
installation without zlib support. The zlib source code is even
bundled with Perl since 5.9.3 for systems without existing zlib
development headers and libraries.
Of course, zlib is a requirement of git, too; and we're not
going to stop using git :)
[squashed: "wwwaltid: use gzipfilter up front"]
|
|
In normal mail paths, we can rely on MTAs being configured with
reasonable limits in the -watch and -mda mail injection paths.
However, since the MTA is bypassed in a git-only delivery path,
a BOFH could inject a large message and DoS users attempting to
mirror a public-inbox.
This doesn't protect unindexed WWW interfaces from Email::MIME
memory explosions on v1 inboxes. Probably nobody cares about
unindexed WWW interfaces anymore, especially now that Xapian is
optional for indexing.
|
|
It hasn't been needed since commit 089cca37fa036411
("config: ignore missing config files"). And we
actually want to propagate errors when we can't
start new processes or if git(1) is missing.
|
|
I did not know to use the return value of `do' back in the day.
There's probably no practical difference in these cases, but
`eval' is overkill for these uses and may hide actual errors.
We can get rid of a few redundant `scalar' ops and pass scalar
refs to Email::MIME->new to avoid copies in a few more places,
too.
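As a rough illustration of the `do' point above (not code from this
commit): a `do' block yields the value of its last statement, so it
can scope `local $/' without `eval' swallowing real errors. The
helper name slurp() and the scratch file are assumptions made up for
this sketch.

```perl
use strict;
use warnings;
use File::Temp ();

# slurp() is a hypothetical helper, not a public-inbox function.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "open $path: $!";
    # `do' returns its last statement, so `local $/' (undef means
    # "read everything") stays scoped to the block without `eval'.
    my $str = do { local $/; <$fh> };
    close($fh) or die "close $path: $!";
    return \$str; # scalar ref, so callers can avoid another copy
}

# Write a small scratch file, then read it back in one go.
my $tmp = File::Temp->new;
print $tmp "a\nb\n" or die "print: $!";
close($tmp) or die "close: $!";

my $ref = slurp($tmp->filename);
printf "read %d bytes\n", length($$ref);
```

Unlike `eval', a failure inside the `do' block propagates to the
caller instead of being quietly stored in $@.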
|
|
There's nothing Maildir-specific about the function, so
`maildir_path_load' was a bad name. So give it a more
appropriate name and use it in our tests.
This saves us some code and inconsistency by reusing an
existing internal library routine in more places. We can drop
the "From_" line in some of our (formerly) mbox sample files.
|
|
It's more convenient to specify `-c' / `--compact' on the
command-line when reindexing than it is to invoke
public-inbox-compact(1) separately.
This is especially convenient in low-space situations when
public-inbox-index is operating on multiple inboxes
sequentially, as compaction can happen immediately after
indexing each inbox, instead of waiting until all inboxes are
indexed.
|
|
Since v2 inboxes contain multiple git repositories, avoid the
use of the word "repository" when referring to inboxes as a
whole in most places.
|
|
We don't want to blow up users' storage too badly when converting
v1 to v2 or break because they don't have Xapian bindings installed.
|
|
I didn't wait until September to do it, this year!
|
|
The (currently undocumented) "--no-index" flag did not trigger
the V2Writable->done call necessary to make the import
successful.
Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
|
|
Relying on implicit "@_" for shift fails with
TestCommon::_run_sub iff GetOptions modifies @ARGV.
|
|
Looking at git history, they were never used.
|
|
If we're reusing the msgmap from a v1 inbox, we also need to
ensure the highwater mark doesn't get doubled in the v1->v2
conversion by internally triggering the equivalent of
"--reindex" on a fresh v2 inbox.
This was needed to convert an indexed v1 inbox which featured
messages with multiple Message-IDs in it. Fresh, unindexed
clones of v1 inboxes would not have been affected by this.
|
|
We already load PublicInbox::Import via
PublicInbox::InboxWritable, so it's not an extra module
to load. This can give us a slight speedup in tests.
|
|
This allows us to simplify version checking by avoiding
"//" or "||" operators sprinkled around.
|
|
Some users just want to run -mda, -watch, and/or -nntpd.
Let them run just those without forcing them to pull in a
bunch of dependencies.
|
|
There's a bunch of leftover "require" and "use" statements we no
longer need and can get rid of, along with some excessive
imports via "use".
IO::Handle usage isn't always obvious, so add comments
describing why a package loads it. Along the same lines,
document the tmpdir support as the reason we depend on
File::Temp 0.19, even though every Perl 5.10.1+ user has it.
While we're at it, favor "use" over "require", since it gives
us extra compile-time checking.
|
|
And update callers to use it, as it makes the code a bit cleaner.
Probably irrelevant, but it should be faster, too, as
"perl -I lib -w -MO=Deparse $FILE" shows REJECT() calls are
constant-folded.
|
|
We can use "use" to get the namespace into the "BEGIN" phase of
the interpreter. While we're at it, use \&coderef syntax
explicitly instead of globbing everything.
|
|
This is distributed with Perl 5.10.1 and onwards, so it should
not be an installation burden for any users. I'm planning to
move away from tempdir() entirely and use File::Temp->newdir to
remove dependencies on END{} blocks.
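A minimal sketch of the File::Temp->newdir behavior mentioned above,
assuming only the documented File::Temp API; the template name
`demo-XXXXXX' is made up for illustration. The directory is tied to
the object's lifetime, so no END{} block is involved.

```perl
use strict;
use warnings;
use File::Temp 0.19 ();

my $path;
{
    # The object owns the directory for as long as it is in scope.
    my $tmp = File::Temp->newdir('demo-XXXXXX', TMPDIR => 1);
    $path = $tmp->dirname;
    -d $path or die "temporary directory was not created";
}
# $tmp went out of scope here, so its DESTROY removed the directory.
my $state = (-d $path) ? 'still present' : 'removed';
print "$path: $state\n";
```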
|
|
We've been using this in -edit, and will be using it in some
more scripts and tests to optimize for run_mode=2 with
run_script.
Keeping this in the *Writable modules since I don't see it being
useful for the WWW and NNTP read-only interfaces which use
PublicInbox::Inbox.
|
|
Avoid 'Variable "%s" will not stay shared' warnings
when the contents of this script are eval'ed into a sub.
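For readers unfamiliar with the warning, here is a small reproduction
under stated assumptions: the script body, the package name Demo and
the sub names run/show are all invented for this sketch, and the
wrapping only approximates what a test harness does when it evals a
script into a named sub.

```perl
use strict;
use warnings;

my @warned;
$SIG{__WARN__} = sub { push @warned, $_[0] };

# Stand-in for a script: a file-scoped lexical read by a named sub.
my $body = <<'EOS';
my $opt = shift;
sub show { return $opt }
return show();
EOS

# Wrapping the body in a named sub turns `show' into a nested named
# sub, which only captures the first instance of $opt.
eval "package Demo;\nsub run {\n$body}\n1;\n" or die $@;

print "warning: $_" for @warned;
```

One common way to avoid the warning (not necessarily what this commit
does) is to turn the inner named sub into an anonymous sub stored in
a lexical, or to pass the value as an explicit argument.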
|
|
Avoid 'Variable "%s" will not stay shared' warnings
when the contents of this script are eval'ed into a sub.
|
|
Avoid 'Variable "%s" will not stay shared' warnings
when the contents of this script are eval'ed into a sub.
We also need to rely on ->DESTROY instead of END{}
to unlink the lock file on sub exit.
|
|
Avoid 'Variable "%s" will not stay shared' warnings
when the contents of this script are eval'ed into a sub.
|
|
PublicInbox::Admin::config() just adds an extra layer of
indirection which we barely rely on. So get rid of this
global variable and make it easier to run tests in the
future without relying on global state.
|
|
Instead of relying on END{} blocks, rely on ->DESTROY
so the temporary files go out-of-scope and system
resources get released, sooner.
|
|
Avoid 'Variable "%s" will not stay shared' warnings
when the contents of this script are eval'ed into a sub.
|
|
We only need to parse the command-line once.
|
|
InboxWritable caching the result of ->importer leads to a
circular reference, since the returned (V2Writable|Import) object
holds onto the calling InboxWritable object.
With public-inbox-watch, this leads to a memory leak if a user
is reloading via SIGHUP after a message is imported (it would
only become noticeable with SIGHUPs after every message imported).
I would not expect anybody to notice this in real-world
usage. I only noticed this since I was making -xcpdb suitable
for long-lived process use (e.g. "mod_perl style") and a flock
remained unreleased on v1 inboxes after resharding.
WatchMaildir (used by -watch) already handles caching of the
importer object itself, and all of our other real-world uses of
->importer are short-lived or designed for batch scripts, so
there's no need to cache the importer result internally.
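The cycle described above can be sketched with two toy classes.
Demo::Inbox and Demo::Importer are stand-ins invented for this
example, not the real InboxWritable or V2Writable/Import code; only
the shape of the references is meant to match.

```perl
use strict;
use warnings;

our $destroyed = 0;

package Demo::Importer; # stand-in for V2Writable/Import
sub new { my ($class, $ibx) = @_; return bless { ibx => $ibx }, $class }
sub DESTROY { $main::destroyed++ }

package Demo::Inbox; # stand-in for InboxWritable
sub new { return bless {}, shift }

# Caching variant: inbox -> importer -> inbox forms a cycle.
sub importer_cached {
    my ($self) = @_;
    return $self->{im} ||= Demo::Importer->new($self);
}

# Uncached variant: only the importer points at the inbox.
sub importer { return Demo::Importer->new($_[0]) }

package main;

{ my $ibx = Demo::Inbox->new; $ibx->importer_cached; }
my $after_cached = $destroyed;   # cycle keeps the importer alive

{ my $ibx = Demo::Inbox->new; my $im = $ibx->importer; }
my $after_uncached = $destroyed; # importer freed at end of scope

print "cached: $after_cached, uncached: $after_uncached\n";
```

With reference counting, the cached pair is only torn down at
interpreter exit, which is also why a lock held by such an object
would stay held in a long-lived process.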
|
|
We need to check every print to a regular file for errors,
because storage devices inevitably fail.
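A minimal sketch of what checking every write can look like; the
helper write_checked() is hypothetical and not taken from the commit.
Checking `close' matters as well, since buffered data may only hit
the device at that point.

```perl
use strict;
use warnings;
use File::Temp ();

# write_checked() is a hypothetical helper for illustration only.
sub write_checked {
    my ($path, $data) = @_;
    open(my $fh, '>', $path) or die "open $path: $!";
    print $fh $data or die "print $path: $!";
    # Buffered write errors (e.g. ENOSPC) may only surface on close.
    close($fh) or die "close $path: $!";
    return -s $path; # size on disk after a successful write
}

my $tmp = File::Temp->new; # scratch file, removed automatically
my $size = write_checked($tmp->filename, "hello\n");
print "wrote $size bytes\n";
```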
|
|
exit($?) is never correct, since ($? >> 8) is needed to extract
the correct exit code, as other information (e.g. the terminating
signal) is encoded in $? in addition to the exit code.
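To make the encoding concrete, here is a short sketch following the
layout of $? documented in perlvar. The helper decode_status() is an
assumption introduced for this example and is not part of
public-inbox.

```perl
use strict;
use warnings;

# decode_status() is a hypothetical helper for illustration only.
sub decode_status {
    my ($status) = @_;
    return {
        code   => $status >> 8,             # value passed to exit()
        signal => $status & 127,            # terminating signal, 0 if none
        core   => ($status & 128) ? 1 : 0,  # core dump flag
    };
}

# Run a child that exits with code 3; $^X is the running perl binary.
system($^X, '-e', 'exit 3');
my $raw = $?;
my $st = decode_status($raw);
print "raw=$raw code=$st->{code} signal=$st->{signal}\n";

# Propagating the child's result correctly would then be:
#   exit($st->{code});
```

Passing the raw status to exit() loses information because only the
low 8 bits of an exit code survive on POSIX systems.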
|
|
While it's not RFC2919-conformant, mail software can
theoretically set multiple List-ID headers. Deliver to all
inboxes which match a given List-ID since that's likely the
intent.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Link: https://public-inbox.org/meta/87pniltscf.fsf@x220.int.ebiederm.org/
|
|
Multiple List-ID headers will be supported in the next commit.
|
|
And use it for mda, since "0" could be a usable directory
if somebody insists on using relative paths...
|
|
We don't want to waste cycles parsing the message for MIME bits
if it's spam.
|
|
It makes it easier to document that the default -mda behavior is
stricter than normal, including "public-inbox-learn ham".
|