Date | Commit message (Collapse) |
|
This to avoid user error of a currently undocumented switch;
since --xapian-only always goes through the full history at
the moment.
|
|
Eventually, commonly-used commands run by the user will all
support --help / -? for user-friendliness. The changes from
up-front `use' to lazy `require' speed up `--help' by 3x or so.
|
|
If XAPIAN_FLUSH_THRESHOLD is unset, Xapian will default to
10000. That limits the effectiveness of users specifying
extremely large values of --batch-size.
While we're at it, localize the changes to globals since -index
may be eval-ed in tests (and perhaps production code in the
future).
|
|
Since the --compact switch works on Xapian shards,
it makes sense that --sequential-shard affects our
usage of xapian-compact(1).
|
|
We'll continue supporting `--no-sync' even if its yet-to-make it
it into a release, but the term `sync' is overloaded in our
codebase which may be confusing to new hackers and users.
None of our our code nor dependencies issue the sync(2) syscall,
either, only fsync(2) and fdatasync(2).
|
|
This is useful for speeding up indexing runs when only Xapian
rules change but SQLite indexing doesn't change. This mostly
implies `--reindex', but does NOT pick up new messages (because
SQLite indexing needs to occur for that).
I'm leaving this undocumented in the manpage for now since it's
mainly to speed up development and testing. Users upgrading to
1.6.0 will be advised to `--reindex --rethread', anyways, due to
the threading improvements since 1.1.0-pre1.
It may make sense to document for 1.7+ when there's Xapian-only
indexing changes, though.
|
|
This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.
Instead of doing a single-pass to index both SQLite and Xapian;
this indexes them separately. The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.
Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard. Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.
Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
|
|
Thanks to the GCC compile farm project, we can wire up syscalls
for sparc64 and set system-specific SFD_* constants properly.
I've FINALLY figured out how to use POSIX::SigSet to generate
a usable buffer for the syscall perlfunc. This is required
for endian-neutral behavior and relevant to sparc64, at least.
There's no need for signalfd-related stuff to be constants,
either. signalfd initialization is never a hot path and a stub
subroutine for constants uses several KB of memory in the
interpreter.
We'll drop the needless SEEK_CUR import while we're importing
O_NONBLOCK, too.
|
|
We used ->header_obj in the past as an optimization with
Email::MIME. That optimization is no longer necessary
with PublicInbox::Eml.
This doesn't make any functional difference even if we were to
go back to Email::MIME. However, it reduces the amount of code
we have and slightly reduces allocations with PublicInbox::Eml.
|
|
This is more accurate given we use PublicInbox::Eml instead
of Email::MIME/PublicInbox::MIME, nowadays.
|
|
Tests for failures should not leave junk temporary files lying
around in a users' ~/.public-inbox/.
On a side note, I'm not sure if PI_DIR is or was ever
necessary. It's never been documented, so perhaps
using $HOME for this is better...
|
|
And -compact supports --jobs=0 like -index to disable parallel
execution. Running three xapian-compact processes in parallel
on a USB 2.0 HDD is pretty painful.
|
|
This allows us to speed up indexing operations to SQLite
and Xapian.
Unfortunately, it doesn't affect operations using
`xapian-compact' and the compactor API, since that doesn't seem
to support Xapian::DB_NO_SYNC, yet.
|
|
Older versions of public-inbox < 1.3.0 had subtly
different semantics around threading in some corner
cases. This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
|
|
"\n" and other characters requiring quoting and/or escaping in
in $GIT_DIR/objects/info/alternates was not supported in git 2.11
and earlier; nor does it seem supported at all in libgit2.
This will allow us to support sharing git-cat-file or similar
endpoints across multiple inboxes via alternates.
This breaks an existing use case for anybody wacky
enough to put `\n' in the `inboxdir' pathname; but I doubt
this affects anybody.
|
|
Instead of gzipping some (mbox.gz, manifest.js.gz) responses and
leaving P::M::D to do the rest, we gzip everything ourselves,
now, so P::M::D is redundant.
|
|
In case output is redirected to a pipe, ensure stdout and stderr
are always unbuffered, as -watch may go long periods without
any output to fill up buffers.
|
|
We can avoid synchronous `waitpid(-1, 0)' and save a process
when simultaneously watching Maildirs.
One DS bug is fixed: ->Reset needs to clear the DS $in_loop flag
in forked children so dwaitpid() fails and allows git processes
to be reaped synchronously. TestCommon also calls DS->Reset
when spawning new processes, since t/imapd.t uses DS->EventLoop
while waiting on -watch to write.
|
|
We can get rid of the janky wannabe
self-using-a-directory-instead-of-pipe thing we needed to
workaround Filesys::Notify::Simple being blocking.
For existing Maildir users, this should be more robust and
immune to missed wakeups for signalfd and kqueue-enabled
systems; as well as being immune to BOFHs clearing $TMPDIR
and preventing notifications from firing.
The IMAP IDLE code still uses normal Perl signals, so it's still
vulnerable to missed wakeups. That will be addressed in future
commits.
|
|
Since we already use inotify and EVFILT_VNODE (kqueue)
in -imapd, we might as well use them directly in -watch,
too.
This will allow public-inbox-watch to use PublicInbox::DS
for timers to watch newsgroups/mailboxes and have saner
signal handling in future commits.
|
|
For archivists with only newer mail archives, this option allows
reserving reserve NNTP article numbers for yet-to-be-archived
old messages. Indexers will need to be updated to support this
feature in future commits.
-V1 inboxes will now be initialized with SQLite and Xapian
support if this option is used, or if --indexlevel= is
specified.
|
|
Since V2 uses multiple git repositories, stop using
the word "repo" when referring to inboxes.
|
|
On a powerful (by my standards) machine with 16GB RAM and an
7200 RPM HDD marketed for "enterprise" use, indexing a 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.
Indexing starts off fast, but progressively get slower as
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel. Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.
With a single shard, it still took around 3.5 days to index on
the HDD. That's not good, but it's far better than not
finishing after 7 days. So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.
For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size. In the past,
a higher-end consumer grade MLC SSDs on similar hardware indexed
a similarly sized-data set in ~4 hours.
|
|
This will be used to prevent reloading a giant config with
tens/hundreds of thousands of inboxes from blocking the event
loop.
|
|
It shares a bit of code with NNTP. It's copy+pasted for now
since this provides new ground to experiment with APIs for
dealing with slow storage and many inboxes.
|
|
InboxWritable should only set $v2w->{parallel} if the $parallel
flag is defined to 0 or 1. We want indexing a new inbox to
utilize SMP, just like --reindex.
-index once again allows -j0/--jobs=0 to force single-process
use, and we'll be ensuring that works in tests to maintain
performance on small systems.
Fixes: 61a2fff5b34a3e32 ("admin: move index_inbox over")
|
|
I found myself wanting to remove a message from all inboxes
while working on a test case in another branch. I figure this
could also be useful for globally removing messages which are in
the grey area or too big for spamc.
|
|
There is obviously a typo here, so fix it and add a test
case to guard against future regressions.
Fixes: 74a3206babe0572a ("mda: support multiple List-ID matches")
|
|
Upon rereading the code, it wasn't immediately obvious to
me why we didn't check for errors with `close($w)' instead
of relying on `undef'. So add a comment for the benefit of
future readers.
|
|
In our inbox-writing code paths, ->getline as an OO method may
be confused with the various definitions of `getline' used by
the PSGI interface. It's also easier to do: "perldoc -f readline"
than to figure out which class "->getline" belongs to (IO::Handle)
and lookup documentation for that.
->print is less confusing than the "readline" vs "getline"
mismatch, but we can still make it clear we're using a real
file handle and not a mock interface.
Finally, functions are a bit faster than their OO counterparts.
|
|
On powerful systems, having this option is preferable to
XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
with other processes (-learn, -mda, -watch).
Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
-watch to get stuck until an epoch is completely processed.
|
|
The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
|
|
PublicInbox::Eml has enough functionality to replace the
Email::MIME-based PublicInbox::MIME.
|
|
This allows us to simplify some of our existing code and make
future changes easier.
I doubt anybody goes through the trouble to have a Perl
installation without zlib support. The zlib source code is even
bundled with Perl since 5.9.3 for systems without existing zlib
development headers and libraries.
Of course, zlib is also a requirement of git, too; and we're not
going to stop using git :)
[squashed: "wwwaltid: use gzipfilter up front"]
|
|
In normal mail paths, we can rely on MTAs being configured with
reasonable limits in the -watch and -mda mail injection paths.
However, the MTA is bypassed in a git-only delivery path, a BOFH
could inject a large message and DoS users attempting to mirror
a public-inbox.
This doesn't protect unindexed WWW interfaces from Email::MIME
memory explosions on v1 inboxes. Probably nobody cares about
unindexed WWW interfaces anymore, especially now that Xapian is
optional for indexing.
|
|
It hasn't been needed since commit 089cca37fa036411
("config: ignore missing config files"). And we
actually want to propagate errors when we can't
start new processes or if git(1) is missing.
|
|
I did not know to use the return value of `do' back in the day.
There's probably no practical difference in these cases, but
`eval' is overkill for these uses and may hide actual errors.
We can get rid of a few redundant `scalar' ops and pass scalar
refs to Email::MIME->new to avoid copies in a few more places,
too.
|
|
There's nothing Maildir-specific about the function, so
`maildir_path_load' was a bad name. So give it a more
appropriate name and use it in our tests.
This save ourselves some code and inconsistency by reusing an
existing internal library routine in more places. We can drop
the "From_" line in some of our (formerly) mbox sample files.
|
|
It's more convenient to specify `-c' / `--compact' on the
command-line when reindexing than it is to invoke
public-inbox-compact(1) separately.
This is especially convenient in low-space situations when
public-inbox-index is operating on multiple inboxes
sequentially, as compaction can happen immediately after
indexing each inbox, instead of waiting until all inboxes are
indexed.
|
|
Since v2 inboxes contain multiple git repositories, avoid the
use of the word "repository" when referring to inboxes as a
whole in most places.
|
|
We don't want to blow up users storage too badly when converting
v1 to v2 or break because they don't have Xapian bindings installed.
|
|
I didn't wait until September to do it, this year!
|
|
The (currently undocumented) "--no-index" flag did not trigger
the V2Writable->done call necessary to make the import
successful.
Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
|
|
Relying on implicit "@_" for shift fails with
TestCommon::_run_sub iff GetOptions modifies @ARGV.
|
|
Looking at git history, they were never used.
|
|
If we're reusing the msgmap from a v1 inbox, we also need to
ensure the highwater mark doesn't get doubled in the v1->v2
conversion by internally triggering the equivalent of
"--reindex" on a fresh v2 inbox.
This was needed to convert an indexed v1 inbox which featured
messages with multiple Message-IDs in it. Fresh, unindexed
clones of v1 inboxes would not have been affected by this.
|
|
We already load PublicInbox::Import via
PublicInbox::InboxWritable, so it's not an extra module
to load. This can give us a slight speedup in tests.
|
|
This allows us to simplify version checking by avoiding
"//" or "||" operators sprinkled around.
|
|
Some users just want to run -mda, -watch, and/or -nntpd.
Let them run just those without forcing them to pull in a
bunch of dependencies.
|
|
There's a bunch of leftover "require" and "use" statements we no
longer need and can get rid of, along with some excessive
imports via "use".
IO::Handle usage isn't always obvious, so add comments
describing why a package loads it. Along the same lines,
document the tmpdir support as the reason we depend on
File::Temp 0.19, even though every Perl 5.10.1+ user has it.
While we're at it, favor "use" over "require", since it it gives
us extra compile-time checking.
|