Date | Commit message (Collapse) |
|
A few more things happened, here.
|
|
I've learned a thing or three about btrfs in the past few
weeks and remembered some old HDD things, too.
The Xapian MultiDatabase problem will need to be addressed
for 1.7...
|
|
This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).
|
|
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
|
|
It may be too easily confused for --newsgroup or --ng. This is
too rarely used and never made it into a release, so it should
be fine.
|
|
We can reduce the need to edit the config file for NNTP group names
this way.
|
|
Slowly improving the learning curve...
|
|
Determining storage device speed and latencies doesn't
seem portable or even possible with the wide variety
of storage layers in use.
This means we need to write a tuning document and hope
users read and improve on it :P
|
|
For -index, this is a convenient way to quickly index all
inboxes after a grok-pull. Might as well support it for
rarely used commands like -compact and -xcpdb, too.
|
|
--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful for systems unable to handle parallel random
I/O but still want many shards.
There was a missing "use strict", too, which is fixed.
|
|
Converting v1 inboxes from v2 can be a painful experience
on HDD. Some of the new options in the CLI or config
file make it less painful.
|
|
Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.
We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.
We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
|
|
We parse other options, too, not just --max-size
|
|
These rarely-used commands have some caveats that needed
expanding on.
|
|
With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful. I was able to index LKML in ~16 hours on a
system that had other activity on it. The big downside was it
was eating up over 5g of RAM :x.
We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
|
|
Eventually, commonly-used commands run by the user will all
support --help / -? for user-friendliness. The changes from
up-front `use' to lazy `require' speed up `--help' by 3x or so.
|
|
We'll continue supporting `--no-sync' even if its yet-to-make it
it into a release, but the term `sync' is overloaded in our
codebase which may be confusing to new hackers and users.
None of our our code nor dependencies issue the sync(2) syscall,
either, only fsync(2) and fdatasync(2).
|
|
This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.
Instead of doing a single-pass to index both SQLite and Xapian;
this indexes them separately. The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.
Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard. Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.
Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
|
|
This allows us to speed up indexing operations to SQLite
and Xapian.
Unfortunately, it doesn't affect operations using
`xapian-compact' and the compactor API, since that doesn't seem
to support Xapian::DB_NO_SYNC, yet.
|
|
Older versions of public-inbox < 1.3.0 had subtly
different semantics around threading in some corner
cases. This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
|
|
grok-pull is still painful with serialization on an old USB 2.0
HDD, but at least it can finish with flock(1) and disabling
parallelization. While parallel "git fetch" doesn't seem so
bad, slow seeks are exacerbated by parallel reads in Xapian.
That means some updates can take days instead of hours. The
same updates take only seconds or minutes on an SSD.
|
|
Update release notes with some features in the 1.6 timeline.
We'll note the version availability of some command-line
options, it may help users who are reading the latest
documentation online but running older versions.
|
|
We'll be implementing some IMAP search/threading extensions in
IMAP and providing analogues over HTTP via JMAP.
|
|
Right now[1] the Perl upstream plan is to maintain 5 compatibility
in Perl 7 for at least 5 years[1], and perhaps drop it when Perl 8
comes along. That said, distros may pick it and maintain 5 on their
own given the vast amounts of perfectly good legacy code out there.
[1] http://nntp.perl.org/group/perl.perl5.porters/257817
[2] http://nntp.perl.org/group/perl.perl5.porters/257565
|
|
I originally proposed this rewording to address Leah's comment
but forgot to squash it in :x
Link: https://public-inbox.org/meta/20200408221741.GA10142@dcvr/
Cc: Leah Neukirchen <leah@vuxu.org>
|
|
`~/.cache/public-inbox/inline-c' is supported, nowadays
for convenience, but Inline::C usage will remain opt-in.
|
|
This simplifies the primary callers of eml_entry while only making
mknews.perl worse.
|
|
We no longer favor getline+close for streaming PSGI responses
when using public-inbox-httpd. We still support it for other
PSGI servers, though.
|
|
We can save stack space and simplify subroutine calls, here.
|
|
This will make it easier to support asynchronous blob
retrievals. The `$ctx->{nr}' counter is no longer implicitly
supplied since many users didn't care for it, so stack overhead
is slightly reduced.
|
|
Like with WwwAtomStream and MboxGz, we can bless the existing
$ctx object directly to avoid allocating a new hashref. We'll
also switch from "->" to "::" to reduce stack utilization.
|
|
This allows -httpd to handle other requests while waiting
for git to retrieve and decode blobs. We'll also break
apart t/psgi_v2.t further to ensure tests run against
-httpd in addition to generic PSGI testing.
Using xt/httpd-async-stream.t to test against clones of meta@public-inbox.org
shows a 10-12% performance improvement with the following env:
TEST_JOBS=1000 TEST_CURL_OPT=--compressed TEST_ENDPOINT=new.atom
|
|
We always return Z (UTC) times, anyways, so we'll always
use gmtime() on the seconds-after-the-epoch.
|
|
Instead of gzipping some (mbox.gz, manifest.js.gz) responses and
leaving P::M::D to do the rest, we gzip everything ourselves,
now, so P::M::D is redundant.
|
|
Since we already use inotify and EVFILT_VNODE (kqueue)
in -imapd, we might as well use them directly in -watch,
too.
This will allow public-inbox-watch to use PublicInbox::DS
for timers to watch newsgroups/mailboxes and have saner
signal handling in future commits.
|
|
For archivists with only newer mail archives, this option allows
reserving reserve NNTP article numbers for yet-to-be-archived
old messages. Indexers will need to be updated to support this
feature in future commits.
-V1 inboxes will now be initialized with SQLite and Xapian
support if this option is used, or if --indexlevel= is
specified.
|
|
On a powerful (by my standards) machine with 16GB RAM and an
7200 RPM HDD marketed for "enterprise" use, indexing a 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.
Indexing starts off fast, but progressively get slower as
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel. Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.
With a single shard, it still took around 3.5 days to index on
the HDD. That's not good, but it's far better than not
finishing after 7 days. So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.
For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size. In the past,
a higher-end consumer grade MLC SSDs on similar hardware indexed
a similarly sized-data set in ~4 hours.
|
|
Lots of big changes coming Thanks to The Linux Foundation for
sponsoring me to hack on this in 2020 :)
|
|
RFC 2683 section 3.2.1.5 recommends it:
> For its part, a server should allow for a command line of at least
> 8000 octets. This provides plenty of leeway for accepting reasonable
> length commands from clients. The server should send a BAD response
> to a command that does not end within the server's maximum accepted
> command length.
To conserve memory, we won't bother reading the entire line
before sending the BAD response and disconnecting them.
|
|
It seems to be working as far as Mail::IMAPClient is concerned.
|
|
It shares a bit of code with NNTP. It's copy+pasted for now
since this provides new ground to experiment with APIs for
dealing with slow storage and many inboxes.
|
|
There's more, but IMAP is big and complex already.
|
|
To further simplify callers and avoid embarrasing memory
explosions[1], we can finally eliminate this method in
favor of smsg_eml.
[1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5
("view: stop storing all MIME objects on large threads")
fixed a huge memory blowup.
|
|
I found myself wanting to remove a message from all inboxes
while working on a test case in another branch. I figure this
could also be useful for globally removing messages which are in
the grey area or too big for spamc.
|
|
On powerful systems, having this option is preferable to
XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
with other processes (-learn, -mda, -watch).
Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
-watch to get stuck until an epoch is completely processed.
|
|
The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
|
|
|
|
|
|
PublicInbox::Eml has enough functionality to replace the
Email::MIME-based PublicInbox::MIME.
|
|
Since some client tools exist for dealing with public-inbox
specifically, it seems like a good idea to list some of them.
Cc: Danh Doan <congdanhqx@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: Leah Neukirchen <leah@vuxu.org>
|