Date | Commit message |
|
Inboxes may be removed or newsgroups renamed over time.
Introduce a switch to do garbage collection and eliminate stale
search and xref3 results based on inboxes which remain in the
config file.
This may also clean up stale results left behind by any bugs
which leave stale data around.
This is also useful in case a clumsy BOFH (me :P) is swapping
between several PI_CONFIGs and accidentally indexes a bunch of
inboxes they didn't intend to.
|
|
v1 and v2 inbox indexing now supports graceful shutdown checks
just like ExtSearchIdx. Additionally, we'll consistently
perform quit checks at the top of loops.
Interaction with the --xapian-only and --sequential-shard
options is a bit lacking, and will warn the user to use
"--reindex --xapian-only" to fix.
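
A minimal sketch of the "quit check at the top of the loop" pattern,
written in Python for illustration rather than taken from the Perl code
in this change. `Indexer', `request_quit' and `install_handlers' are
hypothetical names; the signal handler only records the request and
the loop decides when it is safe to stop:

```python
import signal


class Indexer:
    """Hypothetical stand-in for an indexing loop; not public-inbox code."""

    def __init__(self, items):
        self.items = list(items)
        self.done = []
        self.quit_requested = False

    def request_quit(self, signum=None, frame=None):
        # Signal handler: only record the request, never stop mid-item.
        self.quit_requested = True

    def run(self):
        for item in self.items:
            # Quit check at the top of the loop, before starting new work.
            if self.quit_requested:
                break
            self.done.append(item)
        return self.done


def install_handlers(indexer):
    # Route SIGINT/SIGTERM to the flag so in-flight work can finish cleanly.
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, indexer.request_quit)
```

Checking only at the top of each iteration means an item is either
fully processed or not started, which is what makes the shutdown
graceful.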
|
|
This should make it possible for us to quickly generate
manifest.js.gz files with less random I/O and process
spawning in the WWW code.
|
|
Since extindex is entirely new, it doesn't have backwards
compatibility concerns and never stored docdata, anyways.
|
|
Calling PublicInbox::Admin::index_prepare is required for
--batch-size (k|m|g) modifiers and indexBatchSize in the config
file. Otherwise, the default 1m batch size stuck and led
to unexpectedly bad performance on a machine which could index
v2 inboxes faster with larger batch sizes.
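
To illustrate what the (k|m|g) modifiers mean, here is a small Python
sketch of suffix parsing. It is not the code in
PublicInbox::Admin::index_prepare; the function name `parse_batch_size'
is hypothetical, and the 1024-based multipliers are an assumption:

```python
import re

# Assumed multipliers: binary (1024-based) units.
_MULT = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_batch_size(text):
    """Turn a string such as '10m' into a byte count (hypothetical helper)."""
    m = re.fullmatch(r"([0-9]+)([kmg]?)", text.strip().lower())
    if m is None:
        raise ValueError(f"invalid batch size: {text!r}")
    return int(m.group(1)) * _MULT[m.group(2)]
```

If this parsing step is skipped, a value like "10m" never becomes a
number, which matches the symptom described above: the 1m default
stays in effect.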
|
|
Matching the behavior of git-fast-import(1), we'll allow a user
to send SIGUSR1 to checkpoint over.sqlite3 and Xapian.
|
|
With async git blob retrievals, the OID being enqueued and the
OID being processed can be totally unrelated and misleading.
We'll also prefix $INBOX_DIR for v2, and not just the epoch
since we could be indexing multiple inboxes via both -index
and -extindex.
|
|
"eindex" rhymes with "reindex", which could be confusing;
so rename the command and config prefix to "extindex", which
is hopefully less confusing.
|
|
This doesn't do anything, yet, but it will once the rest
of the eindex stuff works.
|
|
Not documented, yet, but it runs...
|
|
It seems easiest to have a singleton Gcf2Client object
per daemon worker for all inboxes to use. This reduces overall
FD usage from pipes.
The `public-inbox-gcf2' command + manpage are gone and a `$^X'
one-liner is used, instead. This saves inodes for internal
commands and hopefully makes it easier to avoid mismatched
PERL5LIB include paths (as noticed during development :x).
We'll also make the existing cat-file process management
infrastructure more resilient to BOFHs on process killing
sprees (or in case our libgit2-based code fails on us).
(Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd
won't automatically benefit from this change, and extra
configuration will be required (to be documented later).
|
|
This amortizes the cost of recreating PublicInbox::Gcf2 objects
when alternates change in v2 all.git.
|
|
Since we only get OIDs from trusted local data sources
(over.sqlite3), we can safely retry within the -gcf2 process
without worrying about clients spamming us with requests for
invalid OIDs and triggering reopens.
|
|
This should be able to replace multiple `git cat-file' for blob
retrieval, but adjustments may be needed.
|
|
Unfortunately, I'm not sure how easy it is to catch these at
compile time. Prototypes do not seem to check these
at compile time when crossing packages (not even with
exported subroutines).
|
|
Following "git init" as an example, we'll create every parent
path up to the one specified, instead of attempting to continue
on when Cwd::abs_path returns `undef'.
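
The same idea sketched in Python, as an analogy only (the actual change
is in Perl; `prepare_inbox_dir' is a hypothetical name): create the
full path first, then resolve it, so the resolution step never sees a
missing parent.

```python
import os


def prepare_inbox_dir(path):
    """Create path and all missing parents, then return its resolved form."""
    # Like `git init some/new/dir': every missing parent is created.
    os.makedirs(path, exist_ok=True)
    # Resolution happens only after the directory is known to exist.
    return os.path.realpath(path)
```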
|
|
And avoid unnecessary POD markup in the man page.
|
|
"use Getopt::Long" doesn't seem too slow on a hot page cache,
and it's probably used frequently enough to be in cache.
We'll also start reducing the amount of markup in the .pod and
favoring verbatim text in documentation for readability in
source form, since the bold text seems excessive.
|
|
`-h' doesn't conflict with anything, and some users (including
git users) may be more accustomed to using it rather than the
rarely-seen-outside-of-Getopt::Long `-?' switch.
We can also rely on the GetOptions() function to emit a proper
error message instead of just "bad command-line args".
|
|
And while we're at it, note edit is *destructive* to encourage
reading the fine manual.
|
|
It's useful to mark that they're meant to be executable, even
if the shebang is useless.
|
|
Otherwise, users may be frustrated to discover it missing
after a long indexing run.
|
|
Sometimes it may not be apparent when/if a signal is
processed; this hopefully improves the situation.
We'll also change the process title when we're quitting
to better inform users.
|
|
This is no longer limited to Maildirs now that IMAP and NNTP
support exists; so give it a shorter name.
|
|
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
|
|
It may be too easily confused for --newsgroup or --ng. This is
too rarely used and never made it into a release, so it should
be fine.
|
|
We can reduce the need to edit the config file for NNTP group names
this way.
|
|
And speed those up with some lazy loading, too.
|
|
This probably won't be used much, but --help can still
make sense.
|
|
For -index, this is a convenient way to quickly index all
inboxes after a grok-pull. Might as well support it for
rarely used commands like -compact and -xcpdb, too.
|
|
--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful on systems which are unable to handle parallel
random I/O but where many shards are still wanted.
There was a missing "use strict", too, which is fixed.
|
|
This was omitted in 8b1950055d51d436 :x
Fixes: 8b1950055d51d436 ("index+xcpdb: rename `--no-sync' to `--no-fsync'")
|
|
We'll use our existing logic and use sqlite_backup_from_file,
which appeared in 1.39 (along with sqlite_backup_to_file).
|
|
Instead of silently ignoring excessive args, don't let a user
specify an extra directory. Furthermore, we'll support the odd
case where BOFH wants to name an $INBOX_DIR to be `0' :P
|
|
Lazy-loading dependencies speeds up --help by several hundred
milliseconds and is a huge step towards user-friendliness.
|
|
Converting v1 inboxes to v2 can be a painful experience
on HDD. Some of the new options in the CLI or config
file make it less painful.
|
|
Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.
We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.
We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
|
|
We can use open(..., undef) natively in Perl in t/import.t.
In places where we need a pathname, the File::Temp OO API
gives us auto-unlinking for free.
|
|
This is to avoid user error with a currently undocumented switch,
since --xapian-only always goes through the full history at
the moment.
|
|
Eventually, commonly-used commands run by the user will all
support --help / -? for user-friendliness. The changes from
up-front `use' to lazy `require' speed up `--help' by 3x or so.
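
The effect of moving from up-front `use' to lazy `require' can be
sketched with a Python analogy (not the Perl code from this change;
`main' and the usage string are hypothetical). The help path returns
before any expensive module is loaded:

```python
USAGE = "usage: example-index [--help] INBOX_DIR"


def main(argv):
    # Cheap path: answer --help / -? before loading anything expensive.
    if "--help" in argv or "-?" in argv:
        return USAGE
    # Deferred import: this cost is only paid when real work happens.
    import sqlite3
    conn = sqlite3.connect(":memory:")
    try:
        version = conn.execute("SELECT sqlite_version()").fetchone()[0]
    finally:
        conn.close()
    return "sqlite " + version
```

The trade-off is that a missing dependency is reported later, at the
point of use, rather than at startup.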
|
|
If XAPIAN_FLUSH_THRESHOLD is unset, Xapian will default to
10000. That limits the effectiveness of users specifying
extremely large values of --batch-size.
While we're at it, localize the changes to globals since -index
may be eval-ed in tests (and perhaps production code in the
future).
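
A hedged Python sketch of localizing such a change. It assumes the fix
is to raise XAPIAN_FLUSH_THRESHOLD only when the user has not set it,
and to restore the environment afterwards; `flush_threshold' is a
hypothetical name and this is not the Perl code from this change:

```python
import os
from contextlib import contextmanager


@contextmanager
def flush_threshold(value):
    """Temporarily set XAPIAN_FLUSH_THRESHOLD unless the user already did."""
    key = "XAPIAN_FLUSH_THRESHOLD"
    old = os.environ.get(key)
    if old is None:
        # Unset: apply our value for the duration of the block only.
        os.environ[key] = str(value)
    try:
        yield os.environ[key]
    finally:
        if old is None:
            # Undo our change so later code in the same process is unaffected.
            os.environ.pop(key, None)
```

Restoring the previous state matters when the indexer runs inside a
longer-lived process, such as a test harness that evaluates it
repeatedly.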
|
|
Since the --compact switch works on Xapian shards,
it makes sense that --sequential-shard affects our
usage of xapian-compact(1).
|
|
We'll continue supporting `--no-sync' even though it has yet to make
it into a release, but the term `sync' is overloaded in our
codebase, which may be confusing to new hackers and users.
Neither our code nor our dependencies issue the sync(2) syscall;
they only use fsync(2) and fdatasync(2).
|
|
This is useful for speeding up indexing runs when only Xapian
rules change but SQLite indexing doesn't change. This mostly
implies `--reindex', but does NOT pick up new messages (because
SQLite indexing needs to occur for that).
I'm leaving this undocumented in the manpage for now since it's
mainly to speed up development and testing. Users upgrading to
1.6.0 will be advised to `--reindex --rethread', anyways, due to
the threading improvements since 1.1.0-pre1.
It may make sense to document for 1.7+ when there are Xapian-only
indexing changes, though.
|
|
This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.
Instead of doing a single pass to index both SQLite and Xapian,
this indexes them separately. The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.
Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard. Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.
Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
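
The pass structure described above can be sketched as follows, in
Python for illustration only. `plan_passes' is a hypothetical name, and
assigning a document to a shard by `docid % nshards' is an assumption
made for the sketch:

```python
def plan_passes(docids, nshards):
    """Return (label, docids) pairs: one SQLite pass, then one per shard."""
    docids = list(docids)  # allow any iterable; we scan it several times
    # First pass: SQLite only (over.sqlite3 and msgmap.sqlite3).
    passes = [("sqlite", docids)]
    # Subsequent passes: one Xapian shard at a time, touching only the
    # documents that belong to that shard.
    for shard in range(nshards):
        mine = [d for d in docids if d % nshards == shard]
        passes.append((f"xapian shard {shard}", mine))
    return passes
```

Each Xapian pass touches a single shard, so with enough shards the
working set of a pass can stay within the page cache.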
|
|
Thanks to the GCC compile farm project, we can wire up syscalls
for sparc64 and set system-specific SFD_* constants properly.
I've FINALLY figured out how to use POSIX::SigSet to generate
a usable buffer for the syscall perlfunc. This is required
for endian-neutral behavior and relevant to sparc64, at least.
There's no need for signalfd-related stuff to be constants,
either. signalfd initialization is never a hot path and a stub
subroutine for constants uses several KB of memory in the
interpreter.
We'll drop the needless SEEK_CUR import while we're importing
O_NONBLOCK, too.
|
|
We used ->header_obj in the past as an optimization with
Email::MIME. That optimization is no longer necessary
with PublicInbox::Eml.
This doesn't make any functional difference even if we were to
go back to Email::MIME. However, it reduces the amount of code
we have and slightly reduces allocations with PublicInbox::Eml.
|
|
This is more accurate given we use PublicInbox::Eml instead
of Email::MIME/PublicInbox::MIME, nowadays.
|
|
Tests for failures should not leave junk temporary files lying
around in a user's ~/.public-inbox/.
On a side note, I'm not sure if PI_DIR is or was ever
necessary. It's never been documented, so perhaps
using $HOME for this is better...
|
|
And -compact supports --jobs=0 like -index to disable parallel
execution. Running three xapian-compact processes in parallel
on a USB 2.0 HDD is pretty painful.
|