Date | Commit message (Collapse) |
|
The Xapian SWIG bindings are favored by Xapian upstream for
ease-of-maintenance compared to the XS version. While Debian
lags on this front, the SWIG bindings are widely available
on all *BSDs.
|
|
Reported-by: Kyle Meyer <kyle@kyleam.com>
Link: https://public-inbox.org/meta/87leeovmig.fsf@kyleam.com/
|
|
This enables Xapian::DB_DANGEROUS to support in-place updates.
This can speed up the initial index and reduce I/O at the cost
of preventing concurrent readers and being unsafe in the face of
any abnormal terminations. This is more dangerous than
--no-fsync. --no-fsync is only unsafe in the event of a power
loss or kernel crash; --dangerous is unsafe even on SIGKILL.
|
|
This lets administrators reindex specific time ranges
according to git "approxidate" formats. These arguments
are passed directly to underlying git-log(1) invocations
and may still reach into old epochs.
Since these options rely on git committer dates (which we infer
from the most recent Received: header), they are not guaranteed
to be strictly tied to git history and it's possible to
over/under-reindex some messages. It's probably not a major
problem in practice, though; reindexing a few extra messages
is generally harmless aside from some extra device wear.
Since this currently relies on git-log, these options do not
affect -extindex, yet.
|
|
e226f18934eb7291 modified the lei-q manpage so that each variant of an
option gets a dedicated =item to make L</--xyz> look nicer and to
follow the Perl core documentation. Do the same for the other
manpages.
Note that this still leaves the variants of an option grouped in one
scenario: when a list of options without descriptions is presented as
a pointer to another location. Splitting the variants in that case
would make it harder for the reader to tell what the distinct options
are.
|
|
v2 onions are insecure, deprecated and going away. v3 names are
unfortunately longer and more difficult to remember, but should
be more resistant to attack than v2 ones.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
In most cases, this ensures users will only have to opt-in to
using -extindex once and won't have to issue extra commands
to keep external indices up-to-date when using
public-inbox-index.
Since we support arbitrary numbers of external indices for
ease-of-development, we'll support repeating "-E"
("--update-extindex=") in case users want to test changes in
parallel.
|
|
I should've dropped "PENDING" notes before the 1.6 release;
they're dropped now, and a note is added to remind my future
self to drop them before 1.7.
|
|
And change the documentation reference in -tuning to
point to the -index manpage while we're at it.
|
|
I've learned a thing or three about btrfs in the past few
weeks and remembered some old HDD things, too.
The Xapian MultiDatabase problem will need to be addressed
for 1.7...
|
|
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
|
|
For -index, this is a convenient way to quickly index all
inboxes after a grok-pull. Might as well support it for
rarely used commands like -compact and -xcpdb, too.
|
|
Move away from hard-to-read alllowercase naming and favor
snake_case or separated-by-dashes.
We'll keep `--indexlevel' as-is for now, since it's been around
for several releases; but we'll support `--index-level' in the
CLI and update our documentation in a few months.
We'll also clarify that publicInbox.indexMaxSize is only
intended for -index, and not -watch or -mda.
|
|
We parse other options, too, not just --max-size
|
|
With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful. I was able to index LKML in ~16 hours on a
system that had other activity on it. The big downside was it
was eating up over 5g of RAM :x.
We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
|
|
Eventually, commonly-used commands run by the user will all
support --help / -? for user-friendliness. The changes from
up-front `use' to lazy `require' speed up `--help' by 3x or so.
|
|
We'll continue supporting `--no-sync' even if its yet-to-make it
it into a release, but the term `sync' is overloaded in our
codebase which may be confusing to new hackers and users.
None of our our code nor dependencies issue the sync(2) syscall,
either, only fsync(2) and fdatasync(2).
|
|
This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.
Instead of doing a single-pass to index both SQLite and Xapian;
this indexes them separately. The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.
Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard. Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.
Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
|
|
This allows us to speed up indexing operations to SQLite
and Xapian.
Unfortunately, it doesn't affect operations using
`xapian-compact' and the compactor API, since that doesn't seem
to support Xapian::DB_NO_SYNC, yet.
|
|
Older versions of public-inbox < 1.3.0 had subtly
different semantics around threading in some corner
cases. This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
|
|
grok-pull is still painful with serialization on an old USB 2.0
HDD, but at least it can finish with flock(1) and disabling
parallelization. While parallel "git fetch" doesn't seem so
bad, slow seeks are exacerbated by parallel reads in Xapian.
That means some updates can take days instead of hours. The
same updates take only seconds or minutes on an SSD.
|
|
Update release notes with some features in the 1.6 timeline.
We'll note the version availability of some command-line
options, it may help users who are reading the latest
documentation online but running older versions.
|
|
On powerful systems, having this option is preferable to
XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
with other processes (-learn, -mda, -watch).
Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
-watch to get stuck until an epoch is completely processed.
|
|
As an established project (:P), it's important to document when
new features appear in manpages. Users may be reading new
documentation online which doesn't reflect an older version they
have installed.
|
|
In normal mail paths, we can rely on MTAs being configured with
reasonable limits in the -watch and -mda mail injection paths.
However, the MTA is bypassed in a git-only delivery path, a BOFH
could inject a large message and DoS users attempting to mirror
a public-inbox.
This doesn't protect unindexed WWW interfaces from Email::MIME
memory explosions on v1 inboxes. Probably nobody cares about
unindexed WWW interfaces anymore, especially now that Xapian is
optional for indexing.
|
|
It's more convenient to specify `-c' / `--compact' on the
command-line when reindexing than it is to invoke
public-inbox-compact(1) separately.
This is especially convenient in low-space situations when
public-inbox-index is operating on multiple inboxes
sequentially, as compaction can happen immediately after
indexing each inbox, instead of waiting until all inboxes are
indexed.
|
|
Since v2 inboxes contain multiple git repositories, avoid the
use of the word "repository" when referring to inboxes as a
whole in most places.
|
|
I didn't wait until September to do it, this year!
|
|
It's likely a user will be low on space after running --reindex,
so recommend the use of public-inbox-compact afterwards.
And add a few more notes about using public-inbox-compact to
clarify it's for inboxes-only (and not any old Xapian DBs) that
using xapian-compact(1) directly is error-prone and likely to
break things.
|
|
* origin/edit:
edit: unlink temporary file when done
v2writable: replace: kill git processes before reindexing
edit: drop unwanted headers before noop check
edit|purge: improve output on rewrites
edit: new tool to perform edits
doc: document the --prune option for -index
admin: expose ->config
AdminEdit: move editability checks from -purge
admin: beef up resolve_inboxes to handle purge options
purge: start moving common options to AdminEdit module
admin: remove warning arg for unconfigured inboxes
v2writable: implement ->replace call
import: switch to "replace_oids" interface for purge
import: extract_author_info becomes extract_commit_info
v2writable: consolidate overview and indexing call
|
|
|
|
We've had it around for a while, but I forgot to document it :x
|
|
Oops :x
|
|
-index documentation avoid redundant v1 information and refers
readers to apropriate v1/v2 manpages. Search::Xapian can also
be optional, now, as only the PSGI search interface uses it.
Favor "INBOX_DIR" where appropriate, since "REPO_DIR" can be
confused for code repos which we also support.
XAPIAN_FLUSH_THRESHOLD is documented for all relevant
bulk commands.
|
|
No functional changes, yet, but this makes future changes
easier-to-read.
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
And include it into the build + website
|
|
Hopefully more folks can download and run public-inbox,
nowadays.
|