Date | Commit message |
|
WAL actually seems to have ideal locking characteristics given
concurrency problems I'm experiencing with --reindex running
in parallel with expensive read-only SQLite queries:
<https://public-inbox.org/meta/20200825001204.GA840@dcvr/>
Unfortunately, we cannot blindly use WAL while preserving
compatibility with existing setups nor our guarantees that
read-only daemons are indeed "read-only".
However, respect a user's choice to set WAL on their
own if they're comfortable with giving -nntpd/-httpd/-imapd
processes write permission to the directory storing SQLite DBs.
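
A minimal sketch of the "respect the admin's WAL choice" idea, written in
Python's sqlite3 module rather than the project's Perl. The helper names
(`open_db', `journal_mode') and the fallback mode are assumptions made for
illustration, not the project's actual code:

```python
import os
import sqlite3
import tempfile


def journal_mode(conn):
    """Return the current journal mode of `conn` in lower case."""
    return conn.execute("PRAGMA journal_mode").fetchone()[0].lower()


def open_db(path, default_mode="truncate"):
    """Open an SQLite DB, leaving WAL alone if an admin already enabled it.

    `default_mode` is a hypothetical fallback for non-WAL databases.
    """
    conn = sqlite3.connect(path)
    if journal_mode(conn) != "wal":
        # Not WAL: apply our own default instead of forcing WAL on the user.
        conn.execute(f"PRAGMA journal_mode = {default_mode}")
    return conn
```

WAL is recorded in the database file itself, so an admin only needs to set
it once (e.g. via the sqlite3 shell) for later opens to pick it up.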
|
|
It's fewer queries and matches what we do in OverIdx.
|
|
This file gets truncated anyhow, so it won't fragment.
|
|
A few more things happened, here.
|
|
I've learned a thing or three about btrfs in the past few
weeks and remembered some old HDD things, too.
The Xapian MultiDatabase problem will need to be addressed
for 1.7...
|
|
croak() can give more context on the failure, and setting
`PERL5OPT=-MCarp=verbose' can force a stacktrace.
|
|
We've got examples for all the other daemons, too!
|
|
There's no reason we'd want Xapian to defer flushing once we've
indexed everything belonging to a particular shard.
|
|
Expanding threads via over.sqlite3 for mbox.gz downloads without
Xapian effectively collapsing on the THREADID column leads to
repeated messages getting downloaded.
To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).
This allows admins to upgrade WWW or do --reindex in any order,
without worrying about users eating up bandwidth and CPU cycles.
|
|
Finally, the addition of THREADID for collapsing results
in Xapian lets us emulate the "mairix --threads" feature.
That is, instead of returning only the matching messages,
the entire thread is included in the downloaded mbox.gz.
This requires a "public-inbox-index --reindex" to be usable.
|
|
This is the `tid' column from over.sqlite3 and will be used for
IMAP and JMAP search (among other things).
|
|
We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host. These
methods only pass messages to a child process on the same host
OR perform operations within the same process.
|
|
Merely assigning `undef' to a scalar does not free the
underlying buffer memory of that scalar.
|
|
Unlike w3m and links, the lynx browser seems to require a `name'
attribute for `<input type=submit>' elements. Maybe some other
browsers do, too. The `name' attribute for submit elements
doesn't seem to cause any harm for w3m or links users, either,
despite not (AFAIK) being part of historical or current HTML
specs.
|
|
We can avoid importing mdocid() in several places by using
this method, simplifying callers.
|
|
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
|
|
Numbers are hard :<
|
|
We no longer read docdata.glass from anywhere in our code base.
Some adjustments were needed to t/search.t to deal with the
Xapian::WritableDatabase committing at different times, since
we no longer call our ->query from PublicInbox::SearchIdx so that
it does not need a {over_ro} field.
|
|
Another place where we can reduce kernel page cache overhead
by hitting over.sqlite3 instead of docdata.glass.
|
|
Once again, over.sqlite3 contains everything necessary for
Message-ID resolution. Also, Xapian may be completely
unnecessary with the advent of over.sqlite3, but that's for
another time.
|
|
git blob retrieval dominates on these, "&x=t" (nested) is
roughly the same due to increased overhead for ->get_percent
storage balancing out the mass-loading from SQLite.
Atom "&x=A" is sped up slightly and uses less memory in the
long-lived response.
|
|
Instead of loading one article at-a-time from over.sqlite3, we
can use SQL to mass-load IN (?,?, ...) all results with a single
SQLite query. Despite SQLite being in-process and having no
network latency, the reduction in SQL query executions from
loading multiple rows at once speeds things up significantly.
We'll keep the over->get_art optimizations from the previous
commit, since it still speeds up long-lived responses, slightly.
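
The mass-loading idea can be sketched as follows, using Python's sqlite3
module rather than the project's Perl. The table name `msg' and the columns
`num' and `ddd' are placeholders chosen for this example and may not match
the real over.sqlite3 schema:

```python
import sqlite3


def load_many(conn, nums):
    """Fetch the rows for every number in `nums` with one IN (?,?,...) query."""
    if not nums:
        return {}
    # One "?" per requested row, so values are bound and never interpolated.
    placeholders = ",".join("?" * len(nums))
    sql = f"SELECT num, ddd FROM msg WHERE num IN ({placeholders})"
    return {num: ddd for num, ddd in conn.execute(sql, list(nums))}
```

A real caller would also need to keep each batch under SQLite's limit on
bound parameters, for instance by splitting long result lists into chunks.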
|
|
This is a step towards improving kernel page cache hit rates by
relying on over.sqlite3 for document data instead of Xapian.
Some micro-optimization to over->get_art was required to
maintain performance.
|
|
Both callers of load_from_data call utf8::decode, so just
do utf8::decode in load_from_data.
|
|
We'll probably be reusing it from another package in a future commit.
|
|
Since this was already a separate package, split it off
into its own file since SearchView may not handle inbox
groups.
|
|
No need to have awkward globrefs for this.
|
|
We'll probably be adding more value columns like THREADID to sort
on.
|
|
While this is unlikely to be a problem in current practice,
keeping Xapian DBs open for long responses can interfere with
free space recovery after -compact.
In the future, it will interfere with inbox search grouping
and lead to unexpected results.
|
|
No need to localize it, here, since we can just refer to it
in the `$opt' hashref. Hopefully this improves readability
for others like it does for me.
I sometimes wonder if the concept of a stack in high-level
languages is even necessary...
|
|
This seems required to correctly get the NNTP article number
from Xapian docid on combined Xapian DBs. The default
(ASCII-betical) sorting was only acceptable for -imapd users
until somebody hit 11 (or more) shards, which is a rare case.
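
To illustrate why the default sort breaks at 11 shards, here is a small
Python sketch. It assumes shards are named by plain decimal numbers; the
helper name `sort_shards' is made up for this example:

```python
def sort_shards(names):
    """Order shard names by numeric value rather than by character code."""
    return sorted(names, key=int)


# Eleven shards, named "0" through "10".
names = [str(i) for i in range(11)]
asciibetical = sorted(names)   # "10" lands between "1" and "2"
numeric = sort_shards(names)   # "10" lands last, after "9"
```

With ten or fewer shards every name is a single digit, so both orderings
agree and the problem stays hidden.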
|
|
It may be too easily confused for --newsgroup or --ng. This is
too rarely used and never made it into a release, so it should
be fine.
|
|
We can reduce the need to edit the config file for NNTP group names
this way.
|
|
And speed those up with some lazy loading, too.
|
|
This probably won't be used much, but --help can still
make sense.
|
|
This is helpful with --all, or when multiple inboxes
are being indexed.
|
|
Slowly improving the learning curve...
|
|
Otherwise things get very confusing when verbosity is enabled :x
|
|
There may be messages in the wild with wide characters in
headers which aren't RFC2047-encoded. Assume UTF-8 so
those fields can round trip through over.sqlite3.
This doesn't affect docdata.glass in Xapian, but it does
affect how over.sqlite3 stores the same deflated info.
|
|
Determining storage device speed and latencies doesn't
seem portable or even possible with the wide variety
of storage layers in use.
This means we need to write a tuning document and hope
users read and improve on it :P
|
|
--sequential-shard offers better performance on HDD than -j0
since the on-disk active set can be kept small (with -j $HIGH_NUM).
--batch-size can also be helpful for systems with much RAM.
|
|
For -index, this is a convenient way to quickly index all
inboxes after a grok-pull. Might as well support it for
rarely used commands like -compact and -xcpdb, too.
|
|
We use IdxStack via log2stack() from SearchIdx, now.
|
|
--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful for systems unable to handle parallel random
I/O but still want many shards.
There was a missing "use strict", too, which is fixed.
|
|
Established tools like make(1), prove(1) and xargs(1) don't warn
when the desired parallelism level can't be met, either.
|
|
In case there are unbalanced shards AND we're limiting parallelism
while using many shards, spawn the next task in the queue ASAP
once a task is done, instead of waiting for all tasks to finish
before spawning the next batch.
Unbalanced shards probably aren't a big issue for most users;
however, many smaller shards with few jobs can be useful for HDD
users to reduce the effect of random writes.
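
The difference between the two strategies can be shown with a small
simulation in Python. It models only task durations and a job limit; it
does not spawn processes, and the function names are invented for this
sketch:

```python
import heapq


def batch_makespan(durations, jobs):
    """Total time when each batch must finish before the next one starts."""
    total = 0
    for i in range(0, len(durations), jobs):
        total += max(durations[i:i + jobs])
    return total


def asap_makespan(durations, jobs):
    """Total time when a queued task starts as soon as any slot frees up."""
    running = []  # min-heap of finish times for tasks currently running
    now = 0
    for d in durations:
        if len(running) == jobs:
            # All slots busy: wait only for the earliest finisher.
            now = heapq.heappop(running)
        heapq.heappush(running, now + d)
    return max(running) if running else 0
```

With one large shard among small ones, e.g. durations [5, 1, 5, 1] and two
jobs, the batched approach takes 10 time units while the ASAP approach
takes 6.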
|
|
This was omitted in 8b1950055d51d436 :x
Fixes: 8b1950055d51d436 ("index+xcpdb: rename `--no-sync' to `--no-fsync'")
|
|
We don't need to fully-qualify when referring to subs in
the same namespace, nor do we need to make a SCALAR ref only
to dereference it.
(Yes, still learning Perl :x)
|
|
We'll use our existing logic and use sqlite_backup_from_file,
which appeared in 1.39 (along with sqlite_backup_to_file).
|
|
Instead of silently ignoring excessive args, don't let a user
specify an extra directory. Furthermore, we'll support the odd
case where BOFH wants to name an $INBOX_DIR to be `0' :P
|