Date | Commit message |
|
getpid() isn't cached by glibc nowadays and system calls are
more expensive due to CPU vulnerability mitigations. To
ensure we switch to the new semantics properly, introduce
a new `on_destroy' function to simplify callers.
Furthermore, OnDestroy correctness is often tied to the
process which creates it, so make the new API default to
being guarded against running in subprocesses.
For cases which require running in all children, a new
PublicInbox::OnDestroy::all call is provided.
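
A rough illustration of the guard described above, written in Python rather than the project's Perl. The `OnDestroy` class and its `all_processes` flag are hypothetical stand-ins and not the real PublicInbox::OnDestroy API; the sketch only assumes the callback should fire in the process that created the object unless told otherwise:

```python
import os


class OnDestroy:
    """Run a callback at destruction, by default only in the creating process."""

    def __init__(self, cb, *args, all_processes=False):
        self.cb = cb
        self.args = args
        # None means "run in every process"; otherwise remember the creator.
        self.owner_pid = None if all_processes else os.getpid()

    def __del__(self):
        cb, self.cb = self.cb, None
        if cb is None:
            return
        # Skip the callback in forked children unless explicitly allowed.
        if self.owner_pid is None or self.owner_pid == os.getpid():
            cb(*self.args)
```

In this sketch a forked child that inherits the object would see a different `os.getpid()` and silently skip the callback, which is the "guarded" default the message describes.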
|
|
We use this in various places to minimize or maximize pipe
size on Linux. So keep it all in one place.
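
As a sketch of what such a single helper might look like (in Python, not the project's Perl; the name `set_pipe_size` is made up for illustration), assuming Linux's F_SETPIPE_SZ fcntl command and treating failure as non-fatal:

```python
import fcntl
import os

# Linux-only fcntl command; fall back to the numeric value where Python
# does not expose the named constant.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)


def set_pipe_size(fd, nbytes):
    """Ask the kernel to resize the pipe behind fd.

    Returns the capacity the kernel actually chose, or None if the
    request was refused or the platform lacks F_SETPIPE_SZ.
    """
    try:
        return fcntl.fcntl(fd, F_SETPIPE_SZ, nbytes)
    except OSError:
        return None
```

The kernel may round the request up, so callers should use the returned value rather than assume the requested size was applied.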
|
|
We'll also drop the "\n" for die() to make diagnostics easier.
There are no known bugs in this area, just consistency
improvements and LoC reduction.
|
|
Just something I noticed while considering using this package
for CodeSearchIdx.
|
|
->newsgroup_matches was never used, and ->shard_over_check
was dropped in 89193578d21f (extindex: --gc checkpoints, 2021-10-06).
|
|
Xapian shard cleanup only requires read-only access to
over.sqlite3, so avoid opening it with read-write access since
create_tables will hit lock conflicts on "INSERT OR IGNORE"
statements.
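
The idea of opening the database read-only can be sketched with Python's sqlite3 module (the project itself uses Perl); `open_readonly` is a hypothetical helper name, and the sketch assumes the file already exists:

```python
import pathlib
import sqlite3


def open_readonly(path):
    """Open an existing SQLite file through a connection that cannot write."""
    # SQLite URI filenames accept mode=ro to request read-only access.
    uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Any write attempted through such a connection fails with an error instead of competing for the write lock.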
|
|
We can more clearly distinguish between v1 and v2-only code
paths this way, and may be able to save a few cycles as well.
|
|
This fixes a performance regression in multi-process v2 indexing
due to the switch to PublicInbox::IPC. While Unix sockets are
fewer FDs to manage, pipes allow unprivileged processes to use
larger buffers (up to 1M) on out-of-the-box Linux instances.
A larger buffer via F_SETPIPE_SZ afforded by pipes was proven
valuable during v2 development in 2018 and continues to be
valuable when we get significant amounts of one-way traffic from
the producer parent to worker children.
Compression may be an option for systems without F_SETPIPE_SZ;
but it increases CPU usage with no memory bandwidth savings on
hosts where larger buffers are available.
|
|
Since Storable and Sereal are designed for lossless
serialization, we'll just pass $eml objects to whatever process
is running SearchIdx.
|
|
We can remove some now-pointless wrapper functions by using
->ipc_do in even more places.
|
|
It's nice to prove the new code works by swapping it into
the current V2Writable / SearchIdxShard packages. This is
only the first step for the core bits, and we'll be able
to delete more code in a subsequent patch.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
The daemon for the local email interface will be inside
the DS->EventLoop. -watch currently doesn't trigger this
bug since it doesn't enable parallelism, but it may in
the future.
|
|
Still unstable, this builds off the equally unstable extindex :P
This will be used for caching/memoization of traditional mail
stores (IMAP, Maildir, etc) while providing indexing via Xapian,
along with compression, and checksumming from git.
Most notably, this adds the ability to add/remove per-message
keywords (draft, seen, flagged, answered) as described in the
JMAP specification (RFC 8621 section 4.1.1).
We'll use `.' (a single period) as an $eidx_key since it's an
invalid {inboxdir} or {newsgroup} name.
|
|
This overdue change fixes {current_info} to not inject a newline
into every warning message.
Simpler code helps us avoid bugs and the need to make
fixes like commit 44de182766037948d62bc2a8ba924de2264dd5fc
("searchidxshard: chomp $eidx_key from pipe").
|
|
Since we're inside a Xapian transaction, calling ->index_raw
followed by ->shard_add_eidx_info calls on the same docid
doesn't seem to hurt indexing performance. It definitely
reduces FS read traffic and IPC from git at the cost of some
more IPC between the parent and workers. Nevertheless, the code
and FD reductions seem worth it.
|
|
--reindex allows us to catch missed and stale messages due to
-extindex vs -index races prior to commit 02b2fcc46f364b51
("extsearchidx: enforce -index before -extindex").
We'll also rely on reindex to internally deal with v1/v2 inbox
removals and partial-unindexing of messages which are only
removed from one inbox out of many.
This reindex design is completely different from how normal
v1/v2 inbox reindex operates due to extindex having multiple
histories to work with. Instead of scanning git history, this
relies exclusively on comparing over.sqlite3 contents between
the v1/v2 inboxes and the extindex.
Changes to Xapian behavior also get picked up, now. Xapian indexing
is handled by workers with minimal IPC to the parent process.
This results in more read I/O but fewer writes when dealing
with cross-posted messages.
Changes to $smsg->populate and --rethread still need further
work.
|
|
This improves consistency with sibling methods such as
->shard_remove_eidx_info and ->add_xref3. Passing the
$eidx_key scalar is preferable to the entire $ibx object
for IPC-friendliness.
|
|
Xapian docids have been tied to the over {num} column for
nearly 3 years, now; and OIDs are no longer stored in Xapian
document data. There's no need to increase code and IPC
complexity by passing the OID around.
|
|
Inboxes may be removed or newsgroups renamed over time.
Introduce a switch to do garbage collection and eliminate stale
search and xref3 results based on inboxes which remain in the
config file.
This may also fix up stale results left over from any bugs
which failed to clean up after themselves.
This is also useful in case a clumsy BOFH (me :P) is swapping
between several PI_CONFIGs and accidentally indexed a bunch of
inboxes they didn't intend to.
|
|
We were accidentally adding "\n" to terms (which Xapian happily
accepts), causing incompatibilities when enabling parallel
sharding in some invocations of -extindex but not others.
This is an extindex incompatibility and starting a new extindex
will be required to take advantage of in-development features,
so it's not urgent to start another one, either.
(other incompatible things may happen before a 1.7 release)
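
To illustrate why a stray "\n" matters, here is a small Python sketch (not the project's Perl). The "O" prefix and both function names are placeholders chosen for the example; the only assumption is that a term is a prefix concatenated with a key read from a line-oriented pipe:

```python
def read_key(line):
    """Strip one trailing newline from a line read off a pipe (like chomp)."""
    return line[:-1] if line.endswith("\n") else line


def make_term(prefix, key):
    """Build a search term by prepending a prefix to the key."""
    return prefix + key
```

A key that went through the pipe without being stripped yields a different term from the same key used in-process, so the two indexing modes would never match each other's documents.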
|
|
Just like the daemon processes, -extindex now supports graceful
shutdown via the same signals. This lets users avoid having to
repeat indexing messages when a power outage strikes during a
long (multi-hour/day) indexing run.
Per-inbox (v1/v2) -index graceful shutdowns are not supported
yet, but are planned for later.
|
|
Add a space after \0 to visually disambiguate it from the
{bytes} field.
|
|
We use ->autoflush(1) on this pipe to ensure the shard workers
see data immediately on print; so this means we have to do our
own buffering for optional data.
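
A minimal Python sketch of that kind of manual buffering (the project is Perl; `send_record` and its arguments are hypothetical). It assumes the goal is for the reader to see a complete record at once instead of one small write per field:

```python
import os


def send_record(wfd, required, optional=()):
    """Assemble one record in memory, then push it out in as few writes as possible."""
    buf = b"".join([required, *optional])
    view = memoryview(buf)
    # os.write may accept fewer bytes than offered, so loop until done.
    while len(view) > 0:
        n = os.write(wfd, view)
        view = view[n:]
    return len(buf)
```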
|
|
Seeing "Xorg.foo.bar" can be confusing in warnings if the
eidx_key is only "org.foo.bar" with no relation to "Xorg" at
all. Furthermore, printing "\0" to log or terminal output isn't
very nice and could throw off some users/tools.
|
|
It doesn't seem worth storing xref3 data in Xapian now that
the same info is in over.sqlite3.
|
|
Having a special init path for external indices is probably
easier than further overloading SearchIdx->new initialization
to work without an Inbox object.
|
|
This is preferable to open-coding "newsgroup // inboxdir" everywhere.
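
For readers unfamiliar with Perl's defined-or operator, the fallback can be sketched in Python as follows. The dictionary-based inbox and the function name are illustrative only; the sketch assumes "unset" is represented by None:

```python
def eidx_key(inbox):
    """Return the newsgroup if one is set, otherwise the inbox directory."""
    newsgroup = inbox.get("newsgroup")
    # Test for "is set", not truthiness, to mirror defined-or semantics.
    return newsgroup if newsgroup is not None else inbox["inboxdir"]
```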
|
|
We don't need to keep it in code paths which are guaranteed to
only see PublicInbox::Eml (and not Email::MIME or PublicInbox::MIME
which did not round-trip properly). However, we must set
{raw_bytes} since PublicInbox::Eml may add an extra "\n" for
rare messages with no bodies.
|
|
This will be used to track cross-posted messages in the
external/detached index.
|
|
This is the `tid' column from over.sqlite3; and will be used for
IMAP and JMAP search (among other things).
|
|
We'll also rename the /^remote_/ prefix to "shard_", since
remote implies the process is on a different host. These
methods only pass messages to a child process on the same host
OR perform operations within the same process.
|
|
Merely assigning `undef' to a scalar does not free
the scalar's underlying buffer memory.
|
|
Since we no longer read document data from Xapian, allow users
to opt-out of storing it.
This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
|
|
This is useful for speeding up indexing runs when only Xapian
rules change but SQLite indexing doesn't change. This mostly
implies `--reindex', but does NOT pick up new messages (because
SQLite indexing needs to occur for that).
I'm leaving this undocumented in the manpage for now since it's
mainly to speed up development and testing. Users upgrading to
1.6.0 will be advised to `--reindex --rethread', anyways, due to
the threading improvements since 1.1.0-pre1.
It may make sense to document for 1.7+ when there's Xapian-only
indexing changes, though.
|
|
The "xdb" prefix was inaccurate since it's used by
indexlevel=basic, which is Xapian-free. The '_' (underscore)
prefix was also wrong for a method which is called across
package boundaries.
|
|
This is a step which makes our use of abbreviations more
consistent when referring to PublicInbox::Inbox objects.
We'll also be reducing the number of redundant fields
in SearchIdx and V2Writable code paths to make the
object graph easier-to-follow.
|
|
Since over.sqlite3 seems here to stay, we no longer need to do
Message-ID lookups against Xapian and can simply rely on the
docid <=> NNTP article number equivalency SCHEMA_VERSION=15
gave us.
This rids us of the closure-using batch_do sub in the v1
code path and vastly simplifies both v1 and v2 unindexing.
|
|
We only support Unix-like platforms where binmode (":raw") is
the default anyways, and v5.10 semantics means it won't do
unicode_strings (unlike v5.12). So save some lines of code.
|
|
The "5.010_001" form was for Perl 5.6, which I doubt anybody
would attempt; so favor "v5.10.1" as it is more readable to
humans. Prefer "parent" to "base" since the former is lighter.
We'll also rely on warnings from "-w" globally (or not) instead
of via "use".
We'll also update "use" statements to reflect what's actually
used by V2Writable.
|
|
NNTP and IMAP both require CRLF conversions on the wire.
They're also the only components which care about
$smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3
and Xapian DBs.
This will allow us to optimize RFC822.SIZE fetch item in IMAP
without triggering size mismatch errors in some clients' default
configurations (e.g. Mail::IMAPClient), but not most others.
It could also fix hypothetical problems with NNTP clients that
report discrepancies between overview and article data.
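
One way to compute such a CRLF-adjusted size, sketched in Python rather than the project's Perl. The function name is hypothetical, and the sketch assumes only bare LFs (those not already preceded by CR) gain a byte on the wire:

```python
import re

# Matches a "\n" that is not already part of a "\r\n" pair.
_BARE_LF = re.compile(rb"(?<!\r)\n")


def crlf_size(raw):
    """Size of raw once every bare LF has been expanded to CRLF."""
    return len(raw) + len(_BARE_LF.findall(raw))
```

Storing this value up front means the size reported to a client can match the bytes actually sent without re-reading the message.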
|
|
In our inbox-writing code paths, ->getline as an OO method may
be confused with the various definitions of `getline' used by
the PSGI interface. It's also easier to do: "perldoc -f readline"
than to figure out which class "->getline" belongs to (IO::Handle)
and lookup documentation for that.
->print is less confusing than the "readline" vs "getline"
mismatch, but we can still make it clear we're using a real
file handle and not a mock interface.
Finally, functions are a bit faster than their OO counterparts.
|
|
PublicInbox::Eml has enough functionality to replace the
Email::MIME-based PublicInbox::MIME.
|
|
Message-IDs can apparently contain spaces and other weird
characters. Ensure we pass those properly to shard subprocesses
when importing messages in parallel mode.
Our NNTP request parser does not deal with spaces in the
Message-ID, yet, and I don't expect most NNTP clients to,
either. Nor does the Net::NNTP client handle them in responses.
|
|
For sharded v2 repositories with few-enough messages, it is
possible for shard[0] to go unused and never trigger the
->commit_txn_lazy to set the indexlevel field in Xapian
metadata.
So set it immediately at initialization and avoid this case.
While we're at it, avoid triggering needless pwrite syscalls
from ->set_metadata by checking with ->get_metadata, first.
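
The read-before-write check can be sketched in Python with an in-memory stand-in for the database. `MetaStore` and `set_metadata_once` are invented for this example; the sketch assumes, as Xapian does, that a missing key reads back as an empty string:

```python
class MetaStore:
    """Toy metadata store that counts how many writes it performs."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def get_metadata(self, key):
        return self.data.get(key, "")

    def set_metadata(self, key, value):
        self.writes += 1
        self.data[key] = value


def set_metadata_once(store, key, value):
    """Write only when the stored value differs; return True if a write happened."""
    if store.get_metadata(key) == value:
        return False
    store.set_metadata(key, value)
    return True
```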
|
|
We can finally get rid of the awkward, ad-hoc use of V2Writable,
SearchIdx, and OverIdx args for passing {cotime} and {autime}
between classes.
We'll still use those git time fields internally within
V2Writable and SearchIdx for (re)indexing, but that's not
worth avoiding as a fallback.
|
|
We can pass fewer order-dependent args to V2Writable::do_idx and
SearchIdxShard::index_raw by passing the smsg object, instead.
|
|
We can pass blessed PublicInbox::Smsg objects to internal
indexing APIs instead of having long parameter lists in some
places. The end goal is to avoid parsing redundant information
each step of the way and hopefully make things more
understandable.
|
|
When indexing messages without Date: and/or Received: headers,
fall back to using timestamps originally recorded by git in the
commit object. This allows git mirrors to preserve the import
datestamp and timestamp of a message according to what was fed
into git, instead of blindly falling back to the current time.
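
A simplified Python sketch of this fallback (the project is Perl, and `message_time` is a hypothetical name). It only looks at the Date: header, ignoring Received:, and assumes the caller already has the commit timestamp as an integer:

```python
import email
from email.utils import parsedate_to_datetime


def message_time(raw, commit_time):
    """Return the Date: header as epoch seconds, else the git commit time."""
    msg = email.message_from_string(raw)
    date = msg.get("Date")
    if date:
        try:
            return int(parsedate_to_datetime(date).timestamp())
        except (TypeError, ValueError):
            pass  # unparseable header: fall through to the commit time
    return commit_time
```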
|
|
I didn't wait until September to do it, this year!
|