Date | Commit message |
|
We'll probably be using JSON more in the future, so make
it easier to require in tests.
|
|
Still unstable, this builds off the equally unstable extindex :P
This will be used for caching/memoization of traditional mail
stores (IMAP, Maildir, etc) while providing indexing via Xapian,
along with compression and checksumming from git.
Most notably, this adds the ability to add/remove per-message
keywords (draft, seen, flagged, answered) as described in the
JMAP specification (RFC 8621 section 4.1.1).
We'll use `.' (a single period) as an $eidx_key since it's an
invalid {inboxdir} or {newsgroup} name.
|
|
In an attempt to ensure a coherent UI/UX, we'll try to document
all proposed commands and options in one place for easy reference.
|
|
The start of lei, a Local Email Interface. It'll support a
daemon via FD passing to avoid startup time penalties if
IO::FDPass is installed, but fall back to a slow one-shot mode
if not.
Compared to a traditional socket daemon, FD passing should allow
us to eventually do stuff like run "git show" and still have
proper terminal support for pager and color.
|
|
It's likely most GNU/Linux systems have /etc/machine-id these
days, so anything missing it is likely a *BSD, most of which
support and favor "sysctl -n kern.hostid". We'll also support
"ghostid" since GNU utils are commonly prefixed with 'g' on
non-GNU platforms.
In any case, we'll suppress stderr from missing commands and
fall back to hard coding an $OSNAME-based identifier as a last
resort and hope the hostname is unique.
|
|
So we don't trigger an uninitialized variable warning :x
|
|
It's actually supported by mutt, dovecot[1], and likely some other
software to augment the Status: header. While dovecot doesn't
expose X-Status to clients, mutt will write 'A' (answered) and
'F' (flagged) to X-Status (but not 'T' (draft)).
So we'll drop it like we do Status since it's not suitable for
public mail, but sticking it in an @UNWANTED_HEADERS array will
allow us to configure an override if needed.
[1] https://doc.dovecot.org/configuration_manual/mail_location/mbox/
|
|
extindex treats v1/v2 public inboxes as read-only, so there's
no need to scare people by using the InboxWritable package
now that ->git_dir_n is gone and we can use ->max_git_epoch
instead of ->git_dir_latest.
|
|
There's only one caller, there are unlikely to be any more, and
it should be harmless to open-code.
|
|
Perl readdir detects list context and can return an array
suitable for the grep op. From there, we can rely on
substr to remove the ".git" suffix and integerize the value
to save a few bytes before letting List::Util::max return
the value.
This is how we detect Xapian shards nowadays, too, and
we'll also use defined-or (//) to simplify the return
value there.
We'll also simplify InboxWritable->git_dir_latest,
remove some callers, and consider removing it entirely.
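
The approach above (list the directory, keep only epoch entries,
strip the ".git" suffix, integerize, take the maximum) can be
sketched as follows. This is a Python illustration rather than the
Perl in question; the function name `max_git_epoch` and the
assumption that epochs live as N.git entries directly under the
given directory are for the example only.

```python
import os
import re

# Only names made of digits followed by ".git" count as epochs.
EPOCH_DIR = re.compile(r"\A[0-9]+\.git\Z")


def max_git_epoch(git_dir):
    """Return the highest N among N.git entries in git_dir, or None."""
    try:
        names = os.listdir(git_dir)
    except OSError:
        return None  # missing or unreadable directory
    # Drop the ".git" suffix and integerize before taking the max.
    epochs = [int(n[:-len(".git")]) for n in names if EPOCH_DIR.match(n)]
    return max(epochs, default=None)
```

Returning None for "no epochs found" plays the role that an undefined
value and defined-or (//) play on the Perl side.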
|
|
-index runs on data that's already frozen in git, so there's
no point in warning users about it.
While we're at it, set the {current_info} prefix for v1 as
we do in v2 inboxes in case new problems show up.
|
|
As with the other messages in this callback, there's
nothing we can do about invalid messages ending up in
our Maildirs for -watch.
|
|
Incremental indexing can use the `eidxq' reindexing queue for
handling deletes and resuming interrupted indexing. Ensure
those incremental -extindex invocations do not steal (and
prematurely perform) work that an "-extindex --reindex"
invocation is handling.
|
|
This overdue change fixes {current_info} to not inject a newline
into every warning message.
Simpler code helps us avoid bugs and the need to make
fixes like commit 44de182766037948d62bc2a8ba924de2264dd5fc
("searchidxshard: chomp $eidx_key from pipe").
|
|
When checkpointing and yielding the lock to other processes,
we need to ensure any open DB statement handles are closed,
since they reference and prevent DB FDs from being closed
and unlocked.
And clean up some progress reporting while we're at it.
|
|
Since we're inside a Xapian transaction, calling ->index_raw
followed by ->shard_add_eidx_info calls on the same docid
doesn't seem to hurt indexing performance. It definitely
reduces FS read traffic and IPC from git at the cost of some
more IPC between the parent and workers. Nevertheless, the code
and FD reductions seem worth it.
|
|
--reindex can take many hours or days, so ensure we release
locks according to --batch-size so automated fetch+index
jobs can write new data to indices while we update old data.
|
|
Instead of just working on over.sqlite3, we need to work on
the Xapian DBs as well. While no changes to our Xapian use
have taken place recently, they could in the future and
--reindex exists to account for that.
|
|
--rethread is useful for dealing with bugs and behaves
just like it does with current inboxes.
This is in case our content deduplication logic changes for
whatever reason and causes previously merged messages to be
considered "different". This won't let us deduplicate messages
which were previously considered different, but v2 inboxes do not
allow that, either.
In other words, this makes the --reindex and --rethread
switches of -extindex match the behavior of v2 -index.
|
|
While unlikely to happen, it may be possible for messages
from the same inbox to get indexed multiple times. Provide
consistent results in this case for ease-of-testing.
|
|
In addition to removing stale messages from Xapian, we must
also remove them from over.sqlite3.
|
|
--reindex allows us to catch missed and stale messages due to
-extindex vs -index races prior to commit 02b2fcc46f364b51
("extsearchidx: enforce -index before -extindex").
We'll also rely on reindex to internally deal with v1/v2 inbox
removals and partial-unindexing of messages which are only
removed from one inbox out of many.
This reindex design is completely different from how normal
v1/v2 inbox reindex operates due to extindex having multiple
histories to work with. Instead of scanning git history, this
relies exclusively on comparing over.sqlite3 contents between
the v1/v2 inboxes and the extindex.
Changes to Xapian behavior also get picked up, now. Xapian indexing
is handled by workers with minimal IPC to the parent process.
This results in more read I/O but fewer writes when dealing
with cross-posted messages.
Changes to $smsg->populate and --rethread still need further
work.
|
|
While totally unindexed inboxes are rare, we still support
them for v1 and may hit code which calls this method. Just
return `undef' when ->mm access fails.
|
|
Avoid confusing hackers since this conflicts with a method name
provided by (Search::)Xapian::QueryParser.
|
|
The defined-or `//' operator in 5.10 allows us to golf down
our code slightly.
|
|
We don't actually need Net::Server::Daemonize to support
the --daemonize flag, since the daemonize() sub provided
by N::S::D doesn't exactly do the things we want.
|
|
There's no need to have extra code in the Inbox package for this
or to waste dozens of bytes for every Inbox object which uses
the default value.
This makes our code more flexible w.r.t Inbox-like ExtSearch
objects and fixes uninitialized value warnings with ->ALL.
|
|
These headers can conflict with headers in the DKIM signature;
and parsing the DKIM-Signature header to determine whether or
not we can safely add a header would be more code and CPU
cycles.
Since IMAP seems fine without these headers (and JMAP will
likely be, too), there's likely no need to continue appending
these to every message. Nowadays, developers seem sufficiently
trained to use URLs with Message-IDs in them. So drop the
headers and save some cycles and bandwidth all around.
|
|
If a message can't be found in ->ALL, we shouldn't attempt to
enter code paths which iterate normal inboxes or attempt to
access non-existent fields (e.g. {name}, {newsgroup},
{inboxdir}) in the ExtSearch object.
|
|
We'll be storing private data inside the "" (empty string) key
of the JSON doc we store for manifest.js.gz generation.
This private data will allow us to reduce FS activity at startup
and speed up startup times, but some will also be in Xapian boolean
terms and values for searching and filtering.
|
|
We cannot set xref3 data without the `xnum' column to
tie it to the per-inbox over.sqlite3 DB. So ensure we don't
read brand-new history that only exists in git, but instead
rely on last_commit and last_xap15-$EPOCH metadata in msgmap
to decide how far we can index.
Before this change, it was possible to miss messages in
the extindex if -index did not run (which will be fixable by
upcoming --reindex support in -extindex).
|
|
This should help us detect bugs in our code or storage
synchronization problems more easily. This probably won't
detect corrupted git storage, but can detect corrupted SQLite
files.
"Bad blobs, bad blobs, whatcha gonna do when they come for you?"
|
|
Since extindex is an amalgamation of several inboxes, discerning
an appropriate address for List-Post: would be expensive and
most likely unnecessary. Some legacy/historical inboxes may
have no active address, either, so don't attempt to set the
List-Post header if no addresses are configured.
|
|
The content_hash() call in the same scope may trigger warnings
for a given blob, so ensure we correctly report the blob where
it happens.
|
|
We stopped referring to inboxdirs as "repos" a while ago
since v2 inboxes have multiple git repos associated with them.
So update the name to reflect that and avoid an unnecessary
export that's only used by a test case.
|
|
At least not for resolving inboxes, since there's no good way
for a user to specify what is an inbox or extindex directory
without a command-line switch.
Instead of changing the -extindex command, we change the -index
command internals to rely on the new {-use_cwd} flag to avoid
internal use of negation, since double-negatives and the like
are confusing to me.
|
|
{pi_config} may be confused with the documented `PI_CONFIG'
environment variable, and we'll favor vowel-removal to be
consistent with our usage of object references.
The `pi_' prefix may stay in some places for now, since a
separate namespace may come into this codebase for local/private
client-tooling.
For InboxIdle, we'll also remove an invalid comment about
holding a reference to the PublicInbox::Config object.
|
|
They're PublicInbox::Inbox objects just like the rest of
the non-NNTP code. So rename the NNTP code for consistency
with the rest of the codebase. Furthermore, {ng} and $ng
may be confused with the `--ng' switch for -init, and that's
a non-ref scalar string.
|
|
{ibx} is shorter and is the most prevalent abbreviation
in indexing and IMAP code, and the `$ibx' local variable
is already prevalent throughout.
In general, the codebase favors removal of vowels in variable
and field names to denote references (because references are
"lighter" than non-references).
So update WWW and Filter users to use the same code since
it reduces confusion and may allow easier code sharing.
|
|
User-supplied queries (via PublicInbox::IMAPsearchqp) may
restrict messages to certain UID ranges in addition to the
limits we impose ourselves for mailbox slices. So we'll
continue to ask Xapian::QueryParser to parse "uid:" numeric ranges.
Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
|
|
This improves consistency with sibling methods such as
->shard_remove_eidx_info and ->add_xref3. Passing the
$eidx_key scalar is preferable to the entire $ibx object
for IPC-friendliness.
|
|
Xapian docids have been tied to the over {num} column for
nearly 3 years, now; and OIDs are no longer stored in Xapian
document data. There's no need to increase code and IPC
complexity by passing the OID around.
|
|
There is no need to verify checksums of data already stored in
git. Doing this ourselves also limits flexibility in moving to
other hashes.
|
|
This makes things a little less noisy and will be
called by ExtSearchIdx.
|
|
While "public-inbox-extindex --gc" invocations try to ensure
proper ordering, it is still possible for users to change
the `inboxes' tables via sqlite3(1) or similar means. So
show a "missing://ibx_id=$ibx_id" placeholder to avoid undefined
variable warnings.
URLs such as "imaps://..." will eventually be supported as
eidx_keys, so having a URL-like "missing://" as a placeholder
probably makes sense.
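
A minimal sketch of the placeholder lookup, in Python rather than the
project's Perl. The function name `eidx_key_for` and the plain dict
standing in for the `inboxes' table are hypothetical; only the
"missing://ibx_id=$ibx_id" format comes from the message above.

```python
def eidx_key_for(ibx_id, keys_by_id):
    """Return the eidx_key for ibx_id, or a URL-like placeholder."""
    key = keys_by_id.get(ibx_id)
    if key is not None:
        return key
    # The row was removed behind our back; avoid handing back None.
    return f"missing://ibx_id={ibx_id}"
```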
|
|
INTEGER PRIMARY KEY can be an alias for ROWID in SQLite and is
already unique, so there's no need for a separate UNIQUE(num)
index.
With a smallish ~3K, freshly indexed v2 inbox, this results in a
~40K space savings, reducing over.sqlite3 from 1.375M to 1.335M
(post-VACUUM).
This only affects newly-indexed inboxes; existing DBs will
require manual intervention to take advantage of space savings.
Link: https://www.sqlite.org/rowidtable.html
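
The difference can be observed with a small experiment, sketched here
with Python's sqlite3 module. The table names and the `payload'
column are made up for the demonstration and do not match the real
over.sqlite3 schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Old style: a plain INTEGER column plus an explicit UNIQUE constraint.
db.execute("CREATE TABLE with_unique "
           "(num INTEGER NOT NULL, payload BLOB, UNIQUE (num))")
# New style: the column is an alias for the table's ROWID.
db.execute("CREATE TABLE with_pk "
           "(num INTEGER PRIMARY KEY NOT NULL, payload BLOB)")


def index_names(table):
    """List the names of all indexes SQLite keeps for a table."""
    return [row[1] for row in db.execute(f"PRAGMA index_list({table})")]


unique_indexes = index_names("with_unique")  # one automatic index
pk_indexes = index_names("with_pk")          # no separate index

db.execute("INSERT INTO with_pk (num) VALUES (42)")
rowid, num = db.execute("SELECT rowid, num FROM with_pk").fetchone()
```

The UNIQUE variant carries an automatic index alongside the table,
while the INTEGER PRIMARY KEY variant carries none and `rowid' and
`num' read back as the same value.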
|
|
Since IMAP search (either with Isearch or traditional per-Inbox
search) only returns UIDs, we can safely set the limit to the
UID slice size(*). With isearch, we can also trust the Xapian
result to fit any docid range we specify.
Limiting Xapian results to 1000 was making ->ALL docid <=>
per-Inbox UID mapping impossible since results could overlap between
ranges unpredictably.
Finally, we can map the ->ALL docids into per-Inbox UIDs and
show them to the client in the UID order of the Inbox, not the
docid order of the ->ALL extindex.
This also lets us get rid of the "uid:" query parser prefix
and use the Xapian::Query API directly to reduce our search
prefix footprint.
For mbox.gz downloads in WWW, we'll also make a best effort to
preserve the order from the Inbox, not the order of extindex;
though it's possible large result sets can have non-overlapping
windows.
(*) by definition, UID slice size is a "safe" value which
shouldn't OOM either the server or clients.
|
|
Using "eidx_key:" boolean prefix to limit results to a given
inbox, we can use ->ALL to emulate and replace per-Inbox
xap15/[0-9] search indices.
With this change, the presence of "extindex.all.topdir" in the
$PI_CONFIG will cause the WWW code to use that extindex and
ignore per-inbox Xapian DBs in xap15/[0-9].
Unfortunately IMAP search still requires old per-inbox indices,
for now. Mapping extindex Xapian docids to per-Inbox UIDs and
vice-versa is proving tricky. Fortunately, IMAP search is
rarely used and optional. The RFCs don't specify expensive
phrase search, either, so `indexlevel=medium' can be used in
per-inbox Xapian indices to save space.
For primarily WWW (and future JMAP) users, this should result in
significant disk space, FD, and page cache footprint savings for
large instances with many inboxes and many cross-posted
messages.
|
|
Stop leaking WWW/PSGI-specific logic into classes like
PublicInbox::Inbox, which is used universally.
We'll also decouple $ibx->over from $ibx->search and just
duplicate the code inside ->over to reduce argument
complexity in ->search.
This is also a step in moving away from using {psgi.errors}
to ease code sharing between IMAP, NNTP, and command-line
interfaces. Perl's built-in `warn' and `local $SIG{__WARN__}'
provide all the flexibility we need to control warning output
and should be universally understood by Perl hackers who may
be unfamiliar with PSGI.
|
|
As with NewsWWW and NNTP, we can use ->ALL to completely
avoid trying SQLite/Xapian lookups across hundreds/thousands
of inboxes.
|