Date | Commit message (Collapse) |
|
Indexing any inboxes requires SQLite and msgmap, so don't hide
exceptions if it fails.
|
|
When importing several sources in parallel via http(s) mboxrd,
we need to be able to get keywords of uncommitted documents
directly from shard workers. Otherwise, Xapian DocNotFound
errors happen because the read-only LeiSearch won't see
documents from uncommitted transactions. Keep in mind that it's
possible the keywords can be changed on-the-fly even for
uncommitted documents because of inotify watches from LeiNoteEvent.
|
|
This covers v1 inboxes, as well. We also guard the execution
since "PRAGMA optimize" was only introduced in SQLite 3.18.0
(2017-03-30)
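
A minimal sketch of such a version guard, written in Python rather than the project's Perl. The `msgmap` table layout and the helper names are illustrative assumptions; only the 3.18.0 threshold comes from the message above.

```python
import sqlite3

# "PRAGMA optimize" first appeared in SQLite 3.18.0 (per the commit message).
MIN_OPTIMIZE_VERSION = (3, 18, 0)

def optimize_supported(version_info=sqlite3.sqlite_version_info):
    """Return True when the linked SQLite library knows "PRAGMA optimize"."""
    return tuple(version_info) >= MIN_OPTIMIZE_VERSION

def maybe_optimize(conn):
    """Run "PRAGMA optimize" only on new-enough SQLite; report whether it ran."""
    if not optimize_supported():
        return False
    conn.execute("PRAGMA optimize")
    return True

# Hypothetical stand-in table, only so the pragma has something to look at.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT)")
ran = maybe_optimize(conn)
```

Comparing version tuples avoids the string-comparison trap where "3.9.0" sorts after "3.18.0".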
|
|
The original Msgmap->new API was v1-specific and not necessary.
The ->new_file API now supports an $ibx object being passed to
it, simplifying -no_fsync use. It will also make an upcoming
change easier...
|
|
Avoiding repeated SQL statements brings --gc down to 2-3 minutes
from around 10. We'll also add some checkpoints around over and
xref3 cleanups.
|
|
We need to ensure -extindex --gc runs don't prevent other
work from happening in the meantime. I actually caused
my -extindex to OOM due to the lack of checkpoints :x
We'll also hoist out the shard scanning into its own sub
in preparation for lei/store usage.
|
|
This lets administrators reindex specific time ranges
according to git "approxidate" formats. These arguments
are passed directly to underlying git-log(1) invocations
and may still reach into old epochs.
Since these options rely on git committer dates (which we infer
from the most recent Received: header), they are not guaranteed
to be strictly tied to git history and it's possible to
over/under-reindex some messages. It's probably not a major
problem in practice, though; reindexing a few extra messages
is generally harmless aside from some extra device wear.
Since this currently relies on git-log, these options do not
affect -extindex, yet.
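
A rough Python sketch of passing date strings straight through to git-log(1). The function name and the `--pretty` format are assumptions for illustration; `--since` and `--until` are git-log's own options, and the date strings are left for git's approxidate parser to interpret.

```python
def git_log_argv(since=None, until=None):
    """Build a git-log command line limited to a committer-date range.

    The date strings are not parsed here; git itself interprets them
    ("approxidate"), so values like "2 weeks ago" are passed verbatim.
    """
    argv = ["git", "log", "--pretty=format:%H"]
    if since is not None:
        argv.append("--since=" + since)
    if until is not None:
        argv.append("--until=" + until)
    return argv

argv = git_log_argv(since="2 weeks ago", until="yesterday")
```

Passing an argv list (rather than a shell string) keeps spaces in the date strings from needing any quoting.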
|
|
Xapian bindings may not be installed or may be out-of-date w.r.t. the
Perl version; improve the visibility of errors in those cases.
Cleanup and drop some redundant checks while we're at it.
Cc: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://public-inbox.org/meta/87k0ky5mbd.fsf@toke.dk/
|
|
This default seems closer to reasonable on 64-bit systems which
are the norm these days. 32-bit systems gain 48K so it's an
even 1 MB, but we need to keep 32-bit systems from using too
much since there's still some ancient systems out there with
small inboxes.
|
|
This allows us to simplify callers throughout, and exceptions
can no longer be silently hidden. MiscSearch now uses xap_terms
for looking up eidx_key terms for a code reduction.
We also simplify LeiStore->_msg_kw for runtime use by moving the
MsetIterator handling into t/lei_store.t test case.
|
|
I'm not sure how this happened (only once for me in March), but
it should not happen... In any case, we'll operate on the
lowest numbered docid and cull redundant index entries when
lei/store is open for read-write.
This also fixes the normal lei/store removal path to clean up
the xref3 table (since it's not done automatically for
public-facing -eidx due to the multi-list nature of it).
|
|
This saves some work and makes it easier to set volatile
metadata on a message at import time.
|
|
"lei q" now displays labels in JSON output, "lei mark"
can add or remove labels for any messages.
"lei ls-label" is supported, too.
Unfortunately, "lei q" won't handle "kw:" or "L:" for
external messages; they must be imported, first.
|
|
Only tested for keywords and labels with file inputs, so far;
but it seems to do what it needs to do. There's a bit more
redundant code than I'd like, and more opportunities for code
sharing in the future.
"lei import" will be expanded to support +kw:$KEYWORD and
+L:$LABEL in the future.
|
|
Keyword storage for external-only messages was preventing
messages from being explicitly imported. Teach lei_store
to vivify keyword-only entries into fully-indexed messages
on import.
|
|
"lei q" now preserves changes to per-message keywords across
invocations when its --output (Maildir or mbox) is reused
(with or without --augment).
In the future, these changes will be monitored via inotify,
EVFILT_VNODE or IMAP IDLE, too.
Unfortunately, this currently prevents "lei import" from ever
importing a message that's in an external. That will be fixed
in a future change.
|
|
Since keywords and mailboxes (AKA labels) are separate things in
JMAP; and only keywords can map reliably to Maildir and mbox;
we'll keep them separate in our internal data representations,
too.
I initially wanted to call this just "meta" for "metadata", but
that might be confused with our mailing list name. "metadata"
is already used in Xapian's own API, to add another layer of
confusion.
"tags" was also considered, but probably confusing to notmuch
users since our "labels" are analogous to "tags" in notmuch,
and notmuch doesn't seem to cover "keywords" separately...
So "vmd" it is, since we haven't used this particular
three-letter-abbreviation anywhere before; and "volatile" seems
like a good description of this metadata since everything else
up to this point has been mostly WORM (write-once, read-many).
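
To make the keyword/label split concrete, here is a small Python sketch, not the project's own Perl data structure. The `Vmd` class and its field names are hypothetical; the keyword-to-flag mapping follows the standard Maildir info flags (D, F, R, S), which is why keywords map to Maildir while labels have no such slot.

```python
from dataclasses import dataclass, field

# Standard Maildir info flags for the four keywords named in this log.
MAILDIR_FLAGS = {"draft": "D", "flagged": "F", "answered": "R", "seen": "S"}

@dataclass
class Vmd:
    """Hypothetical container keeping keywords and labels separate."""
    kw: set = field(default_factory=set)      # JMAP-style keywords
    labels: set = field(default_factory=set)  # mailboxes, AKA labels

    def maildir_info(self):
        """Render keywords as a Maildir info suffix; labels are not representable."""
        flags = sorted(MAILDIR_FLAGS[k] for k in self.kw if k in MAILDIR_FLAGS)
        return "2," + "".join(flags)

vmd = Vmd(kw={"seen", "flagged"}, labels={"inbox"})
info = vmd.maildir_info()
```

Maildir expects the flag letters in ASCII order, hence the `sorted()` call.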
|
|
This fixes "m:", "l:", "f:", "t:", "c:", "dfn:", and "n:" search
prefixes under indexlevel=medium when mixed with indexlevel=full
inboxish. We need positional data for Message-IDs, List-Id,
email addresses and filenames for exact matches, though we still
want to support wildcards.
Fortunately the storage cost is still small as these prefixes
tend to be small compared to message bodies. These are NOT
boolean terms since wildcard support and partial matching are
desired.
|
|
We no longer read Xapian docdata and favor hitting over.sqlite3,
instead, as Xapian is less likely to be available than SQLite.
|
|
Having parse_references in OverIdx was awkward and Smsg is
a better place for it.
|
|
Xapian v1.2.21..v1.2.24 failed to set the close-on-exec flag
on the flintlock FD, causing "git cat-file" processes to
hold onto the lock and prevent subsequent Xapian::WritableDatabase
from locking the DB. So cleanup git processes after committing
the miscidx transaction.
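
For readers unfamiliar with the flag involved, a short Python sketch (Unix-only, using a pipe as a stand-in for the lock FD) of what "close-on-exec" means at the fcntl(2) level. This only illustrates the flag; the commit itself works around the affected Xapian versions by cleaning up git processes, not by setting the flag.

```python
import fcntl
import os

def is_cloexec(fd):
    """True if the descriptor will be closed automatically across exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

def set_cloexec(fd):
    """Set FD_CLOEXEC so child programs cannot inherit (and pin) the FD."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

r, w = os.pipe()
os.set_inheritable(w, True)       # simulate an FD opened without close-on-exec
leaked_before = not is_cloexec(w) # a spawned child would inherit it
set_cloexec(w)
leaked_after = not is_cloexec(w)  # now it would be closed on exec
os.close(r)
os.close(w)
```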
|
|
We can more clearly distinguish between v1 and v2-only code
paths this way, and may be able to save a few cycles as well.
|
|
We don't need to be keeping the raw message around after it hits
git. Shard work now relies on Storable (or Sereal) and all of
the indexing code relies on the Email::MIME-like API of Eml to
access interesting parts of the message.
Similarly, smsg->{raw_bytes} is no longer carried around and we
do the CRLF adjustment when setting smsg->{bytes}.
There's also a small simplification to t/import.t while
we're in the area to use xqx instead of spawn/popen_rd.
|
|
We can remove some now-pointless wrapper functions by using
->ipc_do in even more places.
|
|
It's nice to prove the new code works by swapping it into
the current V2Writable / SearchIdxShard packages. This is
only the first step for the core bits, and we'll be able
to delete more code in a subsequent patch.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
* origin/master: (58 commits)
ds: flatten + reuse @events, epoll_wait style fixes
ds: simplify EventLoop implementation
check defined return value for localized slurp errors
import: check for git->qx errors, clearer return values
git: qx: avoid extra "local" for scalar context case
search: remove {mset} option for ->mset method
search: remove pointless {relevance} setting
miscsearch: take reopen from Search and use it
extsearch: unconditionally reopen on access
extindex: allow using --all without EXTINDEX_DIR
extindex: add undocumented --no-scan switch
extindex: enable autoflush on STDOUT/STDERR
extindex: various --watch signal handling fixes
extindex: --watch for inotify-based updates
eml: fix undefined vars on <Perl 5.28
t/config: test --get-urlmatch for git <2.26
default to CORE::warn in $SIG{__WARN__} handlers
inbox: name variable for values loop iterator
inboxidle: avoid needless syscalls on refresh
inboxidle: clue users into resolving ENOSPC from inotify
...
|
|
We'll count the number of log changes (regardless of index or
unindex) and only attach inboxes to ExtSearchIdx objects when
they get new work. We'll also reduce lock bouncing and only
update external indices after all per-inbox indexing is done.
This also updates existing v2 indexing/unindexing callers
to be more consistent and ensures unindex log entries update
per-inbox last commit information.
|
|
This simplifies all ->with_umask callers and opens the
door for further optimizations to delay/elide process spawning.
|
|
This brings -nntpd startup time down from ~35s to ~5s with 50K
inboxes.
Further improvements ought to be possible with deeper changes to
MiscIdx, since -mda having to load every inbox seems unreasonable;
but this general change is fairly unintrusive.
|
|
Values can be strings in Xapian, although we currently use
integer values exclusively. Give the wrapper a more appropriate
name in case we start using string columns.
For future-proofing, we'll now return `undef' on missing columns
and coerce the return value to an IV (integer value) to save
memory, as sortable_unserialise returns a PV (pointer value)
scalar despite it existing to support numeric values.
|
|
This reduces differences between v1 and v2 code, and
introduces ->xdb_shards_flat to provide read-only access
to shards without using Xapian::MultiDatabase. This
will allow us to combine shards of several inboxes
AND extindexes for lei.
|
|
Still unstable, this builds off the equally unstable extindex :P
This will be used for caching/memoization of traditional mail
stores (IMAP, Maildir, etc) while providing indexing via Xapian,
along with compression, and checksumming from git.
Most notably, this adds the ability to add/remove per-message
keywords (draft, seen, flagged, answered) as described in the
JMAP specification (RFC 8621 section 4.1.1).
We'll use `.' (a single period) as an $eidx_key since it's an
invalid {inboxdir} or {newsgroup} name.
|
|
-index runs on data that's already frozen in git, so there's
no point in warning users about it.
While we're at it, set the {current_info} prefix for v1 as
we do in v2 inboxes in case new problems show up.
|
|
Since we're inside a Xapian transaction, calling ->index_raw
followed by ->shard_add_eidx_info calls on the same docid
doesn't seem to hurt indexing performance. It definitely
reduces FS read traffic and IPC from git at the cost of some
more IPC between the parent and workers. Nevertheless, the code
and FD reductions seem worth it.
|
|
--reindex allows us to catch missed and stale messages due to
-extindex vs -index races prior to commit 02b2fcc46f364b51
("extsearchidx: enforce -index before -extindex").
We'll also rely on reindex to internally deal with v1/v2 inbox
removals and partial-unindexing of messages which are only
removed from one inbox out of many.
This reindex design is completely different than how normal
v1/v2 inbox reindex operates due to extindex having multiple
histories to work with. Instead of scanning git history, this
relies exclusively on comparing over.sqlite3 contents between
the v1/v2 inboxes and the extindex.
Changes to Xapian behavior also get picked up, now. Xapian indexing
is handled by workers with minimal IPC to the parent process.
This results in more read I/O but fewer writes when dealing
with cross-posted messages.
Changes to $smsg->populate and --rethread still need further
work.
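
The "compare over.sqlite3 contents" idea can be sketched in Python as a set difference between two SQLite databases. The `msgs` table and `blob_oid` column are hypothetical stand-ins (the real over.sqlite3 schema is not shown in this log), and the sketch handles a single inbox where the real design deals with many.

```python
import sqlite3

def make_db(blob_oids):
    """Build an in-memory stand-in for an over.sqlite3 file (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msgs (blob_oid TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO msgs VALUES (?)", [(o,) for o in blob_oids])
    return conn

def oids(conn):
    """Collect every blob OID recorded in one database."""
    return {row[0] for row in conn.execute("SELECT blob_oid FROM msgs")}

def plan_reindex(inbox_conn, ext_conn):
    """Return (missed, stale): OIDs to add to, and to drop from, the extindex."""
    inbox, ext = oids(inbox_conn), oids(ext_conn)
    return sorted(inbox - ext), sorted(ext - inbox)

missed, stale = plan_reindex(make_db(["a", "b", "c"]), make_db(["b", "c", "d"]))
```

No git history is walked here; both sides are plain table scans, which matches the "more read I/O but fewer writes" trade-off described above.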
|
|
This should help us detect bugs in our code or storage
synchronization problems more easily. This probably won't
detect corrupted git storage, but can detect corrupted SQLite
files.
"Bad blobs, bad blobs, whatcha gonna do when they come for you?"
|
|
Xapian docids have been tied to the over {num} column for
nearly 3 years, now; and OIDs are no longer stored in Xapian
document data. There's no need to increase code and IPC
complexity by passing the OID around.
|
|
Inboxes may be removed or newsgroups renamed over time.
Introduce a switch to do garbage collection and eliminate stale
search and xref3 results based on inboxes which remain in the
config file.
This may also fix up stale results left over from any bugs which
may leave stale data around.
This is also useful in case a clumsy BOFH (me :P) is swapping
between several PI_CONFIGs and accidentally indexes a bunch of
inboxes they didn't intend to.
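
A minimal Python sketch of config-driven garbage collection. The `xref` table, its columns and the sample keys are hypothetical (not the project's actual xref3 schema); the point is only that rows keyed by an inbox absent from the config get dropped.

```python
import sqlite3

def gc_stale(conn, configured_keys):
    """Delete rows whose eidx_key is no longer in the config; return those keys."""
    configured = set(configured_keys)
    stale = sorted(
        key for (key,) in conn.execute("SELECT DISTINCT eidx_key FROM xref")
        if key not in configured
    )
    # One parameterized DELETE per stale key avoids SQLite's bound-variable limit.
    conn.executemany("DELETE FROM xref WHERE eidx_key = ?", [(k,) for k in stale])
    conn.commit()
    return stale

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xref (docid INTEGER, eidx_key TEXT)")
conn.executemany(
    "INSERT INTO xref VALUES (?, ?)",
    [(1, "/srv/a"), (1, "/srv/gone"), (2, "/srv/b"), (3, "/srv/gone")],
)
removed = gc_stale(conn, ["/srv/a", "/srv/b"])
remaining = conn.execute("SELECT COUNT(*) FROM xref").fetchone()[0]
```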
|
|
v1 and v2 inbox indexing now supports graceful shutdown checks
just like ExtSearchIdx. Additionally, we'll perform quit checks
at the top of loops for consistency.
Interaction with the --xapian-only and --sequential-shard
options is a bit lacking, and will warn the user to use
"--reindex --xapian-only" to fix.
|
|
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs could be possible;
but would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime, instead.
|
|
The initial "git log" invocation for a git epoch can be time
consuming, so check for graceful shutdown at each line to ensure
timely shutdowns and avoid SSD/HDD wear.
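
The per-line quit check could look roughly like this in Python. The names `scan_log`, `handle` and `stop` are assumptions; in a real program a signal handler would call `stop.set()`.

```python
import threading

def scan_log(lines, handle, stop):
    """Process log lines one by one, checking for shutdown before each line.

    Returns the number of lines handled, so a later run knows where this
    one stopped.
    """
    handled = 0
    for line in lines:
        if stop.is_set():  # quit check at the top of the loop
            break
        handle(line)
        handled += 1
    return handled

stop = threading.Event()
seen = []

def handle(line):
    seen.append(line)
    if len(seen) == 2:
        stop.set()  # stands in for a shutdown signal arriving mid-scan

count = scan_log(["c1", "c2", "c3", "c4"], handle, stop)
```

Checking before each line bounds the shutdown delay by the cost of handling a single line instead of the whole epoch.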
|
|
This will set us up for supporting graceful shutdown
on -index without repeating any work.
|
|
In case of other bugs or intentional corruption of over.sqlite3,
we don't want to attempt dereferencing a non-ref scalar when
calling ->mid_delete in the fallback code path.
Noticed while chasing another bug in extindex development...
|
|
This seems necessary for some cross-posted messages (and we did
it historically before we used over.sqlite3).
|
|
It doesn't seem worth storing xref3 data in Xapian now that
the same info is in over.sqlite3.
|
|
In case we want to reuse code with ExtSearchIdx or V2Writable.
|
|
This will let us work consistently with both existing inboxes
and external indices.
|
|
We'll be needing it in ExtSearchIdx for the next commit.
|
|
Since we store {ibx} in $sync state, we no longer have to
pass it as an argument to log2stack.
|