Date | Commit message (Collapse) |
|
Some yak-shaving while I try to track down other bugs...
|
|
`dt:' documentation is redundant with `d:' approxidate support;
so drop `dt:' since mairix uses `d:'. We'll also document
`rt:' since there are legit messages from senders with broken
clocks.
Reduce indentation level of help texts to be in 2-space
increments to avoid using too much horizontal space.
We'll always place IMAP ahead of NNTP since it's alphabetical
and there are likely more IMAP clients out there.
Add "--ng NEWSGROUP" to -init instructions if configured.
There are also some minor wording changes throughout.
|
|
Inbox->xdb does not exist, but this code path was apparently
never tested :x I noticed this on a basic v2 inbox, but it could
happen with any v1/v2 inbox. Move ->num2docid into Search
so it's less awkward to use.
|
|
The cost of opening a Xapian DB (even with shards) isn't high,
so save some FDs and just close it. We hit Xapian far less than
over.sqlite3 and we discard the MSet ASAP even when streaming
large responses.
This simplifies our code a bit and hopefully helps reduce
fragmentation by increasing mortality of late allocations.
|
|
Xapian::QueryParser is attached to the Xapian::Database,
so holding onto the QueryParser was preventing us from
releasing DB handles if a query was performed.
|
|
`undef' entries still take up a slot in the hash table, and
cause the `exists' check to false-positive in ->cleanup_shards.
This should fully fix the (innocuous) messages introduced in
commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
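
The `exists' pitfall is specific to Perl hashes, but a rough Python
analogue may help illustrate it. This is only a sketch, not the
project's code: `shards`, `has_entry` and `drop_empty` are hypothetical
names, and `None` stands in for Perl's `undef'.

```python
# None plays the role of Perl's undef: the key still occupies a slot.
shards = {"inbox-a": object(), "inbox-b": None}

def has_entry(table, key):
    """Membership test, comparable to Perl's `exists $h{$key}`."""
    return key in table

def drop_empty(table):
    """Remove placeholder entries so membership tests stop matching them."""
    return {k: v for k, v in table.items() if v is not None}
```

With the placeholder left in, `has_entry(shards, "inbox-b")` is true even
though there is nothing useful stored; deleting the key outright avoids
the false positive.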
|
|
Neither Inboxes nor ExtSearch objects were retrying correctly
when there are live git processes, but the inboxes were getting
rescanned for search or other reasons. Ensure the scan retries
eventually if there are live processes.
We also need to update the cleanup task to detect Xapian shard
count changes, since Xapian ->reopen is enough to detect any
other Xapian changes. Otherwise, we just issue an inexpensive
->reopen call and let Xapian check whether there's anything
worth reopening.
This also lets us eliminate the Devel::Peek dependency.
|
|
It's needless noise in syslogs for daemons and unnecessarily
alarming to users on the command-line.
|
|
While git respects a user's local timezone and returns
seconds-since-the-Epoch, we were unnecessarily and incorrectly
calling gmtime+strftime on its result. So skip calling
gmtime+strftime when the strftime format is "%s", and just feed
the output time from git directly to Xapian.
This is mainly for lei, which will likely run in a variety of
timezones. While we're at it, add a recommendation to use
TZ=UTC in public-inbox-httpd, in case there are (misguided :P)
sysadmins who set a non-UTC TZ.
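
The "%s" special case can be sketched roughly as follows. This is an
illustration in Python rather than the project's Perl, and
`git_time_to_term` is a hypothetical name; it only assumes that git
hands back an integer number of seconds since the Epoch.

```python
import time

def git_time_to_term(epoch_seconds, fmt):
    """Format a git timestamp for indexing (hypothetical helper)."""
    if fmt == "%s":
        # Already seconds-since-the-Epoch: pass it through untouched.
        # Round-tripping through gmtime + strftime("%s") is unnecessary,
        # and where "%s" treats the broken-down time as local time it
        # may shift the value by the UTC offset.
        return str(epoch_seconds)
    # Other formats still need a broken-down UTC time.
    return time.strftime(fmt, time.gmtime(epoch_seconds))
```

Only the "%s" branch changes behavior; date-style formats such as
"%Y%m%d" still go through gmtime.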
|
|
Since extindex uses Xapian shards in a similar way to
v2 inboxes, we'll support -xcpdb (reshard+upgrade) and
-compact all the same to give admins tuning+upgrade
options.
|
|
This allows us to simplify callers throughout, and exceptions
can no longer be silently hidden. MiscSearch now uses xap_terms
for looking up eidx_key terms for a code reduction.
We also simplify LeiStore->_msg_kw for runtime use by moving the
MsetIterator handling into t/lei_store.t test case.
|
|
Xapian DBs may be modified by a parallel process while we're
reading them, and Xapian's MVCC model places the burden on readers
to retry operations.
We'll also have retry_reopen croak instead of die on errors,
which ought to help us track down some "Document not found"
errors I've occasionally seen when using "lei <q|up>".
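
A reader-side retry loop of the kind described might look like the
sketch below. It is written in Python for illustration only: the
`DatabaseModifiedError` class here is a local stand-in for the error a
reader gets when the DB changes underneath it, `max_tries` is an assumed
parameter, and `db` is any object with a `reopen()` method.

```python
class DatabaseModifiedError(Exception):
    """Stand-in for the 'DB changed while reading' error."""

def retry_reopen(db, operation, max_tries=10):
    """Run operation(db), reopening and retrying if the DB was modified."""
    for _ in range(max_tries):
        try:
            return operation(db)
        except DatabaseModifiedError:
            # A writer committed while we were reading: pick up the
            # latest revision and try the whole operation again.
            db.reopen()
    # Any other exception propagates to the caller unmodified.
    raise RuntimeError(f"gave up after {max_tries} attempts")
```

The key point is that the whole read operation is retried, not just the
call that happened to fail.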
|
|
If a user specifies "d:" with a higher precision than it was
traditionally able to handle, switch transparently to "dt:".
This lowers the learning curve and improves DWIM-ness.
v2: fix "d:YYYYMMDD..$NEEDS_APPROXIDATE" case
|
|
"lei q" now displays labels in JSON output, "lei mark"
can add or remove labels for any messages.
"lei ls-label" is supported, too.
Unfortunately, "lei q" won't handle "kw:" or "L:" for
external messages; they must be imported first.
|
|
These have been confusing to me in the past, too.
|
|
So far, searching by size has never been publicly documented,
and IMHO, of questionable utility. In any case, "z:" is what
mairix(1) uses, so it may be familiar to existing mairix users
(I've never used this prefix myself).
So far, this prefix is only used internally in tests and in
auto-translated queries from IMAP; thus this incompatible change
is unlikely to affect anyone.
|
|
The cleanup doesn't seem to matter, I initially thought I needed
to handle "" (two double quotes) explicitly because that's what
Xapian does to escape a double quote inside a double-quoted
phrase. It turns out we only need to be able to pass phrases
through to Xapian unmodified, and the existing group of
["\x{201c}\x{201d}] is sufficient for our purposes.
|
|
This is for consistency with --stdin and WWW front ends
which can't distinguish between phrase searches and
prefix ranges used for d:/dt:/rt:.
In any case, I expect users on the lei command-line are more
likely to use `5.days.ago' instead of `"5 days ago"'
|
|
This greatly improves the usability of d:, dt:, and rt: search
prefixes for users already familiar with git's "approxidate" feature.
That is, users familiar with the --(since|after|until|before)=
options in git-log(1) and similar commands will be able to use
those dates in the WWW UI.
|
|
This fixes both an old bug in "lei q" argv handling and one
recent regression introduced with the change to use approxidate.
Field prefixes are also handled correctly inside parenthesized
statements when the field follows "(" without a separation
character.
Fixes: fbb7ccabbf54a405 ("lei q: use git approxidate with d:, dt: and rt: ranges")
|
|
Order doesn't matter when users are completely downloading
mboxrds onto the FS and then opening them with an MUA. The
MUA is expected to sort the results in the user's preferred
order.
However, lei can start streaming the results to its destination
Maildir (or eventually IMAP/JMAP mailbox) with an MUA already
open. This will let users see recent results sooner in their
MUA, as those tend to have a higher docid. This matches the
behavior of the HTML results, as well.
As a bonus, this is ~5% faster in a one-off, informal
test case with 66k results. I expect this to hold true in
all cases since git has always optimized storage to favor recent
objects.
|
|
This is necessary to avoid slowdowns with pathological cases
with many dates in the query, since each rev-parse invocation
takes ~5ms.
This is immeasurably slower with one open-ended range, but
already faster with any closed range featuring two dates which
require parsing via git.
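
To illustrate the batching idea: git-rev-parse(1) documents that
`--since=<datestring>` is echoed back as a `--max-age=<epoch>` line, and
several such options can be passed to one invocation. The Python sketch
below assumes that behavior; the function names are hypothetical and
this is not the project's implementation.

```python
import subprocess

def build_rev_parse_argv(dates):
    """One argv covering every date, instead of one process per date."""
    return ["git", "rev-parse"] + [f"--since={d}" for d in dates]

def parse_max_age_lines(output):
    """Extract epoch seconds from the `--max-age=N` lines git prints."""
    prefix = "--max-age="
    return [int(line[len(prefix):])
            for line in output.splitlines()
            if line.startswith(prefix)]

def approxidate_batch(dates):
    """Resolve all dates with a single rev-parse invocation."""
    proc = subprocess.run(build_rev_parse_argv(dates), check=True,
                          capture_output=True, text=True)
    return parse_max_age_lines(proc.stdout)
```

With N dates in a query, this pays the per-process cost once rather than
N times.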
|
|
Instead of having --(sent|received)-(before|after)=s
command-line switches, we'll just try to make sense of argv so
it's usable within parenthesized statements and such.
Given the negligible performance penalty with Inline::C
process spawning, we'll probably wire this up to the
WWW interface, too.
"d:" is for mairix compatibility. I don't know if "dt:" and
"rt:" will be too useful, but they exist because of IMAP
(and JMAP).
|
|
Nobody is expected to use long options, but for consistency
with mairix(1), we'll use the pluralized option throughout
(including existing PublicInbox::{Search,SearchView}).
Link: https://public-inbox.org/meta/20210206090119.GA14519@dcvr/
|
|
This isn't tested for now, so maybe it works.
|
|
Meaning "Received time", as it is the best description of the
value we use from the "Received:" header, if present. JMAP
calls it "receivedAt", but "rt:" seems like a better
abbreviation being in line with "dt:" for the "Date" header.
"Timestamp" ("ts") was potentially ambiguous given the presence
of the "Date" header.
|
|
Parallelism and interactivity with pager + SIGPIPE need work;
but results are shown and phrase search works without shell
users having to apply Xapian quoting rules on top of standard
shell quoting.
|
|
The default $QP_FLAGS won't be set until after Xapian is
loaded, duh...
This fixes t/imapd.t with TEST_RUN_MODE=0
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
* origin/master: (58 commits)
ds: flatten + reuse @events, epoll_wait style fixes
ds: simplify EventLoop implementation
check defined return value for localized slurp errors
import: check for git->qx errors, clearer return values
git: qx: avoid extra "local" for scalar context case
search: remove {mset} option for ->mset method
search: remove pointless {relevance} setting
miscsearch: take reopen from Search and use it
extsearch: unconditionally reopen on access
extindex: allow using --all without EXTINDEX_DIR
extindex: add undocumented --no-scan switch
extindex: enable autoflush on STDOUT/STDERR
extindex: various --watch signal handling fixes
extindex: --watch for inotify-based updates
eml: fix undefined vars on <Perl 5.28
t/config: test --get-urlmatch for git <2.26
default to CORE::warn in $SIG{__WARN__} handlers
inbox: name variable for values loop iterator
inboxidle: avoid needless syscalls on refresh
inboxidle: clue users into resolving ENOSPC from inotify
...
|
|
While a single extindex combines multiple inboxes into a single
search index, extindex still requires up-front indexing on items
which can be searched. XSearch has no on-disk footprint itself
and uses Xapian DBs of existing publicinbox and extindex
("extinbox") exclusively.
XSearch still suffers from the multi-shard Xapian scalability
problems which led to the creation of extindex, but I expect the
number of shards to remain relatively low.
I envision users hosting public-inbox instances on their
workstations will only have two extindexes combined by this, one
read-only extindex for serving public archives, and one
read-write extindex managed by LeiStore for private mail.
|
|
The ->mset method always returns a Xapian mset nowadays, so
naming a parameter {mset} is too confusing. As it does with
MiscSearch, setting the {relevance} parameter to -1 now sorts by
ascending docid order. -2 is now supported for descending
docid order, too, since it may be useful for lei users.
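
The meaning of the two special {relevance} values can be sketched on a
plain list of docids. This is illustrative Python, not the project's
code: `order_docids` is a hypothetical name, and in practice the
ordering would presumably be requested from Xapian rather than applied
afterwards.

```python
def order_docids(docids, relevance):
    """Apply the docid ordering implied by a {relevance} value."""
    if relevance == -1:
        return sorted(docids)                 # ascending docid order
    if relevance == -2:
        return sorted(docids, reverse=True)   # descending docid order
    # Any other value: leave the order as given (e.g. by rank or date).
    return list(docids)
```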
|
|
SearchView will set it to `undef', others will set the 'mset'
option (for the ->mset method :P) to 2 which causes {relevance}
to be ignored.
And the 'mset' option is poorly named now that the method
is named ->mset...
|
|
This brings -nntpd startup time down from ~35s to ~5s with 50K
inboxes.
Further improvements ought to be possible with deeper changes to
MiscIdx, since -mda having to load every inbox seems unreasonable;
but this general change is fairly unintrusive.
|
|
This reduces differences between v1 and v2 code, and
introduces ->xdb_shards_flat to provide read-only access
to shards without using Xapian::MultiDatabase. This
will allow us to combine shards of several inboxes
AND extindexes for lei.
|
|
Perl readdir detects list context and can return an array
suitable for the grep op. From there, we can rely on
substr to remove the ".git" suffix and integerize the value
to save a few bytes before letting List::Util::max return
the value.
This is how we detect Xapian shards nowadays, too, and
we'll also use defined-or (//) to simplify the return
value there.
We'll also simplify InboxWritable->git_dir_latest,
remove some callers, and consider removing it entirely.
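
The readdir + grep + substr + max approach translates roughly to the
Python sketch below. It assumes epoch directories are named `N.git`
under a single directory; `max_epoch` and its `git_root` argument are
hypothetical names, and `None` plays the role of the defined-or
fallback.

```python
import os
import re

def max_epoch(git_root):
    """Return the highest N among N.git entries, or None if there are none."""
    try:
        names = os.listdir(git_root)
    except FileNotFoundError:
        return None
    # Keep only all-digit names ending in ".git", strip the 4-character
    # suffix, and integerize before taking the maximum.
    epochs = [int(n[:-4]) for n in names if re.fullmatch(r"[0-9]+\.git", n)]
    return max(epochs, default=None)
```

Integerizing first matters: comparing the names as strings would rank
"9.git" above "10.git".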
|
|
User-supplied queries (via PublicInbox::IMAPsearchqp) may
restrict messages to certain UID ranges in addition to the
limits we impose ourselves for mailbox slices. So we'll
continue to ask Xapian::QueryParser to handle "uid:" numeric ranges.
Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
|
|
Since IMAP search (either with Isearch or traditional per-Inbox
search) only returns UIDs, we can safely set the limit to the
UID slice size(*). With isearch, we can also trust the Xapian
result to fit any docid range we specify.
Limiting Xapian results to 1000 was making the ->ALL docid <=>
per-Inbox UID mapping impossible since results could overlap between
ranges unpredictably.
Finally, we can map the ->ALL docids into per-Inbox UIDs and
show them to the client in the UID order of the Inbox, not the
docid order of the ->ALL extindex.
This also lets us get rid of the "uid:" query parser prefix
and use the Xapian::Query API directly to reduce our search
prefix footprint.
For mbox.gz downloads in WWW, we'll also make a best effort to
preserve the order from the Inbox, not the order of extindex;
though it's possible large result sets can have non-overlapping
windows.
(*) by definition, UID slice size is a "safe" value which
shouldn't OOM either the server or clients.
|
|
Using "eidx_key:" boolean prefix to limit results to a given
inbox, we can use ->ALL to emulate and replace per-Inbox
xap15/[0-9] search indices.
With this change, the presence of "extindex.all.topdir" in the
$PI_CONFIG will cause the WWW code to use that extindex and
ignore per-inbox Xapian DBs in xap15/[0-9].
Unfortunately IMAP search still requires old per-inbox indices,
for now. Mapping extindex Xapian docids to per-Inbox UIDs and
vice-versa is proving tricky. Fortunately, IMAP search is
rarely used and optional. The RFCs don't specify expensive
phrase search, either, so `indexlevel=medium' can be used in
per-inbox Xapian indices to save space.
For primarily WWW (and future JMAP) users; this should result in
significant disk space, FD, and page cache footprint savings for
large instances with many inboxes and many cross-posted
messages.
|
|
There's no need to export it, as shown by the change to
SearchView. This should pave the way to making search
more flexible and allow per-Inbox search to reuse ->ALL.
|
|
Every callback uses `$self', and creating short-lived
array references is not necessary when it's just as
easy to copy the array in Perl (unlike C).
|
|
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs could be possible;
but would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime, instead.
|
|
We can simplify callers by using $self->{xpfx} instead of
passing another arg on the stack.
|
|
This will provide a similar API to PublicInbox::Inbox for
read-only WWW, -imapd, and -nntpd interfaces.
|
|
We'll be using this in detached (ext) Xapian indexes
in cross inbox search.
|
|
This switch is still undocumented, but we can reduce the scope
of our Xapian docdata dependency by moving its only caller to
SearchIdx. This reduces the amount of code loaded by read-only
code paths.
|
|
We'll also fix the read-only code to ensure we notice missing
Xapian shards, since gaps would throw off our expectation that
Xapian document IDs and NNTP article numbers are interchangeable.
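
One way to notice gaps is to compare the numbered shard directories
against a contiguous 0..N range. The Python sketch below assumes shards
are stored as directories named 0, 1, 2, ... alongside other files;
`missing_shards` is a hypothetical helper, not the project's check.

```python
import re

def missing_shards(names):
    """Return shard numbers absent from an otherwise contiguous 0..max run."""
    nums = sorted(int(n) for n in names if re.fullmatch(r"[0-9]+", n))
    if not nums:
        return []
    present = set(nums)
    # Every number from 0 up to the highest shard seen must exist.
    return [i for i in range(nums[-1] + 1) if i not in present]
```

A check like this cannot detect a missing highest-numbered shard on its
own; that would need the expected shard count from elsewhere.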
|
|
Only inbox accesses the read-only {over}, now, instead of going
through ->search. This simplifies our object graph and avoids
potentially redundant FDs and DB handles pointing to the same
over.sqlite3 file.
|
|
Nearly all of the search uses in the production code rely on
a Xapian mset iterator being returned (instead of an array
of $smsg objects). So default to returning the mset and move
the burden of smsg array conversion into the test cases.
|
|
The special case (if any) belongs at a higher level,
and this is another step towards removing {over_ro}-dependence
in our Search object.
|