Date | Commit message (Collapse) |
|
External Xapian helper processes need to support non-standard
QueryParser prefixes. The only way to do this is to specify
these prefixes in every `mset' request since we have no idea
if the XH worker servicing the request has initialized the
extra prefixes, yet.
|
|
Only public-facing daemons use it, currently, and all
public-facing daemons will pre-spawn it as early as feasible.
lei will need it eventually to handle queries requiring C++,
but I'm not certain what path to take with lei, yet...
|
|
Xapian helper processes are disabled by default once again.
However, they can be enabled via the new `-X INTEGER' parameter.
One big positive is the Xapian helpers being spawned by the
top-level daemon means they can be shared freely across all
workers for improved load balancing and memory reduction.
|
|
We need to be able to handle resource limitation errors in
public-facing daemons.
|
|
The C++ version of xap_helper will allow more complex and
expensive queries. Both the Perl and C++-only version will
allow offloading search into a separate process which can be
killed via ITIMER_REAL or RLIMIT_CPU in the face of overload.
The xap_helper `mset' command wrapper is simplified to
unconditionally return rank, percentage, and estimated matches
information. This may slightly penalize mbox retrievals and
lei users, but perhaps that can be a different command entirely.
|
|
Retrieving Xapian document terms, data (and possibly values) and
transferring to the Perl side would be an increase in complexity
and I/O both the Perl and C++ sides. It would require more I/O
in C++ and transient memory use on the Perl side where slow mset
iteration gives an opportunity to dictate memory release rate.
So lets ignore the document-related stuff here for now for
ease-of-development. We can reconsider this change if dropping
Xapian Perl bindings entirely and relying on JAOT C++ ever
becomes a possibility.
|
|
Xapian has always sorted termlist iterators, so we now:
1) break out of the iterator loop early on non-matches
2) avoid doing sorting ourselves
As a result, we'll also favor the wantarray forms of xap_terms
and all_terms to preserve sort order in most cases.
Confirmed by the Xapian maintainer: <20231201184844.GO4059@survex.com>
Link: https://lists.xapian.org/pipermail/xapian-discuss/2023-December/010013.html
|
|
This is a major step in solving the problem of having to
manually associate hundreds/thousands of coderepos with
hundreds/thousands of public-inboxes to power solver
(and more).
|
|
The C++ version will allow us to take full advantage of Xapian's
APIs for better queries, and the Perl bindings version can still
be advantageous in the future since we'll be able to support
timeouts effectively.
|
|
Fixes: 738c4a65719e ("www: various help text updates")
|
|
The Xapian SWIG bindings are favored by Xapian upstream for
ease-of-maintenance compared to the XS version. While Debian
lags on this front, the SWIG bindings are widely available
on all *BSDs.
|
|
This allows us to perform the expensive "dump_ibx" operations in
native C++ code using the Xapian C++ library. This provides the
majority of the speedup with the -cindex --associate switch.
Eventually this may be expanded to cover all uses of Xapian
within the project to ensure we have access to Xapian APIs which
aren't available in XS|SWIG bindings; and also for
ease-of-installation on systems which don't provide
pre-packaged Perl Xapian bindings (e.g. OpenBSD 7.3) but
do provide Xapian development libraries.
Most of the C++ code is still C, as I'm not remotely familiar
with C++ compared to C. I suspect many users and potential
hackers being from git, Linux kernel, and glibc world are in the
same boat.
|
|
This will eventually allow associating coderepos with inboxes
and vice-versa; avoiding the need for manual configuration via
tedious publicinbox.*.coderepo directives.
I'm not sure how this should be stored for WWW, yet, but it's
required since it takes about 8 hours to do this fully across
lore and git.kernel.org.
|
|
This will be useful for internal tooling and APIs.
|
|
The ->allterms_{begin,end} methods of Xapian::Database already
filter match on prefix natively. Thus there's no need to do
filtering ourselves (unlike per-document ->termlist_{begin/end})
|
|
Reusing this bit seems to make sense as mail and code search
are similar enough w.r.t. setting up sort options. This
deduplication will become more useful as -cindex will
likely combine code and mail search to generate associations
between inboxes and code repos.
|
|
Add some comments about various usages of xdb_shards_flat and
mset since the addition of CodeSearch (and other search things)
subclassing it may become confusing.
Since we're in the area, we can also avoid an extra hash
lookups/initializations and reduce Perl ops in various places.
|
|
This allows filtering the contents of any existing thread using
a search query. It uses the existing THREADID column in Xapian
so we can internally add a Xapian OP_FILTER to the results.
This new functionality is orthogonal to the existing `t=1'
parameter which gives mairix-style thread expansion. It doesn't
make sense to use `t=1' with this functionality, but it's not
disallowed, either.
The indentation change in Over->next_by_mid is to ensure
DBI->prepare_cached can share across both ->next_by_mid
and ->mid2tid.
I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was
allowing extra characters. With an added \z, it's now as strict
was originally intended and AFAIK nothing was generating invalid
URLs for it
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/
|
|
It seems relying on root commits is a reasonable way to
deduplicate and handle repositories with common history.
I initially wanted to shoehorn this into extindex, but decided a
separate Xapian index layout capable of being EITHER external to
handle many forks or internal (in $GIT_DIR/public-inbox-cindex)
for small projects is the right way to go.
Unlike most existing parts of public-inbox, this relies on
absolute paths of $GIT_DIR stored in the Xapian DB and does not
rely on the config file. We'll be relying on the config file to
map absolute paths to public URL paths for WWW.
|
|
This will be used for code_search, too.
|
|
dt: is higher resolution and the YYYYMMDD column will be dropped
if there's ever another SCHEMA_VERSION update. While the
upcoming code repo index is independent of the mail schemas,
it'll use similar query prefixes and likely use d:/dt: for
Author Date of git commits.
|
|
The Xapian query transformation and Enquire object setup aren't
subject to MVCC and retries, so move it outside the retry loop
to save some cycles in case we need to retry on a busy DB.
|
|
While having Xapian collapse threads is an easy way to reduce
the amount of deduplication work we need to do when writing
out threads; we can't rely on it when using `lei q -tt` since
that needs to flag all hits.
Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Link: https://public-inbox.org/git/Y+pgBmj0jxR+cVkD@mail.gmail.com/
|
|
This will be used to speed up NNTP group listings and IMAP startup
with thousands of inboxes.
|
|
Noticed-by: Kyle Meyer <kyle@kyleam.com>
|
|
This allows easy searching via patch-id from a git commit.
Currently, abbreviations are not supported, and it seems
needless to support them since AFAIK (git) doesn't generate
nor resolve abbreviated patch-ids anywhere.
|
|
The new Documentation/common.perl file will be used for
all manpages in the future.
|
|
Some yak-shaving while I try to track down other bugs...
|
|
`dt:' documentation is redundant with `d:' approxidate support;
so drop `dt:' since mairix uses `d:'. We'll also document
`rt:' since there are legit messages from senders with broken
clocks.
Reduce indentation level of help texts to be in 2-space
increments to using too much horizontal space.
We'll always place IMAP ahead of NNTP since it's alphabetical
and there's likely more IMAP clients out there.
Add "--ng NEWSGROUP" to -init instructions if configured.
There's also some minor wording changes throughout.
|
|
Inbox->xdb does not exist, but this code path was apparently
never tested :x I noticed this on basic v2 inbox, but it could
happen with any v1/v2 inbox. Move ->num2docid into Search
so it's less awkward to use.
|
|
The cost of opening a Xapian DB (even with shards) isn't high,
so save some FDs and just close it. We hit Xapian far less than
over.sqlite3 and we discard the MSet ASAP even when streaming
large responses.
This simplifies our code a bit and hopefully helps reduce
fragmentation by increasing mortality of late allocations.
|
|
Xapian::QueryParser is attached to the Xapian::Database,
so holding onto the QueryParser was preventing us from
releasing DB handles if a query was performed.
|
|
`undef' entries still take up a slot in the hash table, and
cause the `exists' check to false-positive in ->cleanup_shards.
This should fully fix the (innocuous) messages introduced in
commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
|
|
Neither Inboxes nor ExtSearch objects were retrying correctly
when there are live git processes, but the inboxes were getting
rescanned for search or other reasons. Ensure the scan retries
eventually if there's live processes.
We also need to update the cleanup task to detect Xapian shard
count changes, since Xapian ->reopen is enough to detect any
other Xapian changes. Otherwise, we just issue an inexpensive
->reopen call and let Xapian check whether there's anything
worth reopening.
This also lets us eliminate the Devel::Peek dependency.
|
|
It's needless noise in syslogs for daemons and unnecessarily
alarming to users on the command-line.
|
|
While git respects a user's local timezone and returns
seconds-since-the-Epoch, we were unnecessarily and incorrectly
calling gmtime+strftime on its result. So ignore calling
gmtime+strftime when the strftime format is "%s", just feed
the output time from git directly to Xapian.
This is mainly for lei, which will likely run in a variety of
timezones. While we're at it, add a recommendation to use
TZ=UTC in public-inbox-httpd, in case there are (misguided :P)
sysadmins who set a non-UTC TZ.
|
|
Since extindex uses Xapian shards in a similar way to
v2 inboxes, we'll support -xcpdb (reshard+upgrade) and
-compact all the same to give admins tuning+upgrade
options.
|
|
This allows us to simplify callers throughout, and exceptions are
can no longer be silently hidden. MiscSearch now uses xap_terms
for looking up eidx_key terms for a code reduction.
We also simplify LeiStore->_msg_kw for runtime use by moving the
MsetIterator handling into t/lei_store.t test case.
|
|
Xapian DBs may be modified by a parallel process while we're
reading it, and Xapian's MVCC model places the burden on readers
to retry operations.
We'll also have retry_reopen croak instead of die on errors,
which ought to help us track down some "Document not found"
errors I've occasionally seen when using "lei <q|up>".
|
|
If a user specifies "d:" with a higher precision than it was
traditionally able to handle, switch transparently to "dt:".
This lowers the learning curve and improves DWIM-ness.
v2: fix "d:YYYYMMDD..$NEEDS_APPROXIDATE" case
|
|
"lei q" now displays labels in JSON output, "lei mark"
can add or remove labels for any messages.
"lei ls-label" is supported, too.
Unfortunately, "lei q" won't hande "kw:" or "L:" for
external messages, they must be imported, first.
|
|
These have been confusing to me in the past, too.
|
|
So far, searching by size has never been publicly documented,
and IMHO, of questionable utility. In any case, "z:" is what
mairix(1) uses, so it may be familiar to existing mairix users
(I've never used this prefix myself).
So far, this prefix is only used internally in tests and in
auto-translated queries from IMAP; thus this incompatible change
is unlikely to affect anyone.
|
|
The cleanup doesn't seem to matter, I initially thought I needed
to handle "" (two double quotes) explicitly because that's what
Xapian does to escape a double quote inside a double-quoted
phrase. It turns out we only need to be able to pass phrases
through to Xapian unmodified, and the existing group of
["\x{201c}\x{201d}] is sufficient for our purposes.
|
|
This is for consistency with --stdin and WWW front ends
which can't distinguish between phrase searches and
prefix ranges used for d:/dt:/rt:.
In any case, I expect users on the lei command-line are more
likely to use `5.days.ago' instead of `"5 days ago"'
|
|
This greatly improves the usability of d:, dt:, and rt: search
prefixes for users already familiar git's "approxidate" feature.
That is, users familiar with the --(since|after|until|before)=
options in git-log(1) and similar commands will be able to use
those dates in the WWW UI.
|
|
This fixes both an old bug in "lei q" argv handling and one
recent regression introduced with the change to use approxidate.
Field prefixes are also handled correctly inside parenthesized
statements when the field follows "(" without a separation
character.
Fixes: fbb7ccabbf54a405 ("lei q: use git approxidate with d:, dt: and rt: ranges")
|
|
Order doesn't matter when users are completely downloading
mboxrds onto the FS and then opening them with an MUA. The
MUA is expected to sort the results in the user's preferred
order.
However, lei can start streaming the results to its destination
Maildir (or eventually IMAP/JMAP mailbox) with an MUA already
open. This will let users see recent results sooner in their
MUA, as those tend to have a higher docid. This matches the
behavior of the HTML results, as well.
As a bonus, this is around ~5% faster in a one-off, informal
test case with 66k results. I expect this to hold true in all
all cases since git has always optimized storage to favor recent
objects.
|
|
This is necessary to avoid slowdowns with pathological cases
with many dates in the query, since each rev-parse invocation
takes ~5ms.
This is immeasurably slower with one open-ended range, but
already faster with any closed range featuring two dates which
require parsing via git.
|
|
Instead of having --(sent|received)-(before|after)=s
command-line switches, we'll just try to make sense of argv so
it's usable within parenthesized statements and such.
Given the negligible performance penalty with Inline::C
process spawning, we'll probably wire this up to the
WWW interface, too.
"d:" is for mairix compatibility. I don't know if "dt:" and
"rt:" will be too useful, but they exist because of IMAP
(and JMAP).
|