path: root/lib/PublicInbox/SearchIdx.pm
Date  Commit message
2023-12-16 searchidx: quiet down old git patchid
CentOS 7.x ships with git 1.8.5, so unless a CentOS 7.x user enables 3rd-party repos[1], they'll be stuck with a version of git without `--stable' (though I'm becoming skeptical of indexing patchids at all).
[1] https://public-inbox.org/meta/20210421151308.yz5hzkgm75klunpe@nitro.local/
2023-12-09 *search: simplify handling of Xapian term iterators
Xapian has always sorted termlist iterators, so we now:
1) break out of the iterator loop early on non-matches
2) avoid doing sorting ourselves
As a result, we'll also favor the wantarray forms of xap_terms and all_terms to preserve sort order in most cases.
Confirmed by the Xapian maintainer: <20231201184844.GO4059@survex.com>
Link: https://lists.xapian.org/pipermail/xapian-discuss/2023-December/010013.html
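The early-exit idea can be sketched in Python as follows. This is only an illustration under stated assumptions: `terms_with_prefix` is a hypothetical helper, and a plain sorted list stands in for a Xapian termlist iterator; it is not the project's Perl code.

```python
from typing import Iterable, List


def terms_with_prefix(sorted_terms: Iterable[str], prefix: str) -> List[str]:
    """Collect terms starting with `prefix` from an already-sorted sequence."""
    matches = []
    for term in sorted_terms:
        if term.startswith(prefix):
            matches.append(term)
        elif term > prefix:
            # Input is sorted, so once we are past the prefix range
            # no later term can match: stop iterating early.
            break
    # No sort needed: matches were appended in the iterator's own order.
    return matches
```

Because the input order is preserved, the caller gets sorted results without doing any sorting of its own.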
2023-12-05 searchidx: drop redundant decl in index_git_blob_id
Oddly, Perl did not warn about this. Spotted while confirming abbreviated OIDs are also indexed when unabbreviated OIDs appear.
2023-11-30 spawn: drop IO layer support from redirects
When setting up stdin for commands, the write_file API is convenient enough nowadays to not be worth having special support with process spawning. When reading stdout of commands, we should probably be using utf8_maybe everywhere since there'll always be legacy encodings in git repos. Reading regular files with :utf8 also results in worse memory management since the file size cannot be used as a hint.
2023-11-29 www: load and use cindex join data
This is a major step in solving the problem of having to manually associate hundreds/thousands of coderepos with hundreds/thousands of public-inboxes to power solver (and more).
2023-11-28 disallow NUL characters in Message-ID and List-Id
While MTAs seem to stop '\0' from appearing in headers, users fetching archives via git remain susceptible to having '\0' land in archives. So we'll filter them out of Xapian and SQLite DBs to avoid interoperability problems with CLI tools, since there are no known messages in lore or any of my archives which feature them.
Avoiding '\0' will ensure all indexed Message-IDs and List-Ids can be specified from the command-line (although some characters will still require $(printf) contortions).
As with Message-ID, List-Id fields with /\n\t\r/ characters will also be stripped for indexing. I will assume whatever went wrong with the References: header in <https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw> could also happen to the List-Id header.
This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4 (Import: Don't copy nulls from emails into git, 2018-07-07)
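A minimal Python sketch of this kind of sanitizing is shown below. It assumes the offending characters are simply deleted from the value before indexing; `clean_id_for_index` is a hypothetical name, and the actual Perl code may handle such values differently.

```python
import re

# NUL plus the newline, tab and carriage return characters named in the
# commit message. Assumption: they are removed rather than replaced.
_UNSAFE = re.compile(r"[\x00\n\t\r]")


def clean_id_for_index(raw: str) -> str:
    """Return `raw` with characters unsuitable for indexed IDs removed."""
    return _UNSAFE.sub("", raw)
```

A value cleaned this way can be passed as an ordinary command-line argument, which is not possible for a string containing NUL.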
2023-11-21 searchidx: run `git patch-id' in parallel
Informal benchmarks show a rough 5% indexing improvement on an SMP system when there are idle cores due to Xapian shards being I/O bound (since `git patch-id' is mainly CPU bound). This is only parallelized on a per-patch basis. Further increasing parallelism would increase complexity and probably not be worth it since `git patch-id' is reasonably fast while our text indexing tends to be slow.
2023-11-03 spawn: support PerlIO layer in scalar redirects
We have to deal with UTF-8 data for generating patches, so make it easier to pass Perl utf8 data to git, diff, sdiff, etc. to avoid "Wide character" warnings.
2023-11-03 treewide: use ->close to call ProcessIO->CLOSE
This will open the door for us to drop `tie' usage from ProcessIO completely in favor of OO method dispatch. While OO method dispatches (e.g. `$fh->close') are slower than normal subroutine calls, it hardly matters in this case since process teardown is a fairly rare operation and we continue to use `close($fh)' for Maildir writes.
2023-10-25 spawn: support synchronous run_qx
This is similar to `backtick` but supports all our existing spawn functionality (chdir, env, rlimit, redirects, etc.). It also supports SCALAR ref redirects like run_script in our test suite for std{in,out,err}. We can probably use :utf8 by default for these redirects, even.
2023-10-04 searchidx: fix redundant `in' in warning message
Fortunately, I've never actually seen that message...
2023-09-26 spawn: add run_wait to simplify spawn+waitpid use
It's basically the `system' perlop with support for env overrides, redirects, chdir, rlimits, and setpgid.
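For readers more familiar with Python, a rough analogue of the spawn-then-wait pattern looks like this. It is a sketch only: `run_wait_sketch` is a hypothetical name, it covers just env overrides and chdir (not redirects, rlimits or setpgid), and it returns the exit code rather than a Perl-style wait status.

```python
import os
import subprocess
import sys


def run_wait_sketch(cmd, env_overrides=None, cwd=None):
    """Spawn `cmd`, wait for it to finish, and return its exit code."""
    env = dict(os.environ)           # start from the parent's environment
    env.update(env_overrides or {})  # apply per-call overrides on top
    return subprocess.run(cmd, env=env, cwd=cwd, check=False).returncode
```

A caller might run `run_wait_sketch([sys.executable, "-c", "pass"], env_overrides={"LANG": "C"})` and check that the result is zero.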
2023-04-29 git: make check_async callbacks identical to cat_async
This simplifies Git->cat_async_step and fixes Git->async_abort, the latter of which was passing arguments improperly for the --batch-check (or `info') case at the cost of making the few check_async callers handle an extra argument. The extra (PublicInbox::Git) $self argument for check_async callbacks is now gone, as avoiding the temporary cyclic reference doesn't seem worthwhile since the temporary cyclic reference appears in the ->cat_async code paths, too.
2023-04-25 searchidx: reduce short-lived variables for TermGenerator
We can avoid needless refcount traffic in some cases.
2023-04-11 git: fix cat_async_retry
Retrying requests on alternates changing was causing inflight requests to get lost due to {inflight} getting clobbered by batch_prepare. Unfortunately, reproducing this is difficult without mocking ->alternates_changed. SearchIdx now avoids calling ->batch_prepare directly and relies on more common API functions.
Fixes: 65db62eb006f ("git: use --batch-command in git 2.36+ to save processes")
2023-04-07 searchidx: use vstring to improve readability
Perl has a native `vstring' encoding for vector (or version) strings; make use of it instead of relying on difficult-to-read hex versions and integer shifts.
2023-04-07 umask: rely on the OnDestroy-based call where applicable
This lets us get rid of some awkwardness around the old API and single-use subroutines while saving us some LoC.
2023-04-07 umask: hoist out of InboxWritable
Since CodeSearchIdx doesn't deal with inboxes, it makes sense to split it out from inbox-specific code and start moving towards using OnDestroy to restore the umask at the end of scope and reducing extra functions.
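The "restore the umask at the end of scope" pattern can be illustrated in Python with a context manager. This is an analogy, not the project's OnDestroy mechanism, and `scoped_umask` is a hypothetical name.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_umask(mask: int):
    """Set the process umask to `mask` and restore the old one on exit."""
    previous = os.umask(mask)
    try:
        yield previous
    finally:
        # Runs on normal exit and on exceptions, like a scope-end destructor.
        os.umask(previous)
```

Code that creates index files would run inside `with scoped_umask(0o077):`, and the caller's original umask comes back automatically afterwards.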
2023-04-07 cindex: preserve indexlevel across invocations
This matches the behavior of mail indexers and ensures `medium' indices don't grow unexpectedly to become `full' indices.
2023-04-06 watch: use detect_indexlevel for unconfigured inboxes
I favor leaving the publicinbox.<name>.indexlevel parameter out of config files to make it easier to alter and reduce sources of truth. It worked well in most cases, but public-inbox-watch also needs to detect the indexlevel. Moving the sub to InboxWritable (from Admin) probably makes sense since it's a per-inbox attribute and allows -watch to reuse it.
2023-03-29 cindex: interleave prune with indexing
We need to ensure we don't block indexing for too long while pruning, since pruning coderepos seems more frequent and necessary than inbox repos due to the prevalence of force pushes with branches like `seen' (formerly `pu') in git.git. Implement this via ->event_step and requeue mechanisms of DS so we periodically flush our work and let indexing resume.
I originally wanted to implement this as a dedicated group of workers, but the XS Search::Xapian bug[1] workaround to handle uncaught C++ exceptions was expensive and complex compared to the evented mechanism.
[1] https://lists.xapian.org/pipermail/xapian-discuss/2023-March/009967.html <20230327114604.M803690@dcvr>
2023-03-25 codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to deduplicate and handle repositories with common history. I initially wanted to shoehorn this into extindex, but decided a separate Xapian index layout capable of being EITHER external to handle many forks or internal (in $GIT_DIR/public-inbox-cindex) for small projects is the right way to go. Unlike most existing parts of public-inbox, this relies on absolute paths of $GIT_DIR stored in the Xapian DB and does not rely on the config file. We'll be relying on the config file to map absolute paths to public URL paths for WWW.
2023-02-20 searchidx: do not index quoted Base-85 patches
Base-85 binary patches were a source of false positives in results and we've filtered them out of non-quoted text since July 2022. Unfortunately, people were quoting binary patch contents in replies (*sigh*) and triggering false positives in search results. So we must filter out base-85-looking contents from quoted text, too.
Followup-to: 8fda04081acde705 (search: do not index base-85 binary patches, 2022-06-20)
Followup-to: 840785917bc74c8e (searchidx: skip "delta $N" sections for base-85, 2022-07-19)
2022-08-18 searchidx: fix spelling error in comment
2022-07-19 searchidx: skip "delta $N" sections for base-85
I don't deal with binary patches ever, so I failed to notice binary deltas are supported in addition to the more common literals. A quick check of apply.c in git.git confirms "delta" and "literal" are the only binary patch classes we can expect.
2022-06-21 search: do not index base-85 binary patches
Base-85 binary patches generated by git lead to many false positives, so skip over gibberish words which may occur in them. To avoid regressions in search results, continue to allow searching for exact size matches (via "literal $SIZE") and the phrase "GIT binary patch" for the mere presence of a binary patch.
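One way to picture the filtering is the Python sketch below. It is a heuristic written for illustration, not the project's implementation: `indexable_lines` is a hypothetical helper, the payload regexp is an assumption about what git's base-85 lines look like, and "delta" headers are included on the strength of the later follow-up commit.

```python
import re
from typing import Iterable, Iterator

# Size headers that should stay searchable, e.g. "literal 1234".
_HDR = re.compile(r"^(?:literal|delta) \d+$")
# Assumption: a payload line is a length letter followed by characters
# from git's base-85 alphabet. This is a heuristic, not git's own parser.
_B85 = re.compile(r"^[A-Za-z][0-9A-Za-z!#$%&()*+;<=>?@^_`{|}~-]+$")


def indexable_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines worth indexing, dropping base-85 payload lines."""
    in_binary = False
    for line in lines:
        if line == "GIT binary patch":
            in_binary = True
            yield line
        elif in_binary and _HDR.match(line):
            yield line
        elif in_binary and _B85.match(line):
            continue  # gibberish payload: skip it
        else:
            if line != "":
                in_binary = False  # ordinary text ends the binary section
            yield line
```

The marker phrase and the size headers survive, so searches for "GIT binary patch" or an exact "literal $SIZE" keep working while the payload is ignored.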
2022-06-21 search: support "patchid:" prefix (git patch-id --stable)
This allows easy searching via patch-id from a git commit. Currently, abbreviations are not supported, and it seems needless to support them since AFAIK (git) doesn't generate nor resolve abbreviated patch-ids anywhere.
2022-06-21 searchidx: use regexp as first arg for `split' op
Current implementations of Perl5 don't have optimizations for single-character field separators (unlike another non-Perl5 VM I'm familiar with).
2022-03-08 index|extindex: support --dangerous flag
This enables Xapian::DB_DANGEROUS to support in-place updates. This can speed up the initial index and reduce I/O at the cost of preventing concurrent readers and being unsafe in the face of any abnormal terminations. This is more dangerous than --no-fsync. --no-fsync is only unsafe in the event of a power loss or kernel crash; --dangerous is unsafe even on SIGKILL.
2022-01-31 rewrite Linux nodatacow use in pure Perl w/o system
btrfs is Linux-only at the moment (and likely to remain that way for practical purposes). So rely on Linux ABI stability and use the `syscall' and `ioctl' perlops rather than relying on Inline::C. Inline::C (and gcc||clang) are monstrous dependencies which we can't expect users to have. This makes supporting new architectures more difficult, but new architectures come along rarely and this reduces the burden for the majority of Linux users on popular architectures (while still avoiding the distribution of pre-built binaries). Link: https://public-inbox.org/meta/YbCPWGaJEkV6eWfo@codewreck.org/
2021-11-22 searchidx: avoid modification of read-only `$_'
This fixes the "Modification of a read-only value attempted at ..." error in an initial run of t/reindex-time-range.t. It was reproducible by running `rm -rf t/data-gen/reindex-time-range.v*' before `make && prove -bvw t/reindex-time-range.t'. Thanks to Jörg Rödel for providing the backtrace which helped find this. Debugged-by: Jörg Rödel <joro@8bytes.org> Link: https://public-inbox.org/meta/YZuZEY+WSnm4wlrS@8bytes.org/
2021-11-09 searchidx: index "diff --git a/... b/..." headers
While we do detailed indexing of git diffs, the header itself was failing and queries like 'nq:diff' would not work. Noticed-by: Rob Herring <robh@kernel.org>
2021-10-23 searchidx: v1: raise on msgmap init failure
Indexing any inboxes requires SQLite and msgmap, so don't hide exceptions if it fails.
2021-10-15 lei q: avoid kw lookup failure on remote mboxrd
When importing several sources in parallel via http(s) mboxrd, we need to be able to get keywords of uncommitted documents directly from shard workers. Otherwise, Xapian DocNotFound errors happen because the read-only LeiSearch won't see documents from uncommitted transactions. Keep in mind that it's possible the keywords can be changed on-the-fly even for uncommitted documents because of inotify watches from LeiNoteEvent.
2021-10-13 index: optimize after all SQLite DB commits
This covers v1 inboxes, as well. We also guard the execution since "PRAGMA optimize" was only introduced in SQLite 3.18.0 (2017-03-30).
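The version guard might look like this in Python's standard `sqlite3` module. This is a sketch of the idea only: `optimize_if_supported` is a hypothetical name and the project's Perl code is structured differently.

```python
import sqlite3


def optimize_if_supported(conn: sqlite3.Connection) -> bool:
    """Run "PRAGMA optimize" when the linked SQLite is new enough.

    Returns True if the pragma was issued, False if it was skipped.
    """
    if sqlite3.sqlite_version_info >= (3, 18, 0):
        conn.execute("PRAGMA optimize")
        return True
    return False
```

Calling it right after the final commit keeps the check in one place, and older SQLite builds skip the statement.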
2021-10-12 msgmap: ->new_file to supports $ibx arg, drop ->new
The original Msgmap->new API was v1-specific and not necessary. The ->new_file API now supports an $ibx object being passed to it, simplifying -no_fsync use. It will also make an upcoming change easier...
2021-10-10 extindex: speed up Xapian cleanup in --gc
Avoiding repeated SQL statements brings --gc down to 2-3 minutes from around 10. We'll also add some checkpoints around over and xref3 cleanups.
2021-10-06 extindex: --gc checkpoints
We need to ensure -extindex --gc runs don't prevent other work from happening in the meantime. I actually caused my -extindex to OOM due to the lack of checkpoints :x We'll also hoist out the shard scanning into its own sub in preparation for lei/store usage.
2021-10-05 index: --reindex w/ --{since,until,before,after}
This lets administrators reindex specific time ranges according to git "approxidate" formats. These arguments are passed directly to underlying git-log(1) invocations and may still reach into old epochs. Since these options rely on git committer dates (which we infer from the most recent Received: header), they are not guaranteed to be strictly tied to git history and it's possible to over/under-reindex some messages. It's probably not a major problem in practice, though; reindexing a few extra messages is generally harmless aside from some extra device wear. Since this currently relies on git-log, these options do not affect -extindex, yet.
2021-08-08 searchidx: die on Xapian load errors
Xapian bindings may not be installed or may be out-of-date w.r.t. the Perl version; improve the visibility of errors in those cases. Cleanup and drop some redundant checks while we're at it.
Cc: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://public-inbox.org/meta/87k0ky5mbd.fsf@toke.dk/
2021-06-30 searchidx: default BATCH_BYTES to 8MB on 64-bit systems
This default seems closer to reasonable on 64-bit systems which are the norm these days. 32-bit systems gain 48K so it's an even 1 MB, but we need to keep 32-bit systems from using too much since there are still some ancient systems out there with small inboxes.
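A word-size-dependent default of this kind could be expressed in Python as below. The sketch assumes binary megabytes (8 × 1024 × 1024 and 1024 × 1024 bytes); `default_batch_bytes` is a hypothetical name, and the real code detects the platform in Perl.

```python
import struct
from typing import Optional


def default_batch_bytes(pointer_bits: Optional[int] = None) -> int:
    """Pick a batch size: 8 MB on 64-bit systems, 1 MB otherwise."""
    if pointer_bits is None:
        # Size of a native pointer in bits for the running interpreter.
        pointer_bits = struct.calcsize("P") * 8
    if pointer_bits >= 64:
        return 8 * 1024 * 1024
    return 1024 * 1024
```

Passing `pointer_bits` explicitly is only there to make the choice testable; normal callers would rely on detection.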
2021-06-23 search: make xap_terms easier-to-use and use it more
This allows us to simplify callers throughout, and exceptions can no longer be silently hidden. MiscSearch now uses xap_terms for looking up eidx_key terms for a code reduction. We also simplify LeiStore->_msg_kw for runtime use by moving the MsetIterator handling into t/lei_store.t test case.
2021-06-17 lei/store: cull redundant docids based on blob OID
I'm not sure how this happened (only once for me in March), but it should not happen... In any case, we'll operate on the lowest numbered docid and cull redundant index entries when lei/store is open for read-write. This also fixes the normal lei/store removal path to clean up the xref3 table (since it's not done automatically for public-facing -eidx due to the multi-list nature of it).
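The "keep the lowest-numbered docid, cull the rest" rule can be sketched in Python as follows. It assumes an in-memory mapping of docid to blob OID purely for illustration; `plan_cull` is a hypothetical helper and the real code works against Xapian and SQLite.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_cull(docid_to_oid: Dict[int, str]) -> Tuple[Dict[str, int], List[int]]:
    """Return ({oid: docid to keep}, [docids to cull]) for duplicated OIDs."""
    by_oid = defaultdict(list)
    for docid, oid in docid_to_oid.items():
        by_oid[oid].append(docid)

    keep: Dict[str, int] = {}
    cull: List[int] = []
    for oid, docids in by_oid.items():
        docids.sort()
        keep[oid] = docids[0]    # operate on the lowest-numbered docid
        cull.extend(docids[1:])  # everything else is redundant
    return keep, sorted(cull)
```

Planning the cull separately from applying it makes it easy to restrict the destructive step to the case where the store is open for read-write.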
2021-04-23 lei import: support adding keywords and labels on import
This saves some work and makes it easier to set volatile metadata on a message at import time.
2021-03-26 lei: add some labels support
"lei q" now displays labels in JSON output, and "lei mark" can add or remove labels for any messages. "lei ls-label" is supported, too. Unfortunately, "lei q" won't handle "kw:" or "L:" for external messages; they must be imported first.
2021-03-24 lei mark: command for (un)setting keywords and labels
Only tested for keywords and labels with file inputs, so far; but it seems to do what it needs to do. There's a bit more redundant code than I'd like, and more opportunities for code sharing in the future. "lei import" will be expanded to support +kw:$KEYWORD and +L:$LABEL in the future.
2021-03-21 lei import: vivify external-only messages
Keyword storage for external-only messages was preventing messages from being explicitly imported. Teach lei_store to vivify keyword-only entries into fully-indexed messages on import.
2021-03-21 lei q: support vmd for external-only messages
"lei q" now preserves changes to per-message keywords across invocations when its --output (Maildir or mbox) is reused (with or without --augment). In the future, these changes will be monitored via inotify, EVFILT_VNODE or IMAP IDLE, too. Unfortunately, this currently prevents "lei import" from ever importing a message that's in an external. That will be fixed in a future change.
2021-03-17 lei_store: keywords => vmd (volatile metadata), prepare for labels
Since keywords and mailboxes (AKA labels) are separate things in JMAP; and only keywords can map reliably to Maildir and mbox; we'll keep them separate in our internal data representations, too. I initially wanted to call this just "meta" for "metadata", but that might be confused with our mailing list name. "metadata" is already used in Xapian's own API, to add another layer of confusion. "tags" was also considered, but probably confusing to notmuch users since our "labels" are analogous to "tags" in notmuch, and notmuch doesn't seem to cover "keywords" separately... So "vmd" it is, since we haven't used this particular three-letter-abbreviation anywhere before; and "volatile" seems like a good description of this metadata since everything else up to this point has been mostly WORM (write-once, read-many).
2021-03-13 searchidx: fix -Lmedium for IDs and filenames
This fixes "m:", "l:", "f:", "t:", "c:", "dfn:", and "n:" search prefixes under indexlevel=medium when mixed with indexlevel=full inboxish. We need positional data for Message-IDs, List-Id, email addresses and filenames for exact matches, though we still want to support wildcards. Fortunately the storage cost is still small as these prefixes tend to be small compared to message bodies. These are NOT boolean terms since wildcard support and partial matching is desired.