Date | Commit message (Collapse) |
|
CentOS 7.x ships with git 1.8.5, so unless a CentOS 7.x user
enables 3rd-party repos[1], they'll be stuck with a version
of git without `--stable' (though I'm becoming skeptical of
indexing patchids at all).
[1] https://public-inbox.org/meta/20210421151308.yz5hzkgm75klunpe@nitro.local/
|
|
Xapian has always sorted termlist iterators, so we now:
1) break out of the iterator loop early on non-matches
2) avoid doing sorting ourselves
As a result, we'll also favor the wantarray forms of xap_terms
and all_terms to preserve sort order in most cases.
Confirmed by the Xapian maintainer: <20231201184844.GO4059@survex.com>
Link: https://lists.xapian.org/pipermail/xapian-discuss/2023-December/010013.html
|
|
Oddly, Perl did not warn about this. Spotted while confirming
abbreviated OIDs are also indexed when unabbreviated OIDs
appear.
|
|
When setting up stdin for commands, the write_file API is
convenient enough nowadays to not be worth having special
support with process spawning.
When reading stdout of commands, we should probably be using
utf8_maybe everywhere since there'll always be legacy encodings
in git repos.
Reading regular files with :utf8 also results in worse memory
management since the file size cannot be used as a hint.
|
|
This is a major step in solving the problem of having to
manually associate hundreds/thousands of coderepos with
hundreds/thousands of public-inboxes to power solver
(and more).
|
|
While MTAs seem to stop '\0' from appearing in headers, users
fetching archives via git remain susceptible to having '\0' land
in archives. So we'll filter them out of Xapian and SQLite DBs
to avoid interopability problems with CLI tools since there's no
known messages in lore or any of my archives which feature them.
Avoiding '\0' will ensure all indexed Message-IDs and List-Ids
can be specified from the command-line (although some characters
will still require $(printf) contortions).
As with Message-ID, List-Id fields with /\n\t\r/ characters will
also be stripped for indexing. I will assume whatever went wrong
with the References: header in
<https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw>
could also happen to the List-Id header.
This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4
(Import: Don't copy nulls from emails into git, 2018-07-07)
|
|
Informal benchmarks show a rough 5% indexing improvement on an
SMP system when there are idle cores due to Xapian shards being
I/O bound (since `git patch-id' is mainly CPU bound).
This is only parallelized on a per-patch basis. Further
increasing parallelism would increase complexity and probably
not be worth it since `git patch-id' is reasonably fast while
our text indexing tends to be slow.
|
|
We have to deal with UTF-8 data for generating patches, so make
it easier to pass Perl utf8 data to git, diff, sdiff, etc. to
avoid "Wide character" warnings.
|
|
This will open the door for us to drop `tie' usage from
ProcessIO completely in favor of OO method dispatch. While
OO method dispatches (e.g. `$fh->close') are slower than normal
subroutine calls, it hardly matters in this case since process
teardown is a fairly rare operation and we continue to use
`close($fh)' for Maildir writes.
|
|
This is similar to `backtick` but supports all our existing spawn
functionality (chdir, env, rlimit, redirects, etc.). It also
supports SCALAR ref redirects like run_script in our test suite
for std{in,out,err}.
We can probably use :utf8 by default for these redirects, even.
|
|
Fortunately, I've never actually seen that message...
|
|
It's basically the `system' perlop with support for env overrides,
redirects, chdir, rlimits, and setpgid support.
|
|
This simplifies Git->cat_async_step and fixes Git->async_abort,
the latter of which was passing arguments improperly for the
--batch-check (or `info') case at the cost of making the few
check_async callers handle an extra argument.
The extra (PublicInbox::Git) $self argument for check_async
callbacks is now gone, as avoiding the temporary cyclic
reference doesn't seem worthwhile since the temporary cyclic
reference appears in the ->cat_async code paths, too.
|
|
We can avoid needless refcount traffic in some cases.
|
|
Retrying requests on alternates changing was causing inflight
requests to get lost due to {inflight} getting clobbered by
batch_prepare.
Unfortunately, reproducing this is difficult without mocking
->alternates_changed.
SearchIdx now avoids calling ->batch_prepare directly and
relies on more common API functions.
Fixes: 65db62eb006f ("git: use --batch-command in git 2.36+ to save processes")
|
|
Perl has native `vstring' encoding for vector (or version)
strings, make use of it instead of relying on difficult-to-read
hex versions and integer shifts.
|
|
This lets us get rid of some awkwardness around the old API
and single-use subroutines while saving us some LoC.
|
|
Since CodeSearchIdx doesn't deal with inboxes, it makes sense
to split it out from inbox-specific code and start moving
towards using OnDestroy to restore the umask at the end of
scope and reducing extra functions.
|
|
This matches the behavior of mail indexers and ensures `medium'
indices don't grow unexpectedly to be come `full' indices.
|
|
I favor leaving the publicinbox.<name>.indexlevel parameter
out of config files to make it easier to alter and reduce
sources of truth. It worked well in most cases, but
public-inbox-watch also needs to detect the indexlevel.
Moving the sub to InboxWritable (from Admin) probably makes
sense since it's a per-inbox attribute and allows -watch
to reuse it.
|
|
We need to ensure we don't block indexing for too long while
pruning, since pruning coderepos seems more frequent and
necessary than inbox repos due to the prevalence of force
pushes with branches like `seen' (formerly `pu') in git.git.
Implement this via ->event_step and requeue mechanisms of DS so
we periodically flush our work and let indexing resume.
I originally wanted to implement this as a dedicated group
of workers, but the XS Search::Xapian bug[1] workaround
to handle uncaught C++ exceptions was expensive and complex
compared to the evented mechanism.
[1] https://lists.xapian.org/pipermail/xapian-discuss/2023-March/009967.html
<20230327114604.M803690@dcvr>
|
|
It seems relying on root commits is a reasonable way to
deduplicate and handle repositories with common history.
I initially wanted to shoehorn this into extindex, but decided a
separate Xapian index layout capable of being EITHER external to
handle many forks or internal (in $GIT_DIR/public-inbox-cindex)
for small projects is the right way to go.
Unlike most existing parts of public-inbox, this relies on
absolute paths of $GIT_DIR stored in the Xapian DB and does not
rely on the config file. We'll be relying on the config file to
map absolute paths to public URL paths for WWW.
|
|
Base-85 binary patches were a source of false-positives in results
and we've filtered out in non-quoted text since July 2022.
Unfortunately, people were quoting binary patch contents
in replies (*sigh*) and triggering false positives in search
results. So we must filter out base-85-looking contents from
quoted text, too.
Followup-to: 8fda04081acde705 (search: do not index base-85 binary patches, 2022-06-20)
Followup-to: 840785917bc74c8e (searchidx: skip "delta $N" sections for base-85, 2022-07-19)
|
|
|
|
I don't deal with binary patches ever, so I failed to notice
binary deltas are supported in addition to the more common
literals.
A quick check of apply.c in git.git confirms "delta" and
"literal" are the only binary patch classes we can expect.
|
|
Base-85 binary patches generated by git lead to many false
positives, so skip over gibberish words which may occur in them.
To avoid regressions in search results, continue to allow
searching for exact size matches (via "literal $SIZE") and the
phrase "GIT binary patch" for the mere presence of a binary
patch.
|
|
This allows easy searching via patch-id from a git commit.
Currently, abbreviations are not supported, and it seems
needless to support them since AFAIK (git) doesn't generate
nor resolve abbreviated patch-ids anywhere.
|
|
Current implementations of Perl5 don't have optimizations for
single-character field separators (unlike another non-Perl5 VM
I'm familiar with).
|
|
This enables Xapian::DB_DANGEROUS to support in-place updates.
This can speed up the initial index and reduce I/O at the cost
of preventing concurrent readers and being unsafe in the face of
any abnormal terminations. This is more dangerous than
--no-fsync. --no-fsync is only unsafe in the event of a power
loss or kernel crash; --dangerous is unsafe even on SIGKILL.
|
|
btrfs is Linux-only at the moment (and likely to remain that way
for practical purposes). So rely on Linux ABI stability and use
the `syscall' and `ioctl' perlops rather than relying on Inline::C.
Inline::C (and gcc||clang) are monstrous dependencies which we
can't expect users to have.
This makes supporting new architectures more difficult, but new
architectures come along rarely and this reduces the burden for
the majority of Linux users on popular architectures (while
still avoiding the distribution of pre-built binaries).
Link: https://public-inbox.org/meta/YbCPWGaJEkV6eWfo@codewreck.org/
|
|
This fixes the "Modification of a read-only value attempted at ..."
error in an initial run of t/reindex-time-range.t. It was
reproducible by running `rm -rf t/data-gen/reindex-time-range.v*'
before `make && prove -bvw t/reindex-time-range.t'. Thanks to
Jörg Rödel for providing the backtrace which helped find this.
Debugged-by: Jörg Rödel <joro@8bytes.org>
Link: https://public-inbox.org/meta/YZuZEY+WSnm4wlrS@8bytes.org/
|
|
While we do detailed indexing of git diffs, the header itself
was failing and queries like 'nq:diff' would not work.
Noticed-by: Rob Herring <robh@kernel.org>
|
|
Indexing any inboxes requires SQLite and msgmap, so don't hide
exceptions if it fails.
|
|
When importing several sources in parallel via http(s) mboxrd,
we need to be able to get keywords of uncommitted documents
directly from shard workers. Otherwise, Xapian DocNotFound
errors happen because the read-only LeiSearch won't see
documents from uncomitted transactions. Keep in mind that it's
possible the keywords can be changed on-the-fly even for
uncommitted documents because of inotify watches from LeiNoteEvent.
|
|
This covers v1 inboxes, as well. We also guard the execution
since "PRAGMA optimize" was only introduced in SQLite 3.18.0
(2017-03-30)
|
|
The original Msgmap->new API was v1-specific and not necessary.
The ->new_file API now supports an $ibx object being passed to
it, simplify -no_fsync use. It will also make an upcoming
change easier...
|
|
Avoiding repeated SQL statements brings --gc down to 2-3 minutes
from around 10. We'll also add some checkpoints around over and
xref3 cleanups.
|
|
We need to ensure -extindex --gc runs don't prevent other
work from happening in the meantime. I actually caused
my -extindex to OOM due to the lack of checkpoints :x
We'll also hoist out the shard scanning into its own sub
in preparation for lei/store usage.
|
|
This lets administrators reindex specific time ranges
according to git "approxidate" formats. These arguments
are passed directly to underlying git-log(1) invocations
and may still reach into old epochs.
Since these options rely on git committer dates (which we infer
from the most recent Received: header), they are not guaranteed
to be strictly tied to git history and it's possible to
over/under-reindex some messages. It's probably not a major
problem in practice, though; reindexing a few extra messages
is generally harmless aside from some extra device wear.
Since this currently relies on git-log, these options do not
affect -extindex, yet.
|
|
Xapian bindings may not be installed or be out-of-date w.r.t. the
Perl version, improve the visibility of errors in those cases.
Cleanup and drop some redundant checks while we're at it.
Cc: "Toke Høiland-Jørgensen" <toke@toke.dk>
Link: https://public-inbox.org/meta/87k0ky5mbd.fsf@toke.dk/
|
|
This default seems closer to reasonable on 64-bit systems which
are the norm these days. 32-bit systems gain 48K so it's an
even 1 MB, but we need to keep 32-bit systems from using too
much since there's still some ancient systems out there with
small inboxes.
|
|
This allows us to simplify callers throughout, and exceptions are
can no longer be silently hidden. MiscSearch now uses xap_terms
for looking up eidx_key terms for a code reduction.
We also simplify LeiStore->_msg_kw for runtime use by moving the
MsetIterator handling into t/lei_store.t test case.
|
|
I'm not sure how this happened (only once for me in March), but
it should not happen... In any case, we'll operate on the
lowest numbered docid and cull redundant index entries when
lei/store is open for read-write.
This also fixes the normal lei/store removal path to clean up
the xref3 table (since it's not done automatically for
public-facing -eidx due to the multi-list nature of it).
|
|
This saves some work and makes it easier to set volatile
metadata on a message at import time.
|
|
"lei q" now displays labels in JSON output, "lei mark"
can add or remove labels for any messages.
"lei ls-label" is supported, too.
Unfortunately, "lei q" won't hande "kw:" or "L:" for
external messages, they must be imported, first.
|
|
Only tested for keywords and labels with file inputs, so far;
but it seems to do what it needs to do. There's a bit more
redundant code than I'd like, and more opportunities for code
sharing in the future
"lei import" will be expanded to support +kw:$KEYWORD and
+L:$LABEL in the future.
|
|
Keyword storage for external-only messages was preventing
messages from being explicitly imported. Teach lei_store
to vivify keyword-only entries into fully-indexed messages
on import.
|
|
"lei q" now preserves changes per-message keywords across
invocations when it's --output (Maildir or mbox) is reused
(with or without --augment).
In the future, these changes will be monitored via inotify,
EVFILT_VNODE or IMAP IDLE, too.
Unfortunately, this currently prevents "lei import" from ever
importing a message that's in an external. That will be fixed
in a future change.
|
|
Since keywords and mailboxes (AKA labels) are separate things in
JMAP; and only keywords can map reliably to Maildir and mbox;
we'll keep them separate in our internal data representations,
too.
I initially wanted to call this just "meta" for "metadata", but
that might be confused with our mailing list name. "metadata"
is already used in Xapian's own API, to add another layer of
confusion.
"tags" was also considered, but probably confusing to notmuch
users since our "labels" are analogous to "tags" in notmuch,
and notmuch doesn't seem to cover "keywords" separately...
So "vmd" it is, since we haven't used this particular
three-letter-abbreviation anywhere before; and "volatile" seems
like a good description of this metadata since everything else
up to this point has been mostly WORM (write-once, read-many).
|
|
This fixes "m:", "l:", "f:", "t:", "c:", "dfn:", and "n:" search
prefixes under indexlevel=medium when mixed with indexlevel=full
inboxish. We need positional data for Message-IDs, List-Id,
email addresses and filenames for exact matches, though we still
want to support wildcards.
Fortunately the storage cost is still small as these prefixes
tend to be small compared to message bodies. These are NOT
boolean terms since wildcard support and partial matching is
desired.
|