path: root/lib
Date Commit message
2023-12-09 xap_helper: support term length limit
This will allow us to use p2q-compatible specifications such as "dfpost7" to only capture blob OIDs which are 7 characters in length (the indexer will always index down to 7 characters)
2023-12-09 xap_helper_cxx: drop chdir usage in build
While chdir simplifies path manipulation on our end, its use falls over when PERL5LIB/@INC contains relative paths which need to be made absolute. It's fewer lines of code to eliminate chdir usage than it is to keep using relative paths in most places.
2023-12-09 *search: favor wantarray form of xap_terms
Most xap_terms callers do not benefit from the hashref return value, and we can delay hashmap use until List::Util::uniqstr if needed.
2023-12-09 *search: simplify handling of Xapian term iterators
Xapian has always sorted termlist iterators, so we now:
1) break out of the iterator loop early on non-matches
2) avoid doing sorting ourselves
As a result, we'll also favor the wantarray forms of xap_terms and all_terms to preserve sort order in most cases.
Confirmed by the Xapian maintainer: <20231201184844.GO4059@survex.com>
Link: https://lists.xapian.org/pipermail/xapian-discuss/2023-December/010013.html
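The early-exit idea can be sketched in Python. This is an illustration only, not the project's Perl code: `terms_with_prefix` is a hypothetical helper, and a sorted Python list stands in for Xapian's sorted termlist iterator.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    """Return terms starting with `prefix`, with the prefix stripped.

    Assumes `sorted_terms` is already sorted, as a stand-in for a
    sorted term iterator.
    """
    out = []
    # Jump straight to the first term >= prefix.
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted input: no later term can match
        out.append(term[len(prefix):])
    return out  # already in sorted order, so no extra sort is needed

# Made-up terms for illustration only.
terms = ["P/a.git", "P/b.git", "Qfoo", "XDFPOSTabc1234"]
print(terms_with_prefix(terms, "P"))
```

Because matching terms form one contiguous run in sorted input, the loop stops at the first non-match instead of walking the whole list.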
2023-12-08 workaround --headers bug with spamc(1)
As of SpamAssassin 4.0.0, spamc(1) corrupts messages with NUL in the body when the `--headers' switch is used. This increases transport costs, but most spamc/spamd setups are via local sockets, so it's unlikely to be significant.
Link: https://bugs.debian.org/1057749
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
2023-12-06 cindex: avoid recursion on prune
There's no need to recurse and trigger deep recursion warnings when we hit a coderepo with a known hash (SHA-1 vs SHA-256). Noticed while pruning the 1200+ repos on a git.kernel.org mirror.
2023-12-05 cindex: index full (40/64 char) hex blob OIDs
This future proofs the index against git auto-abbreviation needing more characters as the repo grows. It'll be useful for joining against inboxes using dfpre. As with emails, we'll continue indexing abbreviated blob OIDs down to 7 hex characters so a SHA-1 git repo will have all abbreviations of the OID from 7-39 hex characters in addition to the 40 character unabbreviated form.
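As a rough illustration of the abbreviation range described above (Python, not the indexer's actual code; `oid_abbreviations` is a hypothetical name and the OID is made up):

```python
def oid_abbreviations(oid, min_len=7):
    """Return every prefix of `oid` from `min_len` characters up to the full OID."""
    return [oid[:n] for n in range(min_len, len(oid) + 1)]

# Made-up 40-character hex string standing in for a SHA-1 blob OID.
sha1_oid = "0123456789abcdef0123456789abcdef01234567"
abbrevs = oid_abbreviations(sha1_oid)
print(len(abbrevs))  # 34 entries: lengths 7 through 40
print(abbrevs[0], abbrevs[-1])
```

A 64-character SHA-256 OID would yield 58 entries under the same assumption.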
2023-12-05 searchidx: drop redundant decl in index_git_blob_id
Oddly, Perl did not warn about this. Spotted while confirming abbreviated OIDs are also indexed when unabbreviated OIDs appear.
2023-12-01 xap_helper: enable stderr assignment on DragonFly
It looks like DragonFly inherited this from FreeBSD, allowing us to save some syscalls.
2023-12-01 tests: note kevent+tmpfs failures on DragonFly <= 6.4
I forgot to set TMPDIR=/path/to/non-tmpfs again.
2023-12-01 xap_helper.h: fix non-assignable stderr case
I mixed up "flush" with "close" :x
Fixes: 87b7f633f241 (xap_helper: implement mset endpoint for WWW, IMAP, etc...)
2023-11-30 codesearch: use retry_reopen for WWW
As with mail search, a cindex may be updated while WWW is serving requests. Thus we must reopen the Xapian DB when the revision we're using becomes stale.
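The reopen-and-retry pattern can be sketched as follows. This is a hedged Python illustration, not the project's retry_reopen: `FakeDB` and the local `DatabaseModifiedError` class are stand-ins invented for the example, and `max_tries` is an assumed parameter.

```python
class DatabaseModifiedError(Exception):
    """Stand-in for the error a reader gets when its revision is stale."""

class FakeDB:
    """Hypothetical DB handle whose first query fails as stale."""
    def __init__(self):
        self.stale = True
        self.reopens = 0

    def reopen(self):
        self.stale = False
        self.reopens += 1

    def query(self, q):
        if self.stale:
            raise DatabaseModifiedError(q)
        return ["result for " + q]

def retry_reopen(db, callback, max_tries=3):
    """Run callback(db), reopening the DB and retrying if it went stale."""
    for attempt in range(max_tries):
        try:
            return callback(db)
        except DatabaseModifiedError:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            db.reopen()

db = FakeDB()
print(retry_reopen(db, lambda d: d.query("dfn:Makefile")))
```

Bounding the retries keeps a request from looping forever if a writer keeps committing faster than the reader can finish.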
2023-11-30 inbox: shrink data structures for publicinbox.*.hide
We no longer vivify the intermediate $ibx->{-hide} hashref; instead, we use $ibx->{-hide_$KEY} directly. This avoids an intermediate hashref and extra hash table lookups.
2023-11-30 www_listing: support publicInbox.nameIsUrl
This is a convenient (and slightly memory-saving) alternative to specifying a `publicinbox.*.url' entry for every single inbox when using publicinbox.wwwListing.
2023-11-30 git_async_cat: use git from "all" extindex if possible
For inboxes associated with an extindex (currently only the special "all" one), we can share the git process across all those inboxes unambiguously when retrieving full SHA-1 blobs. The comment for my proposed patch is also out-of-date as that git speedup has been a part of git since 2.33.
2023-11-30 inbox: expire resources more aggressively
We no longer trigger git cleanups from the Inbox package since `git cat-file' users have their own cleanup to support git coderepos not associated with any inbox. This change means we unconditionally expire SQLite and Xapian FDs and some internal caches regardless of git activity. The old logic was irrelevant to Gcf2 (libgit2) users anyways since we couldn't determine whether or not an inbox was active based on {inflight} git requests, and upcoming changes will make it inaccurate for all extindex/cindex users as well. Opening SQLite and Xapian DBs is fairly cheap; so it's a small price to pay to reduce memory use and fragmentation.
2023-11-30 cindex: speed up initial scan setup phase
This brings a no-op -cindex scan of a git.kernel.org mirror down from 70s to 10s with a hot cache on a busy machine. CPU-intensive SHA-256 fingerprinting of the `git show-ref' result can be parallelized on shard workers. Future changes can move more of the initial scan setup phase into shard workers for more parallelism. But most of the performance for skipping unchanged repos is gained from delaying the commit time reading until we've seen the fingerprint is out-of-date, since reading commit times requires a large amount of I/O compared to only reading refs for fingerprints.
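The "fingerprint first, commit times later" ordering can be illustrated with a small Python sketch. It is not the cindex implementation: `refs_fingerprint`, `scan_repo` and `expensive_read` are hypothetical names, and the ref listing is made up.

```python
import hashlib

def refs_fingerprint(show_ref_output: bytes) -> str:
    """SHA-256 over the raw ref listing; cheap compared to walking commits."""
    return hashlib.sha256(show_ref_output).hexdigest()

def scan_repo(stored_fp, show_ref_output, read_commit_times):
    """Return commit times only when the refs changed, else None (a no-op)."""
    if stored_fp == refs_fingerprint(show_ref_output):
        return None  # unchanged repo: skip the I/O-heavy step entirely
    return read_commit_times()

# Made-up ref listing standing in for `git show-ref' output.
refs = b"1111111111111111111111111111111111111111 refs/heads/master\n"
fp = refs_fingerprint(refs)

calls = []
def expensive_read():
    # Stand-in for the I/O-heavy commit time lookup; records each call.
    calls.append(1)
    return [1701302400]

print(scan_repo(fp, refs, expensive_read))                   # unchanged refs
print(scan_repo(fp, refs + b"extra ref\n", expensive_read))  # changed refs
```

Since the hash depends only on the ref listing, it is also the kind of CPU-bound step that can be handed to parallel workers.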
2023-11-30 spawn: drop IO layer support from redirects
When setting up stdin for commands, the write_file API is convenient enough nowadays to not be worth having special support with process spawning. When reading stdout of commands, we should probably be using utf8_maybe everywhere since there'll always be legacy encodings in git repos. Reading regular files with :utf8 also results in worse memory management since the file size cannot be used as a hint.
2023-11-30 cindex: skip getpid guard for most OnDestroy use
We no longer fork after cidx_init, so there's no need to spend CPU cycles on the getpid() syscall, especially since it's no longer cached on glibc while syscalls are also more expensive these days due to CPU vulnerability mitigations.
2023-11-30 git: share unlinked pack checking code with gcf2
It saves some code in case we keep libgit2 around.
2023-11-30 cindex: store extensions.objectFormat with repo data
This will allow WWW to use a combined LeiALE-like thing to reduce git processes.
2023-11-30 cindex: keep batch pipe for pruning SHA-256 repos
This fixes the case where we're running both SHA-256 and SHA-1. There are no tests for SHA-256 yet, but the bug is pretty obvious upon reading the code.
2023-11-30 cindex: only create {-cidx_err} field on failures
We only use it as a boolean flag, and there's no need to waste space for common, non-error cases.
2023-11-30 config: reject newlines consistently in dir names
Explicitly drop support for "\n" in git coderepo pathnames as we do other stuff. Gcf2 (our libgit2 helper) was always broken with "\n" in pathnames, and I'm not sure if cgit config files work with them, either. Dealing with newline characters requires extra complexity that I'm not willing to deal with when managing alternates files.
2023-11-30 codesearch: allow inbox count to exceed matches
It's entirely possible for public inboxes to have zero patches in them, so the number of match slots may not match the number of joined ekeys.
2023-11-30 cindex: fix store_repo+repo_stored on no-op
It's possible to update the fingerprint for a given repo when we have no commits to index on because they were already done for another repo. Thus we'll always vivify $repo_ctx->{active} before calling store_repo since $active may've been undef.
2023-11-29 www: mail_diff: add missing </pre> tag
Found by tidy(1) while dealing with other stuff.
2023-11-29 www: mail_diff: add final newline before diffing
This gets rid of the "\ No newline at end of file" since it's distracting noise.
2023-11-29 www: mail_diff: fix optional address obfuscation
We need to load the proper package and fully-qualify the sub call since we shouldn't load Hval in lei. Some users use this feature even if it's broken, oh well :<
2023-11-29 lei q: fix --no-import-before completion + docs
--no-import-before skips importing entire messages, not just keywords, so it can cause permanent data loss if -o is pointed to precious data.
2023-11-29 www: load cindex join data for ->ALL, too
This ensures the /all/ extindex can have auto-associations with coderepos just like normal inboxes do.
2023-11-29 www: start working on a repo listing
The HTML is still extremely rough, but links seem to be mostly working...
2023-11-29 cindex: extra quit checks
We don't want to be accessing uninitialized variables on process teardown since much of our control flow revolves around DESTROY for dependency handling.
2023-11-29 admin: resolve_git_dir respects symlinks
Absolute pathnames of git coderepos are stored in the cindex, but we should favor paths relative to $ENV{PWD} since it respects symlinks in the hierarchy. Respecting symlinks makes it easier to migrate cindex to new storage as old storage wears out and to relocate the storage device onto another machine.
2023-11-29 git: speed up Git->new by 5% or so
This becomes noticeable when loading lots of coderepos on my local mirror of git.kernel.org now that we can load repos from cindex.
2023-11-29 cindex: require `-g GIT_DIR' or `-r PROJECT_ROOT'
Accepting @ARGV without switches ends up being ambiguous with optional parameters for --join and --show. Requiring users to specify `--join=' or `--show=' is a bit awkward (as it is with -clone --objstore= and the like), but that is historical baggage we need to carry at this point...
2023-11-29 git: speed up ->git_path for non-worktrees
Only worktrees need to use `git rev-parse --git-path', so avoid the spawn overhead of a new process. With the SolverGit.pm limit on coderepo scans disabled and scanning over 800 git repos for git@vger matches, this reduces xt/solver.t times by roughly 25%.
2023-11-29 www: load and use cindex join data
This is a major step in solving the problem of having to manually associate hundreds/thousands of coderepos with hundreds/thousands of public-inboxes to power solver (and more).
2023-11-29 hval: use File::Spec to make relative paths for href
File::Spec->abs2rel doesn't touch the filesystem at all when given an absolute base arg ($env->{PATH_INFO}), so we can rely on it to generate relative links that work with the `mount' from Plack::Builder and also people running `wget -r' mirrors.
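A comparable pure-string computation exists in Python's `posixpath.relpath`, shown here only to illustrate the idea. `relative_href` is a hypothetical helper, the paths are invented, and treating the directory of the request path as the base is an assumption rather than a description of what Hval does.

```python
import posixpath

def relative_href(target, path_info):
    """Make `target` relative to the directory of the current request path."""
    base = posixpath.dirname(path_info)
    # With absolute inputs this is string manipulation only.
    return posixpath.relpath(target, base)

# Invented paths: a link from a nested page back to a top-level resource.
print(relative_href("/inbox/new.atom", "/inbox/some-msgid/T/"))
print(relative_href("/inbox/new.atom", "/inbox/index.html"))
```

Relative links of this kind keep working when the application is mounted under a different URL prefix or when the pages are mirrored to static files.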
2023-11-29 xap_helper: implement mset endpoint for WWW, IMAP, etc...
The C++ version will allow us to take full advantage of Xapian's APIs for better queries, and the Perl bindings version can still be advantageous in the future since we'll be able to support timeouts effectively.
2023-11-29 xap_helper.h: move cindex endpoints to separate file
It ought to help a bit with organization since xap_helper.h is getting somewhat large and we'll need new endpoints to support WWW, lei, and whatever else needs to come.
2023-11-29 solver: schedule cleanup after synchronous git->check
We don't want hundreds of git cat-file processes for coderepos lingering around.
2023-11-29 codesearch: eliminate redundant substitutions
We store the full path name and xap_terms already removes the `P' character, so the loop and substr calls are a no-op replacing `/' with `/'.
2023-11-29 test_common: create_*: detect changes all parameters
Data::Dumper+B::Deparse seems fast enough to generate cache keys with, so this makes updating and developing tests easier (as opposed to forcing the developer to change the identifier). The main downside is we'll have to deal with cache expiration, but "make clean" seems overly aggressive already (it keeps blowing away the clones made by t/cindex-join.t :<)
2023-11-28 disallow NUL characters in Message-ID and List-Id
While MTAs seem to stop '\0' from appearing in headers, users fetching archives via git remain susceptible to having '\0' land in archives. So we'll filter them out of Xapian and SQLite DBs to avoid interoperability problems with CLI tools since there are no known messages in lore or any of my archives which feature them.
Avoiding '\0' will ensure all indexed Message-IDs and List-Ids can be specified from the command-line (although some characters will still require $(printf) contortions).
As with Message-ID, List-Id fields with /\n\t\r/ characters will also be stripped for indexing. I will assume whatever went wrong with the References: header in <https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw> could also happen to the List-Id header.
This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4 (Import: Don't copy nulls from emails into git, 2018-07-07)
2023-11-27 www: qs_html: fix escaping of `q' param
Our use of MID_ESC characters was only intended for the pathname component of URIs and not appropriate for the query string component. So use a different $unsafe parameter list for uri_escape to make the result appropriate for query strings by disallowing [\&\'\+=] characters.
Most notably, this change also allows us to accept `/' (slash) unescaped to make dfn: queries nicer to look at.
Finally, we'll also add an ascii_html call on the URI-escaped result as an extra safety measure even though it's not really needed. As far as I can tell, the code without this fix didn't result in an HTML injection since all our uses of uri_escape did escape angle brackets.
Reported-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Link: https://public-inbox.org/meta/87o7ff4nlk.fsf@collabora.com/
Tested-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
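The two-step escaping (percent-encode for the query string, then HTML-escape the result) can be sketched in Python. This is an analogy using the standard library, not the Perl qs_html code: `qs_escape` is a hypothetical name, and the exact set of characters left unescaped here (`/` plus Python's defaults) is an assumption.

```python
from html import escape
from urllib.parse import quote

def qs_escape(q):
    """Percent-encode a query value, leaving `/` readable, then HTML-escape."""
    encoded = quote(q, safe="/")       # encodes &, ', +, = and spaces
    return escape(encoded, quote=True)  # extra safety for use inside HTML

# Invented query containing every character class discussed above.
print(qs_escape("dfn:lib/a.c x&y=1+'z'"))
```

After percent-encoding, nothing HTML-significant remains in this input, so the second step leaves it unchanged; it only matters if the first step ever lets such a character through.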
2023-11-27 xap_helper.h: avoid some off_t vs size_t problems
We'll introduce a helper to cast off_t to size_t consistently for mmap/munmap/calloc calls which require size_t. Also, an extra check for multiplication overflow can be helpful just in case we end up with a gigantic roots file.
2023-11-27 xap_helper: avoid strerror(3) inside signal handler
It's not async-signal-safe and the glibc implementation uses malloc via asnprintf. Practically it's not a problem unless the kernel OOMs and the write(2) fails to the self-pipe.
2023-11-26 drop redundant calls to DS->Reset
Reset gets called on END{} anyways to work around DBI lifetime problems, so there's no need to call it near exit. We can't replace calls to POSIX::_exit with `exit' to force END{} to run just yet, as there are still some lingering destruction ordering problems on newer DBI and/or Perls.
2023-11-26 git: improve coupling with {sock} and {inflight} fields
While the {inflight} array should be tied to the IO object even more tightly, that's not an easy task with our current code. So take some small steps by introducing a gcf_inflight helper to validate the ownership of the process and to drain the inflight array via the awaitpid callback. This hopefully fixes problems with t/lei-q-save.t (still) hanging occasionally on v2 outputs since git->cleanup/->DESTROY was getting called in v2 shard workers.