path: root/lib/PublicInbox/SolverGit.pm
Date  Commit message
2024-03-12  www: use a dedicated limiter for blob solver
Wrap the entire solver command chain with a dedicated limiter. The normal limiter is designed for longer-lived commands or ones which serve a single HTTP request (e.g. git-http-backend or cgit) and is not effective for the short memory + CPU intensive commands used for solver. Each overall solver request is both memory + CPU intensive: it spawns several short-lived git processes(*) in addition to a longer-lived `git cat-file --batch' process. Thus running parallel solvers from a single -netd/-httpd worker (which have their own parallelization) results in excessive parallelism that is both memory and CPU-bound (not network-bound) and cascades into slowdowns for handling simpler memory/CPU-bound requests. Parallel solvers were also responsible for the increased lifetime and frequency of zombies since the event loop was too saturated to reap them. We'll also return 503 on excessive solver queueing, since these require an FD for the client HTTP(S) socket to be held onto. (*) git (update-index|apply|ls-files) are all run by solver and short-lived
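A minimal sketch of the limiter idea described above, written in Python rather than the project's Perl. The class, its method names, and the returned strings are hypothetical and not taken from public-inbox; the sketch only shows a cap on running jobs plus a bounded queue that rejects new work (where the daemon would answer 503) once full.

```python
from collections import deque


class Limiter:
    """Run at most `max_running` jobs and queue at most `max_queued` more."""

    def __init__(self, max_running, max_queued):
        self.max_running = max_running
        self.max_queued = max_queued
        self.running = 0
        self.queue = deque()

    def submit(self, job):
        """Return "started", "queued", or "rejected" for the given job."""
        if self.running < self.max_running:
            self.running += 1
            return "started"
        if len(self.queue) >= self.max_queued:
            # Queue is full: shed load (a 503 in an HTTP daemon) instead of
            # holding yet another client socket open.
            return "rejected"
        self.queue.append(job)
        return "queued"

    def finish(self):
        """Mark one job done and promote the oldest queued job, if any."""
        self.running -= 1
        if self.queue:
            self.running += 1
            return self.queue.popleft()
        return None
```

Keeping this limiter separate from the general-purpose one lets the solver's limits be tuned without affecting longer-lived commands.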
2023-11-29  www: load and use cindex join data
This is a major step in solving the problem of having to manually associate hundreds/thousands of coderepos with hundreds/thousands of public-inboxes to power solver (and more).
2023-11-29  solver: schedule cleanup after synchronous git->check
We don't want hundreds of git cat-file processes for coderepos lingering around.
2023-11-03  io: introduce write_file helper sub
This is a pretty convenient way to create files for diff generation in both WWW and lei. The test suite should also be able to take advantage of it.
2023-10-25  httpd/async: require IO arg
Callers that want to requeue can call PublicInbox::DS::requeue directly and not go through the convoluted argument handling via PublicInbox::HTTPD::Async->new.
2023-06-09  add compat package for List::Util::uniqstr
This will make it easier to switch in the far future while making callers easier-to-read (and more callers will be added). Anyways, Perl 5.26 is a long time away for enterprise users; but isolating compatibility code away can improve readability of code we actually care about in the meantime.
2023-05-01  solver_git: don't spew to daemon err on git apply failure
Too many patches don't apply (due to coderepos being a PITA to associate) and interested admins can check for 404s to diagnose them, anyways. This reduces the noise in syslog/stderr for public-facing daemons.
2023-01-24  solver_git: remove extraneous leading `-'
It was a harmless negation, I must've pasted a line from a diff and forgotten to chop off the first character :x Fixes: 6f5b238bae5c "solver: early make hints detection more robust"
2022-12-15  solver_git: more descriptive error for "git apply" failures
This happens quite often on my systems due to scrapers, unfortunately.
2022-10-05  www_coderepo: wire up /$CODEREPO/$OID/s/ endpoint
Just reusing ViewVCS::show, since encoding refname and pathnames into things just makes things slower.
2022-09-11  solver: do not show redundant URLs in log
Messages in /all/ can get duplicated at times due to list-appended signatures or buggy/malicious clients. They'll all show up based on /$INBOX/$MSGID/, so deduplicate the URLs to avoid noise.
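The deduplication above amounts to an order-preserving "uniq". A minimal Python sketch follows; the function name is hypothetical, and the real code is Perl, so this only illustrates the technique.

```python
def uniq_urls(urls):
    """Drop repeated URLs while keeping the first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))
```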
2022-09-02  solver: do not count duplicates in patch count
We're considering duplicate patches from cross-posted lists identical, so don't double-count them when displaying the "applying [X/Y]" message since (successful) duplicates get skipped.
2022-09-02  solver: handle copies properly
At least enough to get 66 patches applied to handle /lore/all/34d644a519c/s/?b=target/arm/helper-mve.h properly. I noticed this bug due to a: E: BUG: extra files in index: <100644 e91f526a1a83edb2b56798388a355b1c3729b4bd 0#011target/arm/translate-mve.c> line in my syslog. However, the $TOTAL in "applying [X/$TOTAL]" in the debug log seems off...
2022-08-29  solver: early make hints detection more robust
Hints fields can change, so we'll use a simple boolean rather than checking a static count. We'll also short-circuit out reliably regardless of hints when a full OID is given.
2022-08-29  solver: create tmpdir lazily
"lei blob" doesn't currently need it at all in some cases, and the next commit will allow viewvcs to share tmpdirs to show commits as HTML.
2022-08-23  qspawn: improve error reporting and handling
First off, avoid potential circular references (via {qx_arg}) by dropping the {-qsp} field from $ctx and SolverGit objects. Instead, we only share a reference to an optional error buffer string {qsp_err}. We'll also attempt to call qspawn.wcb if qx_cb fails, and warn in more places w/o checking for $env since we now rely on warn() instead of $env->{'psgi.errors'}. This makes error handling simpler and safer in future callers.
2022-07-30  solver: avoid deprecation warnings in git 2.36.0+
git deprecated core.fsyncObjectFiles in favor of core.fsync with 2.36.0+, while GIT_TEST_FSYNC was added in 2.35.0. So use the environment variable since it's been supported slightly longer than the new configuration knob.
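A hedged Python sketch of passing the environment variable to a child git process instead of a config knob. The helper name is hypothetical, and the value "0" (meant to turn fsync off for throwaway solver repositories) is an assumption; the commit message only names the variable.

```python
import os


def solver_git_env(base=None):
    """Return a copy of the environment with GIT_TEST_FSYNC set."""
    env = dict(os.environ if base is None else base)
    # Assumption: "0" disables fsync; the variable exists since git 2.35.0.
    env["GIT_TEST_FSYNC"] = "0"
    return env
```

The returned mapping could be handed to something like `subprocess.run([...], env=solver_git_env())`; older git versions would simply ignore the unknown variable.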
2021-11-10  solver: support sha256 coderepos
Tested manually on a newish project I'm working on.
2021-10-09  solver_git: shorten scalar lifetimes
Some of these scalar buffers may be large patches, so try to keep them as short-lived as possible to reduce memory pressure.
2021-06-24  favor git(1) rather than libgit2 for ExtSearch
While both git and libgit2 take around 16 minutes to load 100K alternates there's already a proposed patch to make git faster: <https://lore.kernel.org/git/20210624005806.12079-1-e@80x24.org/> It's also easier to patch and install git locally since the git.git build system defaults to prefix=$HOME and dealing with dynamic linking with libgit2 is more difficult for end users relying on Inline::C. libgit2 remains in use for the non-ALL.git case, but maybe it's not necessary (libgit2 is significantly slower than git in Debian 10 due to SHA-1 collision checking).
2021-03-28  treewide: shorten temporary filename
File::Temp only requires four 'X' characters (unlike mkstemp(3), which requires six). So only give it 4 to avoid an 80-column violation and maybe save metadata space on FSes.
2021-01-01  update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-05  isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-09-12  treewide: avoid `goto &NAME' for tail recursion
While Perl implements tail recursion via `goto', which allows avoiding warnings on deep recursion, it doesn't (as of 5.28) optimize the speed of such dispatches, though it may reduce ephemeral memory usage. Make the code less alien to hackers coming from other languages by using normal subroutine dispatch. It's actually slightly faster in micro benchmarks due to the complexity of `goto &NAME'.
2020-09-10  solver: async blob retrieval for diff extraction
Like the rest of the WWW code, public-inbox-httpd now uses git_async_cat to retrieve blobs without blocking the event loop. This improves fairness when git blobs are on slow storage and allows us to take better advantage of SMP systems.
2020-09-10  solver: break apart inbox blob retrieval
To avoid hogging the event loop in public-inbox-httpd when many candidate messages match, we'll separate the steps to ensure fairness on slow storage.
2020-09-10  solver: check one git coderepo and inbox at a time
With public-inbox-httpd, this mitigates the effect of slow git blob storage with multiple coderepos configured for an inbox. It's still synchronous for now (and may need to remain that way for ->last_check_err), but no longer monopolizes the event loop when checking multiple coderepos. We don't support multi-inbox scanning yet, but this also prepares us for a future where we do. We'll also support >=40 char blob OIDs in preparation for future git SHA-256 support, too.
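To illustrate accepting OIDs longer than 40 characters, here is a small Python sketch. The function name is hypothetical, and treating exactly 40 (SHA-1) or 64 (SHA-256) lowercase hex digits as a "full" OID is this sketch's assumption about what a full OID means, not the project's code.

```python
import re

HEX_RE = re.compile(r"[0-9a-f]+")


def is_full_oid(oid):
    """True for a 40-digit (SHA-1) or 64-digit (SHA-256) lowercase hex OID."""
    return len(oid) in (40, 64) and HEX_RE.fullmatch(oid) is not None
```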
2020-09-10  solver: drop warnings, modernize use v5.10.1, use SEEK_SET
With Perl upstream preparing to deprecate things, we'll move towards only enabling warnings during development via shebang and stop enabling them via "use". We'll also favor "use v5.10.1" over the Perl 5.6-compatible "use 5.010_001", since our code base never worked on 5.6. Finally, we were also importing SEEK_SET without using it; just use it for readability since we can't avoid loading Fcntl in other places and it'll get constant-folded, anyways.
2020-09-03  search: replace ->query with ->mset
Nearly all of the search uses in the production code rely on a Xapian mset iterator being returned (instead of an array of $smsg objects). So default to returning the mset and move the burden of smsg array conversion into the test cases.
2020-07-10  viewvcs: stop checking unused "B" query parameter
The resulting OID ("oid_b") is a required arg and part of $env->{PATH_INFO}, instead; so it's never part of an optional query parameter.
2020-06-03  inbox: introduce smsg_eml method
The goal of this is to eventually remove the $smsg->{mime} field which is easy-to-misuse and cause memory explosions which necessitated fixes like commit 7d02b9e64455831d ("view: stop storing all MIME objects on large threads").
2020-05-09  msg_iter: make ->each_part method for PublicInbox::MIME
The reliance on Email::MIME->subparts is a tad inefficient with a work-in-progress module to replace Email::MIME. So move towards using ->each_part as a class-specific iterator which can take advantage of more class-specific optimizations in the yet-to-be-revealed PublicInbox::Eml and PublicInbox::Gmime classes. The msg_iter() sub remains for compatibility with existing 3rd-party scripts/modules which use our small public Perl API and Email::MIME.
2020-03-22  rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-20  www: avoid `state' usage to perform allocations up-front
We want WWW->preload to get as many immortal allocations done as possible, and the `state' feature from Perl 5.10 prevents that.
2020-02-06  treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-01  solver: join multiple URLs with "||"
It seems to make sense to the target audience that any of the URLs displayed could work.
2020-01-25  spelling: favor `publicly' over `publically'
While both can be correct, the former seems more common, is shorter, and is also consistent with the spelling found in the AGPL-3.0 text.
2020-01-13  solver: path_a may be undef from /dev/null
This avoids uninitialized variable warnings when viewing newly-created files.
2020-01-12  www: discard multipart parent on iteration
We're often iterating through messages while writing to another buffer in our WWW interface, causing memory usage to multiply. Since we know we won't need to keep the MIME object around in some cases, we can tell msg_iter to clobber the on-stack variable while it operates on subparts of multipart messages. With xt/mem-msgview.t switched to multipart from the previous commit, this shows a 13 MB memory reduction on that test.
2020-01-06  treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2020-01-04  solver: allow literal '\r' character in diff lines
While filenames are escaped, the actual diff contents may contain an unescaped "\r" carriage return byte not in front of the "\n" line feed. So just allow "\r" to appear in the middle of a line.
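A Python analogue of the problem, offered as an illustration only (the actual fix is in a Perl regexp). Splitting on "\n" alone keeps a bare carriage return inside its line, whereas `str.splitlines()` also breaks on "\r" and would cut the diff line in two. The function name and sample text are made up for this sketch.

```python
def diff_lines(patch_text):
    """Split on LF only, so a bare CR stays in the middle of its line."""
    return patch_text.split("\n")


sample = "+left\rright\n context\n"
lf_only = diff_lines(sample)       # CR kept inside the first line
universal = sample.splitlines()    # CR treated as a line boundary
```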
2020-01-04  solver: minor cleanups to diff extraction
Initialize the $di hashref at use to make it more obvious it's a local variable. We can also use the :utf8 IO layer via open+print to save ourselves the trouble of converting the UTF-8 patch to an octet stream.
2020-01-04  solver: do not enforce order on extended headers
This is needed to work with patches with many renames, such as what makes "git/eebf7a8/s/?b=t%2Ftest-lib.sh"
2020-01-03  qspawn: use per-call quiet flag for solver
solver can spawn multiple processes per HTTP request, but "git apply" failures are needlessly noisy due to corrupt patches. We also don't want to silence "git ls-files" or "git update-index" errors using $env->{'qspawn.quiet'}, either, so this granularity is needed. Admins can check for 500 errors in access logs to detect (and reproduce) solver failures, anyways, so there's no need to log every time "git apply" rejects a corrupt patch.
2020-01-03  solver: extract_diff: deal with missing "diff --git" line
Rewrite the patch extraction loop using a single regexp which accounts for missing "diff --git ..." lines and is capable of extracting pathnames off the "+++ b/foo" line. This fixes the solving of blob "96f1c7f" off <2841d2de-32ad-eae8-6039-9251a40bb00e@tngtech.com> in git@vger archives. v2: * Fix regressions in git@vger archives: - git/776fa90f7f/s/?b=contrib/git-jump/git-jump (fallback to "old mode" properly) - git/5cd8845/s/?b=submodule.c (no leading space in context) * use "state" in a Perl <5.28.0-compatible way
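A simplified Python sketch of falling back to the "+++ b/foo" line when "diff --git" is absent. Unlike the commit, which uses a single regexp, this uses two patterns for readability; the names are hypothetical, and quoted or escaped pathnames are not handled here.

```python
import re

GIT_RE = re.compile(r"^diff --git a/(.+) b/(.+)$", re.MULTILINE)
PLUS_RE = re.compile(r"^\+\+\+ b/(.+)$", re.MULTILINE)


def path_b(patch_text):
    """Return the post-image path, or None if neither header is present."""
    m = GIT_RE.search(patch_text)
    if m:
        return m.group(2)
    # No "diff --git" line: take the path from the "+++ b/..." line instead.
    m = PLUS_RE.search(patch_text)
    return m.group(1) if m else None
```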
2020-01-03  solver: try the next patch on apply failures
Sometimes a patch is corrupted and resent to create the same OID. We need to account for that case and actually move onto the next patch instead of blindly trying "git ls-files" to get nothing out of it.
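The control flow above can be sketched in Python as follows. `try_apply` stands in for running "git apply" and reporting success; both names are hypothetical and not part of public-inbox.

```python
def apply_first_good(patches, try_apply):
    """Return the index of the first patch that applies, or None."""
    for idx, patch in enumerate(patches):
        if try_apply(patch):
            return idx
        # Apply failed (e.g. a corrupted copy): move on to the next
        # candidate instead of inspecting the index for a result.
    return None
```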
2019-12-30  spawn: support chdir via -C option
This simplifies our admin module a bit and allows solver to be used with v1 inboxes using git versions prior to v1.8.5 (but still >= git v1.8.0).
2019-12-30  spawn: allow passing GLOB handles for redirects
We can save callers the trouble of {-hold} and {-dev_null} refs as well as the trouble of calling fileno().
2019-12-28  solvergit: allow passing arg to user-supplied callback
This allows us to get rid of the requirement to capture on-stack variables with an anonymous sub, as illustrated with the update to viewvcs to take advantage of this. v2: fix error handling for missing OIDs
2019-12-26  msg_iter: provide means to stop using anonymous subs
And remove the last anonymous sub in SolverGit itself.