path: root/lib/PublicInbox/Git.pm
Date Commit message
2024-05-09 treewide: reduce $PATH checks for `git' executable
Repeatedly checking $PATH for `git' when we need to call it multiple times in quick succession doesn't seem useful. So avoid some expensive stat(2) syscalls to make things less bad for systems which require expensive CPU vulnerability mitigations. This also saves a bunch of memory allocations since we do the $PATH lookup in pure Perl to avoid doing the uncacheable lookup in a vfork-ed child.
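The idea of doing the $PATH lookup once in pure Perl can be sketched roughly as follows. This is only an illustration: `which_cached` and `%which_cache` are hypothetical names, and the commit itself may cache the result quite differently.

```perl
use strict;
use warnings;

my %which_cache; # keyed by command name plus the $PATH it was resolved against

# Hypothetical helper: resolve $cmd against $PATH once, then reuse the answer.
sub which_cached {
	my ($cmd) = @_;
	my $path = $ENV{PATH} // '';
	$which_cache{"$cmd\0$path"} //= do {
		# -x issues the stat(2); `-d _' reuses that result to skip directories
		my ($found) = grep { -x $_ && !-d _ } map { "$_/$cmd" } split(/:/, $path);
		$found // ''; # empty string means "not found"
	};
}
```

Keying the cache on the current value of $PATH means a changed environment triggers a fresh lookup, while repeated calls in quick succession cost no syscalls.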
2024-04-03 treewide: avoid getpid for more ownership checks
There are still some places where on_destroy isn't suitable. This gets rid of getpid() calls in most of those cases to reduce syscall costs and clean up syscall trace output.
2024-03-08 dedupe inbox names, coderepo nicks + git dirs
Inbox names, coderepo nicks, and git_dir values are used heavily as hash keys by the read-only coderepo WWW pieces. Relying on CoW for mutable scalars on newer Perl doesn't work well since CoW for those scalars is limited to 256 CoW references, and we blow past that number when mapping thousands of coderepos and inboxes to each other. Instead, make the hash key up-front and get the resulting string to point directly to the pointer used by the hash key.
2024-01-10 git: lowercase host in host_prefix_url
This will make it more effective for use as a cache key. I'm not entirely happy with this sub being in the Git module since it's used by lei and command-line tools, but that's for another day to deal with...
2024-01-10 git: workaround occasional -watch error message
I'm not sure how this happens (perl 5.34.1 on FreeBSD 13.2) but it appears the {sock} check can succeed and then go undef and become unable to call ->owner_pid. This happens when libgit2 is in use, so perhaps that's a factor. In any case, the rest of the tests succeed.
2023-12-16 git: quiet down `rev-parse --git-path' errors
This fixes t/mda.t with git 1.8.5
2023-11-30 git: share unlinked pack checking code with gcf2
It saves some code in case we keep libgit2 around.
2023-11-29 git: speed up Git->new by 5% or so
This becomes noticeable when loading lots of coderepos on my local mirror of git.kernel.org now that we can load repos from cindex.
2023-11-29 git: speed up ->git_path for non-worktrees
Only worktrees need to use `git rev-parse --git-path', so avoid the spawn overhead of a new process. With the SolverGit.pm limit on coderepo scans disabled and scanning over 800 git repos for git@vger matches, this reduces xt/solver.t times by roughly 25%.
2023-11-29 solver: schedule cleanup after synchronous git->check
We don't want hundreds of git cat-file processes for coderepos lingering around.
2023-11-26 git: improve coupling with {sock} and {inflight} fields
While the {inflight} array should be tied to the IO object even more tightly, that's not an easy task with our current code. So take some small steps by introducing a gcf_inflight helper to validate the ownership of the process and to drain the inflight array via the awaitpid callback. This hopefully fixes problems with t/lei-q-save.t (still) hanging occasionally on v2 outputs since git->cleanup/->DESTROY was getting called in v2 shard workers.
2023-11-26 git: move rbuf handling to PublicInbox::IO
The long-term plan is to share non-blocking read buffering logic with HTTP/NNTP/IMAP/POP3 and also XapClient.
2023-11-20 git: return upon self->close
I encountered the odd lack of `return' while chasing Gcf2 bugs on CentOS 7.x which resulted in commit 7d06b126e939 ("gcf2: fix autodie usage for older Perl") and commit e618c7654794 ("gcf2client: add alias for PublicInbox::Git::fail") before realizing the lack of `return' here wasn't the culprit behind failures on CentOS 7.x. However, the use of a `return' here appears required in case we actually hit the error path, since falling through and attempting my_readline with an undefined filehandle is always a failure. Fixes: e97a30e7624d ("lei: fix SIGPIPE on large result sets to pager")
2023-11-07 lei: fix SIGPIPE on large result sets to pager
When dealing with large search results, we need to deal with EPIPE not just from the pager, but also EPIPE or ECONNRESET between lei_xsearch and lei2mail processes. Without this fix, lei_xsearch processes could linger and get stuck writing to dead lei2mail processes if a user aborts the pager early during a large result set. To ensure lei_xsearch processes don't linger around after lei2mail workers all die, we must close $l2m->{-wq_s2} before spawning lei_xsearch processes, since $l2m->{-wq_s2} is only used in lei2mail workers. For `git cat-file' processes, we also need to trigger PublicInbox::Git->close to handle unpredictable destructor ordering to avoid using uninitialized IO refs. This combines with the `git_to_mail' change to deal with process cleanup handling from premature shutdowns. To test all this, we can't just rely on a single message being large, but also need to rely on the result set being large enough to saturate the lei_xsearch -> lei2mail socket so we rely on GIANT_INBOX_DIR once again.
2023-11-03 move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-11-03 replace ProcessIO with untied PublicInbox::IO
This fixes two major problems with the use of tie for filehandles:

* no way to do fcntl, stat, etc. calls directly on the tied handle, forcing callers to use the `tied' perlop to access the underlying IO::Handle
* needing separate classes to handle blocking and non-blocking I/O

As a result, Git->cleanup_if_unlinked, InputPipe->consume, and Qspawn->_yield_start have fewer bizarre bits and we can call `$io->blocking(0)' directly instead of `(tied *$io)->{fh}->blocking(0)'.

Having a PublicInbox::IO class will also allow us to support custom read buffering which allows inspecting the current state.
2023-11-02 treewide: use ->close method rather than CORE::close
It's easier to read and should open the door for us to get rid of `tie' for ProcessIO without performance penalties for more frequently-used perlop calls, and gives us the ability to do `stat' directly on the object instead of the awkward `tied' thing.
2023-11-01 git: reschedule cleanup if active
This is necessary to reliably cleanup cat-file processes for coderepos in long-lived -netd and -httpd processes if they haven't been accessed in a while. Followup-to: 33e99002c552 (git: cleanup un-associated coderepo processes)
2023-10-28 git: avoid extra stat(2) for git version
No sane installer will update executable files in place due to ETXTBSY on execve. So save ourselves a stat(2) call by relying on the special `CORE::stat(_)' case to reuse the cached result from the `-x FILE' filetest in which().
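A minimal sketch of the stat-cache reuse described above, assuming a hypothetical helper named `exe_stamp` (the real code uses `CORE::stat(_)` explicitly because it otherwise works with Time::HiRes::stat; plain `stat(_)` is enough here since nothing overrides it):

```perl
use strict;
use warnings;

# Hypothetical helper: fingerprint an executable with a single stat(2) call.
sub exe_stamp {
	my ($path) = @_;
	return undef unless -x $path; # the filetest fills Perl's stat cache
	my @st = stat(_);             # `_' reads that cache; no second syscall
	"$st[7].$st[9]";              # size and mtime joined by a dot
}
```

The special `_` filehandle only holds the result of the most recent stat or filetest, so the `stat(_)` call has to follow the `-x` test directly.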
2023-10-28 git: use run_qx to read `git --version'
It exists, now, so save us a few lines of code.
2023-10-26 git: cleanup un-associated coderepo processes
It's possible to have many coderepos with no inbox association that never see git->cleanup. So instead of tying git->cleanup to inboxes, ensure it gets armed when ->watch_async is called (since it's only called in our -netd or -httpd servers).
2023-10-25 lei_mirror+fetch: don't slurp `git show-ref' output
While uncommon, some git repos have hundreds of thousands of refs and slurping that output into memory can bloat the heap. Introduce a sha_all sub in PublicInbox::SHA to loop until EOF and rely on autodie for checking sysread errors.
2023-10-25 spawn: support synchronous run_qx
This is similar to `backtick` but supports all our existing spawn functionality (chdir, env, rlimit, redirects, etc.). It also supports SCALAR ref redirects like run_script in our test suite for std{in,out,err}. We can probably use :utf8 by default for these redirects, even.
2023-10-18 use read_all in more places to improve safety
`readline' ops may not detect errors on partial reads. This saves us some code to reduce cognitive overhead for readers. We'll also support reusing a destination buffer so it can work more nicely with existing code.
2023-10-18 git: introduce read_all function
This makes it easier to improve error checking, since the `do { local $/; readline(FH) }' construct does not detect errors (autodie does not cover `readline' or `<FH>'). I'm not sure exactly where this should be, but PublicInbox::Git is used nearly everywhere in our code base and it's probably not worth creating a new package for it.
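To illustrate the difference from the `do { local $/; readline(FH) }' idiom, here is a rough sketch of an error-checking slurp. `slurp_checked` is a hypothetical name and is not the project's read_all; it only shows the general shape of looping on sysread and failing loudly.

```perl
use strict;
use warnings;

# Hypothetical helper, not PublicInbox's read_all: read until EOF and
# die on a read error instead of silently returning a short result.
sub slurp_checked {
	my ($fh) = @_;
	my $buf = '';
	while (1) {
		# append to $buf by reading at offset length($buf)
		my $r = sysread($fh, $buf, 65536, length($buf));
		if (!defined $r) {
			next if $!{EINTR}; # interrupted by a signal: retry
			die "sysread: $!";
		}
		return $buf if $r == 0; # 0 means EOF
	}
}
```

Because sysread reports failure as undef and EOF as 0, the two cases can be told apart, which is what the slurp idiom cannot do.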
2023-10-08 introduce ProcessIONBF for multiplexed non-blocking IO
This is required for reliable epoll/kevent/poll/select wakeup notifications, since we have no visibility into the buffer states used internally by Perl. We can safely use sysread here since we never use the :utf8 nor any :encoding Perl IO layers for readable pipes. I suspect this fixes occasional failures from t/solver_git.t when retrieving the WwwCoderepo summary.
2023-10-08 rename ProcessPipe to ProcessIO
Since we deal with pipes (of either direction) and bidirectional stream sockets for this class, it's better to remove the `Pipe' from the name and replace it with `IO' to communicate that it works for any form of IO::Handle-like object tied to a process.
2023-10-01 git: packed_bytes: deal with glob+stat TOCTTOU
There's not much we can do about this aside from just ignoring errors and considering un-stat-able files as zero-sized. There's no syscalls which expose FUSE3 `readdirplus' type functionality to userspace to avoid this problem.
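The "treat un-stat-able files as zero-sized" approach might look something like the sketch below. `total_bytes` is a hypothetical name, and the pattern it receives is an assumption; the real packed_bytes may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the sizes of files matching a glob pattern,
# counting any file that disappears between glob and stat as 0 bytes.
sub total_bytes {
	my ($pattern) = @_;
	my $total = 0;
	for my $f (glob($pattern)) {
		my @st = stat($f) or next; # vanished after glob: ignore the error
		$total += $st[7];          # element 7 is the size in bytes
	}
	$total;
}
```

A file removed by a concurrent repack simply contributes nothing, so the caller gets a slightly stale total rather than an exception.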
2023-10-01 git: improve error reporting
We can use autodie for socketpair to handle errors for us, but we need Time::HiRes::stat so we must write the error message ourselves if stat-ing the git executable fails.
2023-10-01 git+gcf2client: switch to level-triggered wakeups
Instead of using ->requeue to emulate level-triggered wakeups in userspace, just use level-triggered wakeups in the kernel to save some user time at the expense of system (kernel) time. Of course, the ready list implementation in the kernel via C is faster than a Perl one on our end. We must still use requeue if we've got buffered data, however. Followup-to: 1181a7e6a853 (listener: switch to level-triggered epoll)
2023-10-01 git: use Unix stream sockets for `cat-file --batch-*'
The benefit of 1MB potential pipe buffer size (on Linux) doesn't seem noticeable when reading from git (unlike when writing to v2 shards), so Unix stream sockets seem fine, here. This allows us to simplify our process management by using the same socket FD for reads and writes and enables us to use our ProcessPipe class for reaping (as we can do with Gcf2Client). Gcf2Client no longer relies on PublicInbox::DS for write buffering, and instead just waits for requests to complete once the number of inflight requests hits the MAX_INFLIGHT threshold as we do with PublicInbox::Git. We reuse the existing MAX_INFLIGHT limit (18) that was determined by the minimum allowed PIPE_BUF (512). (AFAIK) Unix stream sockets have no analogy to PIPE_BUF, but all *BSDs and Linux I've checked have default SO_RCVBUF and SO_SNDBUF values larger than the previously-required PIPE_BUF size of 512 bytes.
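As a rough illustration of using one socket FD for both reads and writes, the sketch below talks to a forked child over a Unix stream socketpair. The child is a hypothetical stand-in that upper-cases its input; it is not `git cat-file', and `spawn_echo` is an invented name.

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_STREAM);
use POSIX ();

# Hypothetical stand-in for a batch child process: it upper-cases whatever
# it reads and writes the result back on the same socket.
sub spawn_echo {
	socketpair(my $ours, my $theirs, AF_UNIX, SOCK_STREAM, 0)
		or die "socketpair: $!";
	my $pid = fork;
	die "fork: $!" unless defined $pid;
	if ($pid == 0) { # child
		close $ours;
		while (sysread($theirs, my $buf, 4096)) {
			syswrite($theirs, uc $buf);
		}
		POSIX::_exit(0); # skip END blocks and destructors in the child
	}
	close $theirs;
	($ours, $pid); # one FD serves both requests and responses
}
```

With a pipe pair the parent would hold two descriptors per child; here closing the single socket is enough to signal EOF to the child.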
2023-10-01 git: decouple cat_async_retry from POSIX pipe semantics
While pipes guarantee writes of <= 512 bytes to be atomic, Unix stream sockets (or TCP sockets) have no such guarantees. Removing the pipe assumption will make it possible for us to switch to bidirectional Unix stream sockets and save FDs with `git cat-file' processes as we have with Gcf2Client. The performance benefit of larger pipe buffers over stream sockets isn't relevant when interacting with git as it is with SearchIdx shards.
2023-09-30 git: fix unused code path for cat-file stderr reset
We haven't used _bidi_pipe idempotently in a while, so the stderr was never getting reset on reads. This isn't fully useful when using async eeeae20893a25956 (imap: use git-cat-file asynchronously, 2020-06-10). So instead of truncating it on reads, we'll truncate immediately after reading and rely on O_APPEND to keep new writes at the end. Fortunately, this stderr error checking isn't used outside of solver (which is synchronous).
2023-09-29 git: calculate MAX_INFLIGHT properly in Perl
Unlike C, Perl automatically converts quotients to double-precision floating point even with UV/IV numerators and denominators. So force the intermediate quotient to be an integer before multiplying it by the size of each inflight array element. This bug was inconsequential for all platforms since d4ba8828ab23f278 (git: fix asynchronous batching for deep pipelines, 2023-01-04) and inconsequential on most (or all?) Linux even before that due to the larger 4096-byte PIPE_BUF on Linux.
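The pitfall can be shown with a small sketch. The figures and names below are illustrative assumptions only and are not the project's actual MAX_INFLIGHT formula.

```perl
use strict;
use warnings;

# Hypothetical figures for illustration only.
my $pipe_buf = 512; # minimum PIPE_BUF allowed by POSIX
my $req_size = 65;  # assumed bytes per request line
my $elems    = 3;   # assumed array elements stored per request

# The quotient stays fractional in Perl, so the product is not a multiple of $elems.
sub naive_limit { ($pipe_buf / $req_size) * $elems }

# Truncate the quotient first, as C integer division would, then multiply.
sub fixed_limit { int($pipe_buf / $req_size) * $elems }

printf "naive=%s fixed=%d\n", naive_limit(), fixed_limit();
```

With these numbers the naive form yields a value between 23 and 24, while the truncating form yields 21, which is a whole number of requests.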
2023-04-29 git: make check_async callbacks identical to cat_async
This simplifies Git->cat_async_step and fixes Git->async_abort, the latter of which was passing arguments improperly for the --batch-check (or `info') case at the cost of making the few check_async callers handle an extra argument. The extra (PublicInbox::Git) $self argument for check_async callbacks is now gone, as avoiding the temporary cyclic reference doesn't seem worthwhile since the temporary cyclic reference appears in the ->cat_async code paths, too.
2023-04-20 cindex: support sha256 coderepos alongside sha1
This special support is only needed for --prune at the moment since the indexing side works on a per-repo basis. There's no automated tests, yet, but it seems to work well on my sha256 projects when sharing a cindex with sha1 projects.
2023-04-12 git: parallelize manifest_entry
This saves a few milliseconds per-epoch without incurring any dependencies on the event loop. It can be parallelized further, of course, but it may not be worth it for -extindex users since it's already cached.
2023-04-12 git: rename version() to git_version()
In case it gets confused with Inbox->version or similar.
2023-04-12 git: cat_async_step: reduce batch-command info checks
This improves readability for me. Instead of checking for `info ' requests of `--batch-command' in multiple places of every common branch, do it once per-call and stash its result. We'll also avoid storing `$bc' for now since the only other check is in a cold path.
2023-04-11 git: fix cat_async_retry
Retrying requests on alternates changing was causing inflight requests to get lost due to {inflight} getting clobbered by batch_prepare. Unfortunately, reproducing this is difficult without mocking ->alternates_changed. SearchIdx now avoids calling ->batch_prepare directly and relies on more common API functions. Fixes: 65db62eb006f ("git: use --batch-command in git 2.36+ to save processes")
2023-04-07 switch git version comparisons to vstrings, too
There's too many require_git callsites in t/*.t to change, but we can make the rest of the code more readable and reuse PublicInbox::Git::version() in our test suite, too.
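For readers unfamiliar with vstrings, a minimal sketch of version comparison follows. `to_vstring` and `git_ge` are hypothetical helpers and do not reflect how PublicInbox::Git parses `git --version' output.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "2.39.1" into the same string as the literal v2.39.1
sub to_vstring { join('', map { chr($_) } split(/\./, $_[0])) }

# Hypothetical helper: is the version found in $version_line at least $min?
sub git_ge {
	my ($version_line, $min) = @_;
	my ($ver) = ($version_line =~ /(\d+(?:\.\d+)+)/) or return 0;
	to_vstring($ver) ge $min; # string comparison works component by component
}
```

A call such as `git_ge('git version 2.9.5', v2.36.0)` is false, whereas comparing the plain strings "2.9.5" and "2.36.0" with `ge` would wrongly succeed because "9" sorts after "3".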
2023-03-27 git: check for --version errors
While unlikely, `git --version' may fail, so we must check for errors by reaping the process ASAP via tied close().
2023-02-03 lei_mirror: use --no-write-fetch-head on git 2.29+
This avoids unnecessary writes to the FETCH_HEAD file, which is worthless in multi-remote mirrors. Actually, I haven't found FETCH_HEAD useful anywhere since the `/remotes/' namespace became popular...
2023-01-30 use Net::SSLeay (OpenSSL) for SHA-(1|256) if installed
On my x86-64 machine, OpenSSL SHA-256 is nearly twice as fast as the Digest::SHA implementation from Perl, most likely due to an optimized assembly implementation. SHA-1 is a few percent faster, too.
2023-01-27 git: drop needless checks for old git
`ambiguous' was added in git 2.21, and `dangling' was the only other possible phrase which was inadvertently slipped in prior to 2.21. Thus there's no need to check for `notdir' or `loop' responses since we aren't using `git cat-file --follow-symlinks' anywhere.
2023-01-27 git: use --batch-command in git 2.36+ to save processes
`git cat-file --batch-command' combines the functionality of `--batch' and `--batch-check' into a single process. This reduces the number of running processes and is primarily useful for coderepos (e.g. solver). This also fixes prior use of `print { $git->{out} }' which is a potential (but unlikely) bug since commit d4ba8828ab23f278 (git: fix asynchronous batching for deep pipelines, 2023-01-04). Lack of libgit2 on one of my test machines also uncovered fixes necessary for t/imapd.t, t/nntpd.t and t/nntpd-v2.t.
2023-01-26 git: reduce delete ops in _destroy
We can avoid some extra returns and branches by just relying on variadic arguments.
2023-01-26 git: drop needless ENOENT import
I imported it in commit 356439a571c536eaa487031802b436d087113f4f (gcf2 + extsearch: check for unlinked files on Linux, 2021-09-22) but never used it.
2023-01-26 process_pipe: warn hackers off using it for bidirectional pipes
While most uses of ->DESTROY happen in a predictable order in long-lived daemons, process teardown on exit is chaotic and not subject to ordering guarantees, so we must keep both ends of a `git cat-file --batch*' pipe at the same level in the object hierarchy. Drop an old Carp import while I'm in the area.
2023-01-26 git: use core.abbrev=no on git 2.31+
This makes it easier to support SHA-256 inboxes in the future. Tested with both git 2.30.2 (Debian stable) and 2.39.1