path: root/lib/PublicInbox/Git.pm
Date Commit message
2020-11-30 git: ensure subclassed ->fail gets called
Some of these changes may not be strictly necessary, but they make the code easier to maintain and change. Hackers using/modifying this code will no longer wonder if a particular callsite needs to care about subclasses or not.
2020-11-30 git: set non-blocking flag in case of other bugs
This makes GitAsyncCat more resilient to bugs in Gcf2 or even git-cat-file itself. I noticed -imapd stuck on read(2) from the Gcf2 pipe, so there may be a bug somewhere in Gcf2 or PublicInbox::Git. This should make us more resilient to them and hopefully help us notice and fix them.
2020-11-24 git: add manifest_entry method
We'll be using this for MiscIdx and pre-generating the necessary JSON for manifest.js.gz, so make it easier to share code for generating per-repo JSON entries for grokmirror.
2020-10-17 git: introduce async_wait_all
->cat_async and ->check_async may trigger each other (in future callers) while waiting, so we need a unified method to ensure both complete. This doesn't affect current code, but allows us to slightly simplify existing callers.
2020-10-16 git: async: loop inflight checks for nested callbacks
We need to loop the inflight check for nested callback invocations to ensure we don't clog the pipe that feeds `git cat-file'. This bug was obscured by the fact that we're already accounting for 64-char git OIDs with SHA-256 in the pipe space calculation; perhaps we shouldn't do that.
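Why the inflight check has to be a loop can be shown with a small simulation. This is a sketch in Python, not the project's Perl: the class `AsyncCat`, its `backend` callable (standing in for the `git cat-file' process), the assumed pipe capacity and the reduction of "pipe space" to a simple request count are all hypothetical simplifications.

```python
from collections import deque

PIPE_CAPACITY = 4096   # assumed pipe buffer size, for illustration only
REQUEST_SIZE = 65      # 64-char SHA-256 hex OID plus a newline


class AsyncCat:
    def __init__(self, backend, max_inflight=PIPE_CAPACITY // REQUEST_SIZE):
        self.backend = backend        # callable standing in for git cat-file
        self.max_inflight = max_inflight
        self.inflight = deque()       # (oid, callback) pairs awaiting a reply

    def _step(self):
        """Consume one response and run its callback."""
        # Dequeue before invoking the callback so a nested cat_async()
        # call sees a consistent queue.
        oid, cb = self.inflight.popleft()
        cb(oid, self.backend(oid))

    def cat_async(self, oid, cb):
        # `while`, not `if`: a callback run by _step() may itself call
        # cat_async() and refill the queue, so re-check after every step.
        while len(self.inflight) >= self.max_inflight:
            self._step()
        self.inflight.append((oid, cb))

    def wait_all(self):
        """Drain every outstanding request."""
        while self.inflight:
            self._step()
```

With a single `if` in place of the `while`, a callback that queues another request would let the queue grow past `max_inflight`, which is the simulated equivalent of clogging the pipe.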
2020-10-16 git: *_async: support nested callback invocations
For external indices, we'll need to support nested cat_async invocations to deduplicate cross-posted messages. Thus we need to ensure we do not clobber the {inflight*} queues while stepping through and ensure {cat_rbuf} is stored before invoking callbacks. This fixes the ->cat_async-only case, but does not yet account for the mix of ->check_async interspersed with ->cat_async calls. More work will be needed on that front at a later date.
2020-10-16 git: ensure ->destroy clobbers check_async read buffer
It's currently not a problem as ->destroy doesn't happen for no reason, but we'll need to ensure future uses of ->destroy correctly discard the check_async buffer.
2020-09-19 gcf2: wire up read-only daemons and rm -gcf2 script
It seems easiest to have a singleton Gcf2Client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19 gcf2: require git dir with OID
This amortizes the cost of recreating PublicInbox::Gcf2 objects when alternates change in v2 all.git.
2020-09-19 gcf2: transparently retry on missing OID
Since we only get OIDs from trusted local data sources (over.sqlite3), we can safely retry within the -gcf2 process without worrying about clients spamming us with requests for invalid OIDs and triggering reopens.
2020-09-16 treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
2020-08-27 git: show more context info on failures
I'm seeing "read: Connection timed out" in my syslog from -httpd. The fail() calls in PublicInbox::Git seem to be the only code path of ours which could trigger it... ETIMEDOUT shouldn't happen on pipes, only sockets; and all of our socket operations are non-blocking. So this could be cgit-wwwhighlight-filter.lua, but that's connecting over localhost, though on fairly loaded HW.
2020-07-26 imap: introduce and use Git->async_prefetch
We can keep the git process more active by sending another request to it while fetch_run_ops() is running. This parallelization speeds up mutt's initial FETCH for headers by around 35%(!).
2020-07-25 searchidx: support async git check
This allows v1 indexing to run while the `cat-file --batch-check' process is waiting on high-latency storage.
2020-07-06 git: use v5.10.1, parent.pm and Time::HiRes::stat
parent.pm is leaner than base.pm, and Time::HiRes::stat is more accurate, so take advantage of these Perl 5.10+-isms since it's been over a year since we left 5.8 behind.
2020-06-25 git_async_cat: remove circular reference
While this circular reference was carefully managed to not leak memory, it was still triggering a warning at -imapd/-nntpd shutdown due to the EPOLL_CTL_DEL op failing after the $Epoll FD gets closed. So remove the circular reference by providing a ref to `undef', instead.
2020-06-13 git: async: automatic retry on alternates change
This matches the behavior of the existing synchronous ->cat_file method. In fact, ->cat_file now becomes a small wrapper around the ->cat_async method.
2020-06-13 git: move async_cat reference to PublicInbox::Git
Trying to avoid a circular reference by relying on $ibx object here makes no sense, since skipping GitCatAsync::close will result in an FD leak, anyways. So keep GitAsyncCat contained to git-only operations, since we'll be using it for Solver in the distant future.
2020-06-13 git: cat_async: provide requested OID + "missing" on missing blobs
This will make it easier to implement the retries on alternates_changed() of the synchronous ->cat_file API.
2020-06-13 git: idle rbuf for async
We do this for the C10K-oriented HTTP/NNTP/IMAP processes, and we may support thousands of git-cat-file processes in the future.
2020-06-13 imap: use git-cat-file asynchronously
This ought to improve overall performance with multiple clients. Single client performance suffers a tiny bit due to extra syscall overhead from epoll. This also makes the existing async interface easier-to-use, since calling cat_async_begin is no longer required.
2020-06-13 git: do our own read buffering for cat-file
To work with our event loop, we must perform read buffering ourselves or risk starvation, as there doesn't appear to be a way to check the amount of data buffered in userspace by the PerlIO layers without resorting to C or XS. This lets us perform fewer syscalls at the expense of more Perl ops. As it stands, there seems to be a tiny performance improvement, but more will be possible in the future.
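The general shape of doing one's own read buffering on a `git cat-file --batch' pipe can be sketched as below. This is Python rather than the project's Perl, and the class name `BatchReader`, its methods and the chunk size are hypothetical; only the response framing (`<oid> <type> <size>` header line, contents, trailing newline, or `<name> missing`) comes from git's documented batch output.

```python
import os


class BatchReader:
    """Buffer reads from a `git cat-file --batch` style pipe ourselves."""

    def __init__(self, fd, chunk=65536):
        self.fd = fd
        self.chunk = chunk
        self.rbuf = b""   # bytes read from the pipe but not yet consumed

    def _fill(self):
        # One read(2) per call; everything lands in our own buffer.
        data = os.read(self.fd, self.chunk)
        if not data:
            raise EOFError("unexpected EOF from cat-file pipe")
        self.rbuf += data

    def _readline(self):
        while b"\n" not in self.rbuf:
            self._fill()
        line, _, self.rbuf = self.rbuf.partition(b"\n")
        return line

    def _readn(self, n):
        while len(self.rbuf) < n:
            self._fill()
        out, self.rbuf = self.rbuf[:n], self.rbuf[n:]
        return out

    def next_object(self):
        """Return (oid, type, data), or (name, 'missing', None)."""
        header = self._readline().decode("ascii")
        if header.endswith(" missing"):
            return header[: -len(" missing")], "missing", None
        oid, typ, size = header.split(" ")
        data = self._readn(int(size))
        self._readn(1)   # each object is followed by a single newline
        return oid, typ, data
```

Because unconsumed bytes live in `rbuf`, a caller can inspect how much data is already buffered before deciding to wait on the descriptor again.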
2020-06-13 git: async: flatten the inflight array
Small array refs have considerable overhead in Perl, so reduce AV/SV overhead and instead allow the inflight array to grow twice as large.
2020-05-06 git: warn on ->cat_async callback errors
This will help us track down bugs in our own code when it comes to missing error checking.
2020-04-29 git: various minor speedups
While testing performance improvements elsewhere, I noticed some micro-optimizations could give a small ~2-3% speedup in my test using the git async API to parse a large inbox. The `read' perlfunc already has read-in-full behavior (unless git is killed unexpectedly), so there's no point in using a loop. SearchIdxShard in the parallel v2 indexing code path never looped on `read', either. Furthermore, we can avoid method dispatch overhead on ->getline and ->print by using `readline' and `print' as ops which can be resolved during the Perl compilation phase. Finally, avoid passing the IO handle around as a parameter, since avoiding hash lookups with a local variable has its own costs in stack and refcount bumping. Best of all, there's less code :>
2020-04-05 git: reduce stat buffer storage overhead
The stat() array is a whopping 480 bytes (on x86-64, Perl 5.28), while the new packed representation of two 64-bit doubles as a scalar is "only" 56 bytes. This can add up when there are many inboxes. Just use a string comparison on the packed representation. Some 32-bit Perl builds (IIRC OpenBSD) lack quad support, so doubles were chosen for pack() portability.
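The packed-representation idea can be illustrated with a short sketch. It is written in Python rather than the project's Perl, and both the helper names (`stat_key`, `changed`) and the choice of ctime and size as the two tracked stat fields are assumptions made for illustration.

```python
import os
import struct


def stat_key(path):
    """Pack two stat fields into one 16-byte string of native doubles."""
    st = os.stat(path)
    # Doubles rather than 64-bit integers, mirroring the portability
    # argument above; which two fields to track is an assumption here.
    return struct.pack("dd", st.st_ctime, float(st.st_size))


def changed(path, old_key):
    """Detect a change with a plain bytes comparison, no unpacking."""
    return stat_key(path) != old_key
```

A caller would keep only the small packed key per file and call `changed(path, key)` later, instead of holding on to a full stat result.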
2020-03-04 git: remove POSIX::dup2 import
We rely on spawn/popen_rd for redirects, nowadays.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-13 use popen_rd for bidirectional pipes
popen_rd accepts arbitrary redirects, so we can reuse its code to set up the pipe end we want to read, saving each caller a few lines of code compared to calling pipe+spawn.
2020-01-13 git: packed_bytes: use GLOB_NOSORT
File::Glob is loaded by perl for the "glob()" op, anyways, so call bsd_glob with GLOB_NOSORT to avoid needless sorting of the output.
2020-01-13 git: modified: don't slurp `rev-parse --branches'
While v1 inboxes typically only have one branch, code repositories may have dozens or even hundreds. Slurping those into memory is a waste.
2020-01-11 spawn (and thus popen_rd) die on failure
Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-11 git: remove ->commit_title method
We haven't used it in SolverGit, yet, and I'll be reworking it to work with ->cat_async, instead.
2020-01-11 git: ->modified uses cat_async
While v1 inboxes are typically only a single branch, coderepos will have many branches and being able to pipeline requests to "git cat-file --batch" can help us mask seek times.
2020-01-11 allow HTTP_HOST to be '0' via defined() checks
'0' is a valid value for HTTP_HOST, and maybe some folks will want to hit that as port 80 where the HTTP client won't send the ":$PORT" suffix.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2019-12-30 spawn: allow passing GLOB handles for redirects
We can save callers the trouble of {-hold} and {-dev_null} refs as well as the trouble of calling fileno().
2019-12-26 git: allow async_cat to pass arg to callback
This allows callers to avoid allocating several KB for every call to ->async_cat.
2019-12-12 git: async batch interface
This is a transitionary interface which does NOT require an event loop. It can be plugged into current synchronous code without major surgery. It allows HTTP/1.1 pipelining-like functionality by taking advantage of predictable and well-specified POSIX pipe semantics by stuffing multiple git cat-file requests into the --batch pipe. With xt/git_async_cmp.t and GIANT_GIT_DIR=git.git, the async interface is 10-25% faster than the synchronous interface since it can keep the "git cat-file" process busier. This is expected to improve performance on systems with slower storage (but multiple cores).
2019-10-22 git: remove src_blob_url
This was intended for solver, but it's unused since commit 915cd090798069a4 ("solver: switch patch application to use a callback")
2019-09-14 tmpfile: give temporary files meaningful names
Although we always unlink temporary files, give them a meaningful name so that we can still make sense of the pre-unlink name when using lsof(8) or similar tools on Linux.
2019-09-09 run update-copyrights from gnulib for 2019
2019-07-08 ds: use WNOHANG with waitpid if inside event loop
While we're usually not stuck waiting on waitpid after seeing a pipe EOF or even triggering SIGPIPE in the process (e.g. git-http-backend) we're reading from, it MAY happen and we should be careful to never hang the daemon process on waitpid calls. v2: use "eq" for string comparison against 'DEFAULT'
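A non-blocking reap of this kind can be sketched as follows, in Python rather than the project's Perl. The helper name `reap_nonblocking` and its return convention are hypothetical; `os.waitpid` with `os.WNOHANG` is the standard POSIX wrapper.

```python
import os


def reap_nonblocking(pid):
    """Return the child's exit status if it has exited, else None."""
    done, status = os.waitpid(pid, os.WNOHANG)
    if done == 0:
        return None   # still running; the event loop can retry later
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)   # killed by a signal
```

An event loop would call this when it sees EOF on the child's pipe and again on a later tick if it returns `None`, so the daemon never sits inside a blocking wait.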
2019-06-14 Merge remote-tracking branch 'origin/manifest' into next
* origin/manifest:
  git: ensure ->modified returns an integer
  www: support $INBOX/git/$EPOCH.git for v2 cloning
  www: wire up /$INBOX/manifest.js.gz, too
  wwwlisting: generate grokmirror-compatible manifest.js.gz
  wwwlisting: allow hiding entries from manifest
2019-06-14 git: remove cat_file sub callback interface
We weren't using it, and in retrospect, it makes no sense to use the cat_file API for giant responses which can't be read quickly with minimal context-switching (or sanely fit into memory for Email::Simple/Email::MIME). For giant blobs which we don't want slurped in memory, we'll spawn a short-lived git-cat-file process like we do in ViewVCS. Otherwise, monopolizing a git-cat-file process for a giant blob is harmful to other PSGI/NNTP users. A better interface is coming which will be more suitable for batch processing of "small" objects such as commits and email blobs.
2019-06-10 git: ensure ->modified returns an integer
We don't want to serialize timestamps as strings to JSON. I only noticed this bug on a 32-bit system.
2019-06-05 tighten up digit matches to ASCII for git output
While I don't expect git to suddenly start spewing non-ASCII digits in places I'd expect ASCII, this would make things easier for future hackers and reviewers.
2019-06-01 git: drop the deleted err_c file
No reason to leave that (usually) empty file open after killing off "cat-file --batch-check". This wasn't an unbounded leak, though, as respawning the --batch-check process would've clobbered the old err_c file.
2019-06-01 git: unconditional expiry
A constant stream of traffic to either httpd/nntpd would mean git-cat-file processes never expire. Things can go bad after a full repack, as a full repack will unlink old pack indices and git-cat-file does not currently detect unlinked files. We could do something complicated by recursively stat-ing objects/pack of every git directory and alternate; but that's probably not worth the trouble compared to occasionally restarting the cat-file process. So simplify the code and let httpd/nntpd expire them periodically, since spawning a "git-cat-file --batch" process isn't too expensive. We already spawn for every request which hits git-http-backend, cgit, and git-apply. In the future, we may optionally support the Git::Raw module to avoid IPC; but we must remain careful to not leave lingering FDs open to unlinked files after repack.
2019-05-22 git: workaround old git-rev-parse(1) (--git-path)
git < 2.5.0 was missing --git-path support. This means any users relying on some rare environment variables will need git 2.5.0+