about summary refs log tree commit homepage
path: root/lib/PublicInbox/LeiMirror.pm
DateCommit message (Collapse)
2021-10-15lei + ipc: simplify process reaping
Simplify our APIs and force dwaitpid() to work in async mode for all lei workers. This avoids having lingering zombies for parallel searches if one worker finishes soon before another. The old distinction between "old" and "new" workers was needlessly complex, error-prone, and embarrasingly bad. We also never handled v2:// writers properly before on Ctrl-C/Ctrl-Z (SIGINT/SIGTSTP), so add them to @WQ_KEYS to ensure they get handled by $lei when appropropriate.
2021-10-15lei: TSTP affects all curl and related subprocesses
By relying more on pgroups for remaining remaining processes, this lets us pause all curl+tail subprocesses with a single kill(2) to avoid cluttering stderr. We won't bother pausing the pigz/gzip/bzip2/xz compressor process not cat-file processes, though, since those don't write to the terminal and they idle soon after the workers react to SIGSTOP. AutoReap is hoisted out from TestCommon.pm. CLONE_SKIP is gone since we won't be using Perl threads any time soon (they're discouraged by the maintainers of Perl).
2021-10-15lei: give workers their own process group
This lets users Ctrl-Z from their terminal to pause an entire git-clone process hierarchy.
2021-10-14lei add-external --mirror: respect client umask
While lei is intended for non-public mail and runs umask(077) by default, externals are one area which can safely defer to the user's umask. Instead of sending it unconditionally with every command, only have lei-daemon request it when necessary.
2021-10-14clone+fetch: respect umask for all downloaded files
Since public inboxes are usually intended to be public, the File::Temp default permission of 0600 is wrong. Just respect the user's umask in this case as git-clone does. This doesn't work for "lei add-external --mirror", yet; but it will...
2021-10-13lei: use standard warn() in more places
warn() is easier to augment with context information, and frankly unavoidable in the presence of 3rd-party libraries we don't control.
2021-10-12www: _/text/config/raw Last-Modified: is mm->created_at
This allows IMAP mirrors to keep UIDVALIDITY synchronized (and "LIST ACTIVE.TIMES" in NNTP). "lei add-external --mirror" will automatically set it, as will the combination of public-inbox-clone + public-inbox-index. This avoids the need for extra endpoints or config entries, at least...
2021-09-25lei add-external: split into separate file
Also was written before we had auto-loading and rarely used.
2021-09-24clone|fetch|--mirror: cull manifest in partial mirrors
This makes it easier for users to enable fetching on a previously read-only epoch. Prior to this change, users were required to delete manifest.js.gz in addition to adding the writable bit. Now, they just have to "chmod +w $EPOCH_DIR".
2021-09-24clone|--mirror: fix and test against pre-manifest WWW
There may still be pre-manifest.js.gz versions of PublicInbox::WWW. running and serving v2 inboxes. Since $INBOX_URL/manifest.js.gz was not understood, it was assumed to be a Message-ID and 301-ed to "$INBOX_URL/manifest.js.gz/" with a trailing slash, so our 404 checks were invalid. Update our fallbacks to deal with 301 by catching JSON decoding errors to trigger HTML scraping. For HTML parsing, be sure to not be fooled by potential user-generated content and only scan the part after the last <hr>. We also need to avoid propagating $? from curl unnecessarily when we can continue safely. Finally, update v2mirror.t with tests to use PublicInbox::WWW from our "v1.1.0-pre1" tag to ensure these code paths get tested
2021-09-24clone|--mirror: support --epoch=RANGE for partial clones
Partial (v2) clones should be useful addition for users wanting to conserve storage while having fast access to recent messages. Continuing work started in 876e74283ff3 (fetch: ignore non-writable epoch dirs, 2021-09-17), this creates bare, read-only epoch git repos. These git repos have the remotes pre-configured, but does not fetch any objects. The goal is to allow users to set the writable bit on a previously-skipped epoch and start fetching it. Shell completion support may not be necessary given how short the epoch ranges are, here. Cc: Luis Chamberlain <mcgrof@kernel.org> Link: https://public-inbox.org/meta/20210917002204.GA13112@dcvr/T/#u
2021-09-22lei: drop redundant WQ EOF callbacks
Redundant code is noise and therefore confusing :<
2021-09-15fetch|clone|--mirror: shorten paths for progress output
The full pathname for "curl -o ..." was too noisy and confusing. Reduce confusion by adding the ".tmp" suffix and relying on "-C". We'll also avoid displaying "-C" in run_reap() and rely on "--git-dir=" with "git fetch" to display progress for users.
2021-09-15clone|fetch|--mirror: add convenience $INBOX_DIR/Makefile
Since the beginning of time, I've been dropping Makefiles in $INBOX_DIR (and above hiearchies) to organize groups of commands. make(1) is widely available in various flavors and a familiar tool for our target audience. It is easy to run in the right directory, typically has built-in shell completion, and doesn't silently ignore errors by default like Bourne shell.
2021-09-15multi_git: hoist out common epoch/alternates handling
IMHO, this greatly improves code sharing and organization between v2, extindex, and lei/store. Common git-related logic for these is lightly-refactored and easier to reason about. The impetus for this big change was to ensure inboxes created+managed by public-inbox-{clone,fetch} could have alternates and configs setup properly without depending on SQLite (via V2Writable). This change does that while making old code shorter and better factored.
2021-09-12fetch: drop 304 Not Modified support, simplify comparisons
Timestamp comparisons only have 1 second granularity, which isn't nearly enough for our test cases, and probably not for real world use for "git send-email" bursts and fast SMTP servers. We'll continue to check modification times inside the manifest, though, in case an extremely rare SHA-1 collision is found...
2021-09-12fetch: fix half-baked v1 manifest.js.gz handling
The v1 code path was totally half-baked after the change to use manifest.js.gz :x Fixes: ffb7fbda6869db4b ("fetch: use manifest.js.gz for v1")
2021-09-12fetch: use manifest.js.gz for v1
This is gentler to the remote HTTP server in the no-op case and will allow client migrations to some v2-ish format without forcing the client to redownload everything.
2021-09-12clone|lei_mirror: write description in mirrors
Instead of generic "Unnamed repository" or "missing" messages, show "mirror of $URL" since it seems like a better default when creating a mirror.
2021-09-12new public-inbox-{clone,fetch} commands
Setting up and maintaining git-only mirrors of v2 inboxes is complex since multiple commands are required to clone and fetch into epochs. Unlike grokmirror, these commands do not require any configuration. Instead, they rely on existing git config files and work like "git clone --mirror" and "git fetch", respectively. Like grokmirror, they use manifest.js.gz, but only on a per-inbox basis so users won't have to clone every inbox of a large instance nor edit config files to include/exclude inboxes they're interested in.
2021-09-12lei_mirror: fix error message
We're using rename(2) rather than link(2)
2021-09-12lei_mirror: simplify error reporting
Slowly transitioning to using die() more, which hopefully improves code reusability between lei and non-lei parts of our code.
2021-09-11lei: pass client stderr to git-config in more places
This should improve the users' chances of seeing errors in various git config files we use.
2021-09-10lei add-external --mirror: quiet unlink error on ENOENT
If the mirror.done file doesn't exist for unlink, it's because we already got another error, so don't confuse users by noting an unlink error since the ENOENT is expected in the face of other errors.
2021-09-10lei add-external --mirror: deduce paths for PSGI mount prefixes
The current manifest.js.gz generation in WWW doesn't account for PSGI mount prefixes (and grokmirror 1.x appears to work fine). In other words, <https://yhbt.net/lore/lkml/manifest.js.gz> currently has keys like "/lkml/git/0.git" and not "/lore/lkml/git/0.git" where "/lore" is the PSGI mount prefix. This works fine with the prefix accounted for in my grokmirror (1.x) repos.conf like this: site = https://yhbt.net/lore/ manifest = https://yhbt.net/lore/manifest.js.gz Adding the PSGI mount prefix in manifest.js.gz is probably not desirable since it would force the prefix into the locally cloned path by grokmirror, and all the cloned directories would have the remote PSGI mount prefix prepended to the toplevel. So, "lei add-external --mirror" needs to account for PSGI mount prefixes by deducing the prefix based on available keys in the manifest.js.gz hash table.
2021-06-08lei: generalize auxiliary WQ handling
op_wait_event is now more lei-specific since we no longer have to care about oneshot and use a synchronous loop. {ikw} (import-keywords) started a trend, but LeiPmdir (parallel Maildir) is an upcoming WQ class that will follow this idea. Eventually, {l2m} usage may be updated to follow this, too.
2021-05-03lei: simplify workers_start API
In most cases, we just name the worker process based on the command. The only change is for LeiMirror vs "lei add-external --mirror", but I doubt it matters.
2021-04-28lei: simple WQ workers use {wq1} field
This lets us share more code and reduces cognitive overhead when it comes to picking names (because {lsss} was ridiculous). We'll need to ensure the first error set in lei is the actual error we exit with, otherwise things can get confusing and errors may get lost.
2021-04-27lei: standardize on _lei_wq_eof callback for workers
Simplify our internals a little bit.
2021-04-19config: git_config_dump blesses
I don't know if it's worth it to sub (or super)class PublicInbox::Config into something more generic for lei, but this change simplifies a good chunk of lei code that reuses the public-inbox config parsing.
2021-03-28lei: simplify PktOp callers
Provide a consistent ->op_wait_event method instead of forcing callers to loop (or not) at each callsite. This also avoid a leak possibility by avoiding circular references.
2021-03-25lei_mirror: don't show success on failure
While we were exiting with a error code, showing a successful "# mirrored $URL" message is misleading and wrong. Don't show success until everything is complete and the config is written.
2021-03-24lei_mirror: fix circular reference
All of our $lei->workers_start callers can simply rely on that wrapper to do the right thing and pass fields to ->wq_worker_start children, only. This could manifest as a unbound memory growth if somebody is constantly mirroring, and was causing tests to get stuck when experimenting with a persistent lei-daemon for the entire test suite.
2021-03-24lei: improve management around short-lived workers
Instead of creating a short-lived circular reference, ensure they don't exist in the first place. Note the following changes to hold an extra ref to $sto: - $self->_lei_store(1)->write_prepare($self); + my $sto = $self->_lei_store(1); + $sto->write_prepare($self); I'm not a perlguts expert, but I actually wanted to switch to the one-line version for LeiImport, but xt/lei-auth-fail.t was getting stuck for some reason. It seems the extra ref to the LeiStore ($sto) object is necessary.
2021-03-24lei: hide *_atfork_child from command-line
Otherwise we could get non-sensical results if somebody tries running "lei atfork_child" from the command-line.
2021-02-24lei: avoid needless env passing to subcommands
We already localize %ENV before calling dispatch(), so it's needless overhead in spawn() to be checking env for undef values in those cases.
2021-02-18lei: consolidate the bulk of the IPC code
The backends for "lei add-external --mirror", "lei convert", and "lei import" all share a similar pattern for spawning background workers. Hoist out the common parts to slim down our code base a bit. The LeiXSearch and LeiToMail workers for "lei q" remains a the odd duck due to the deep pipelining and parallelization.
2021-02-08lei q: improve remote mboxrd UX + MUA
For early MUA spawners using lock-free outputs, we we need to on the startq pipe to silence progress reporting. For --augment users, we can start the MUA even earlier by creating Maildirs in the pre-augment phase. To improve progress reporting for non-MUA (or late-MUA) spawners, we'll no longer blindly append "--compressed" to the curl(1) command when POST-ing for the gzipped mboxrd. Furthermore, we'll overload stringify ('""') in LeiCurl to ensure the empty -d '' string shows up properly. v2: fix startq waiting with --threads mset_progress is never shown with early MUA spawning, The plan is to still show progress when augmenting and deduping. This fixes all local search cases. A leftover debug bit is dropped, too
2021-02-07ipc: wq_do => wq_io_do
We will have a ->wq_do that doesn't pass FDs for I/O.
2021-02-07lei add-external: handle interrupts with --mirror
This also updates lei_xsearch to follow the same pattern for stopping curl(1) and tail(1) processes it spawns.
2021-02-07lei: add-external --mirror support
This can be useful for users who want to clone and mirror an existing public-inbox. This doesn't have update support, yet, so users will need to run "git fetch && public-inbox-index" for now.