path: root/script/public-inbox-init
Date Commit message
2024-04-03 treewide: avoid getpid() for OnDestroy checks
getpid() isn't cached by glibc nowadays and system calls are more expensive due to CPU vulnerability mitigations. To ensure we switch to the new semantics properly, introduce a new `on_destroy' function to simplify callers. Furthermore, most OnDestroy correctness is often tied to the process which creates it, so make the new API default to being guarded against running in subprocesses. For cases which require running in all children, a new PublicInbox::OnDestroy::all call is provided.
2023-11-03 move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-10-18 init: use autodie to reduce distractions
This hurts startup time a bit, but our tests use run_script by default and I don't think normal users call -init enough to care.
2023-10-18 init: drop extraneous `+'
It's actually valid Perl syntax, but still confusing to look at. Fixes: add90b9504f4 ("support -C (chdir) for most non-daemon commands")
2023-10-18 use read_all in more places to improve safety
`readline' ops may not detect errors on partial reads. This saves us some code to reduce cognitive overhead for readers. We'll also support reusing a destination buffer so it can work more nicely with existing code.
2023-08-28 public-inbox-init: honor umask when creating config file
Creating config 0600 disregarding umask breaks scenarios where daemons run with credentials different from config owner (but need to read the config). File::Temp defaults to 0600, which is unsuitable for the recommended/typical scenario of daemons running unprivileged and with UID different from $PI_CONFIG owner, as the daemons need to read $PI_CONFIG. Respecting umask might end up creating world-unreadable config, too, but for people who use such umask that's expected behavior.
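The umask handling described above can be sketched in Python (an analogue, not the project's Perl). `tempfile.mkstemp` always creates files with mode 0600, much like the File::Temp default mentioned in the message, so the sketch widens the mode to what a plain open(2) would have produced under the current umask. The function names and the `pi-config-` prefix are hypothetical.

```python
import os
import tempfile


def current_umask():
    """Read the process umask; os.umask() can only read by setting, so restore it."""
    old = os.umask(0)
    os.umask(old)
    return old


def create_config_honoring_umask(directory, final_name, content):
    """Write content to a temp file, apply the umask-derived mode, rename into place."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="pi-config-")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(content)
        # mkstemp created the file 0600; use the mode open(2) would have given
        os.chmod(tmp_path, 0o666 & ~current_umask())
        final_path = os.path.join(directory, final_name)
        os.rename(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

With a umask of 022 the resulting file is 0644 and readable by an unprivileged daemon; with 077 it stays 0600, which is the "expected behavior" case the message mentions.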
2021-11-02 init: respect umask when creating description
I noticed a description for a new inbox had st_mode=0600.
2021-09-15 support -C (chdir) for most non-daemon commands
Because make(1), git(1), tar(1) all support -C in this form, as do our newer commands such as lei, public-inbox-{clone,fetch}.
2021-09-12 init: set a useful description
"Unnamed repository" for v1 inboxes was misleading, and having a non-existent description for v2 was equally annoying, so set a short description based on the primary address. We remove descriptions when setting up new test inboxes to preserve the behavior of the t/lei-mirror.t test case.
2021-08-11 treewide: use *nix-specific dirname regexps
None of our code elsewhere accounts for non-*nix pathnames and it's not worth our time to start. So stop wasting CPU cycles giving the illusion that we'd care about non-*nix pathnames.
2021-07-25 init: support git <2.30 for "-c KEY=VALUE" args
It turns out `--fixed-value' is a relatively new git-config(1) feature in git 2.30+ (December 2020). So use the quotemeta perlop for now since it seems compatible-enough for POSIX ERE used by git.
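As a rough Python illustration of the escaping idea (not the project's Perl code): without `--fixed-value', the optional value-pattern argument of `git config --replace-all' is treated as a regular expression, so a literal value has to be escaped first. `quotemeta_ascii` below imitates Perl's quotemeta for ASCII input only, and both function names and the anchoring with `^`/`$` are assumptions made for this sketch.

```python
import re


def quotemeta_ascii(value):
    """Backslash-escape every character outside [A-Za-z0-9_] (ASCII input assumed)."""
    return re.sub(r"([^A-Za-z0-9_])", r"\\\1", value)


def git_config_replace_args(key, value):
    """Build an argv that replaces only entries whose value matches `value` literally."""
    pattern = "^" + quotemeta_ascii(value) + "$"
    return ["git", "config", "--replace-all", key, value, pattern]
```

The returned list is meant to be passed to something like `subprocess.run` without a shell, so the backslashes reach git unmodified.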
2021-07-22 init: allow arbitrary key-values via -c KEY=VALUE
This won't blindly append identical key=values, but allows specifying multiple, different key=value pairs as long as the values are different.
2021-03-28 treewide: shorten temporary filename
File::Temp only requires four 'X' characters (unlike mkstemp(3), which requires six). So only give it 4 to avoid an 80-column violation and maybe save metadata space on FSes.
2021-02-07 init: lowercase -j for --jobs
This is taken from common implementations of make(1) and only affected people using the command-line help output.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01 on_destroy: support PID owner guard
Since we'll be forking for Xapian indexing and maybe other places, having a simple guard in place to ensure OnDestroy doesn't unexpectedly unlink files or similar is a safer option.
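A minimal Python sketch of such a PID owner guard follows. It is not the PublicInbox::OnDestroy implementation: the class below is a stand-in that uses an explicit close() and context manager where Perl relies on DESTROY, and the `owner_pid` parameter exists only to make the guard easy to demonstrate.

```python
import os


class OnDestroy:
    """Run a callback once on close(), but only in the process that created the guard."""

    def __init__(self, callback, *args, owner_pid=None):
        self._callback = callback
        self._args = args
        # Record the creating process; overridable purely for demonstration.
        self._owner_pid = os.getpid() if owner_pid is None else owner_pid
        self._done = False

    def close(self):
        if self._done:
            return
        self._done = True
        if os.getpid() != self._owner_pid:
            return  # not the owner (e.g. a forked child): skip the cleanup
        self._callback(*self._args)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

A forked child that inherits the guard sees a different os.getpid() and therefore never runs the callback, so something like `OnDestroy(os.unlink, lock_path)` removes the file only from the parent.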
2021-01-01 init: remove embedded UnlinkMe package
PublicInbox::OnDestroy can do the same thing
2021-01-01 spawn: move run_die here from PublicInbox::Import
It seems like a more logical place for it, but we'll favor the newly-added xsys_e() in tests for BAIL_OUT use.
2020-12-28 check defined return value for localized slurp errors
Reading from regular files (even on STDIN) can fail when dealing with flakey storage.
2020-12-26 init: use the return value of rel2abs_collapsed
Fixes: 9fcce78e40b0a7c6 ("script/public-inbox-*: favor caller-provided pathnames")
2020-12-21 use rel2abs_collapsed when loading Inbox objects
We need to canonicalize paths for inboxes which do not have a newsgroup defined, otherwise ->eidx_key matches can fail in unexpected ways.
2020-12-20 script/public-inbox-*: favor caller-provided pathnames
We'll try to avoid calling Cwd::abs_path and use File::Spec->rel2abs instead, since abs_path will resolve symlinks the user specified on the command-line. Unfortunately, ->rel2abs still leaves "/.." and "/../" uncollapsed, so we still need to fall back to Cwd::abs_path in those cases. While we are at it, we'll also resolve inboxdir from deep inside v2 directories instead of misdetecting them as v1 bare git repos. In any case, stop matching directories by name and instead rely on the unique combination of st_dev + st_ino on stat() as we started doing in the extindex code.
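The st_dev + st_ino comparison can be illustrated with a short Python sketch (an analogue, not the project's code; the function names are hypothetical). Two pathnames refer to the same directory when stat() reports the same device and inode pair for both, regardless of how each path is spelled or whether it goes through symlinks.

```python
import os


def dir_identity(path):
    """Return the (st_dev, st_ino) pair identifying what `path` refers to."""
    st = os.stat(path)  # follows symlinks, like stat(2)
    return (st.st_dev, st.st_ino)


def same_dir(path_a, path_b):
    """True if both pathnames resolve to the same directory on disk."""
    return dir_identity(path_a) == dir_identity(path_b)
```

This keeps the user-supplied spelling of a path for display and configuration while still detecting that a symlinked name and its target are one inbox.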
2020-09-02 init+convert: create non-existing directory hierarchies
Following "git init" as an example, we'll create every parent path up to the one specified, instead of attempting to continue on when Cwd::abs_path returns `undef'.
2020-09-02 script/*: fold $usage into $help, support `-h' instead of -?
`-h' doesn't conflict with anything, and some users (including git users) may be more accustomed to using it rather than the rarely-seen-outside-of-Getopt::Long `-?' switch. We can also rely on the GetOptions() function to emit a proper error message instead of just "bad command-line args".
2020-08-20 init+index: support --skip-docdata for Xapian
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-20 init: drop -N alias for --skip-artnum
It may be too easily confused for --newsgroup or --ng. This is too rarely used and never made it into a release, so it should be fine.
2020-08-20 init: support --newsgroup option
We can reduce the need to edit the config file for NNTP group names this way.
2020-08-20 init: support --help and -?
And speed those up with some lazy loading, too.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-10 avoid File::Temp::tempfile in more places
We can use open(..., undef) natively in Perl in t/import.t. In places where we need a pathname, the File::Temp OO API gives us auto-unlinking for free.
2020-07-26 t/init.t: don't modify ~/.public-inbox/
Tests for failures should not leave junk temporary files lying around in a user's ~/.public-inbox/. On a side note, I'm not sure if PI_DIR is or was ever necessary. It's never been documented, so perhaps using $HOME for this is better...
2020-07-17 config: reject `\n' in `inboxdir'
"\n" and other characters requiring quoting and/or escaping in $GIT_DIR/objects/info/alternates were not supported in git 2.11 and earlier; nor do they seem supported at all in libgit2. This will allow us to support sharing git-cat-file or similar endpoints across multiple inboxes via alternates. This breaks an existing use case for anybody wacky enough to put `\n' in the `inboxdir' pathname; but I doubt this affects anybody.
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-23 init: refer to inboxes as "inbox" or "inboxes" in errors
Since V2 uses multiple git repositories, stop using the word "repo" when referring to inboxes.
2020-06-23 init: add -j / --jobs parameter
On a powerful (by my standards) machine with 16GB RAM and a 7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in git) LKML snapshot from Sep 2019 did not finish after 7 days with the default number (3) of Xapian shards (`--jobs=4') and `--batch-size=10m'. Indexing started off fast, but got progressively slower as the contents of the inbox (including Xapian + SQLite DBs) could no longer be cached by the kernel. Once the on-disk size increased, HDD seek contention between the Xapian shard workers slowed the process down to a crawl. With a single shard, it still took around 3.5 days to index on the HDD. That's not good, but it's far better than not finishing after 7 days. So allow unfortunate HDD users to easily specify a single shard on public-inbox-init. For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II bus on the same machine indexes that same snapshot of LKML in ~7 hours with 3 shards and the same 10m batch size. In the past, a higher-end consumer-grade MLC SSD on similar hardware indexed a similarly-sized data set in ~4 hours.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-27 init: use Import::run_die instead of system()
We already load PublicInbox::Import via PublicInbox::InboxWritable, so it's not an extra module to load. This can give us a slight speedup in tests.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2019-11-16 init: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script are eval'ed into a sub. We also need to rely on ->DESTROY instead of END{} to unlink the lock file on sub exit.
2019-10-16 config: support "inboxdir" in addition to "mainrepo"
"mainrepo" was a bad name and an artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-10-05 init: implement locking
First, we use flock(2) to wait on parallel public-inbox-init(1) invocations while we make multiple changes using git-config(1). This flock allows -init processes to wait on each other if using reasonable POSIX filesystems. Then, we also need a git-config(1)-compatible lock to prevent user-invoked git-config(1) processes from clobbering our changes while we're holding the flock.
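The two-level locking described above might look roughly like this in Python (a sketch under stated assumptions, not the script's Perl code). git-config(1) takes its lock by exclusively creating `<file>.lock`, so holding that file makes a concurrent user-invoked git-config fail instead of clobbering changes. The `.flock` and `.tmp` names, the function name, and the approach of editing a scratch copy that is renamed into place are assumptions made for illustration.

```python
import fcntl
import os
import shutil
from contextlib import contextmanager


@contextmanager
def locked_config_edit(config_path):
    """Yield a scratch copy of config_path to edit; install it on clean exit."""
    flock_path = config_path + ".flock"    # hypothetical flock(2) target
    git_lock_path = config_path + ".lock"  # lock file name git-config(1) uses
    scratch_path = config_path + ".tmp"    # hypothetical scratch file name
    with open(flock_path, "a") as flock_fh:
        # 1) Wait for other cooperating processes holding the flock.
        fcntl.flock(flock_fh, fcntl.LOCK_EX)
        # 2) O_EXCL fails if a user-invoked git-config already holds the lock.
        lock_fd = os.open(git_lock_path,
                          os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o666)
        os.close(lock_fd)
        try:
            if os.path.exists(config_path):
                shutil.copyfile(config_path, scratch_path)
            else:
                open(scratch_path, "w").close()
            yield scratch_path
            os.rename(scratch_path, config_path)  # atomic replace on POSIX
        finally:
            if os.path.exists(scratch_path):
                os.unlink(scratch_path)
            os.unlink(git_lock_path)
        # The flock is released when flock_fh is closed by the with block.
```

Edits inside the block would target the scratch path (for example via `git config -f`), since the real config's lock file is already held.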
2019-10-05 init: favor --skip-epoch instead of --skip
Since I intend to add support for --skip-artnum, disambiguating the long option name makes sense. We'll support --skip indefinitely for compatibility.
2019-09-09 run update-copyrights from gnulib for 2019
2019-05-23 doc: various updates to reflect current state
-index documentation avoids redundant v1 information and refers readers to appropriate v1/v2 manpages. Search::Xapian can also be optional, now, as only the PSGI search interface uses it. Favor "INBOX_DIR" where appropriate, since "REPO_DIR" can be confused for code repos which we also support. XAPIAN_FLUSH_THRESHOLD is documented for all relevant bulk commands.
2019-05-23 v1writable: retire in favor of InboxWritable
In retrospect, introducing V1Writable was unnecessary and InboxWritable->importer is in a better position to abstract away differences between v1 and v2 writers. So teach InboxWritable to initialize inboxes and get rid of V1Writable.
2019-05-22 init: preserve permissions for git prior to 2.1.0
"git config" did not preserve permissions of the config file it modifies prior to git 2.1.0, so workaround that.
2019-05-15 admin: improve warnings and errors for missing modules
Since we lazy-load Xapian now, some errors may become more cryptic or buried. Try to improve that by making Admin show better errors.
2019-05-15 lazy load Xapian and make it optional for v2
More tests work without Search::Xapian, now. Usability issues still need to be fixed.
2019-05-14 v1writable: new wrapper which is closer to v2writable
Import initialization is a little strange from history, but we also can't change it too much because it's technically a public API which external code may rely on... And we may need to support v1 repos indefinitely. This should make it easier to write tests for both formats.
2018-12-28 init: allow --skip of old epochs for -V2 repos
This allows archivists to publish incomplete archives with newer mail while allowing "0.git" (or "1.git" and so on) epochs to be added-after-the-fact (without affecting "git clone" followers). A reindex will be necessary for Xapian and SQLite to catch up once the old epochs are added; but the reindexing code is also capable of tolerating missing epochs.