about summary refs log tree commit homepage
path: root/script/public-inbox-convert
DateCommit message (Collapse)
2023-11-03move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-11-03treewide: use ->close to call ProcessIO->CLOSE
This will open the door for us to drop `tie' usage from ProcessIO completely in favor of OO method dispatch. While OO method dispatches (e.g. `$fh->close') are slower than normal subroutine calls, it hardly matters in this case since process teardown is a fairly rare operation and we continue to use `close($fh)' for Maildir writes.
2023-10-18convert: use read_all to simplify error checks
There wasn't a need to loop anyways with Perl `read' since the default PerlIO layer will retry.
2023-10-11import: switch to Unix stream socket for fast-import
We use fewer file descriptors and fewer lines of code this way. I'm not aware of any place we rely on POSIX pipe semantics with `git fast-import', and sockets have bigger buffers by default in most cases (even if Linux allows larger pipe buffers).
2023-09-28convert: use ProcessPipe with popen_rd
ProcessPipe->CLOSE will already run waitpid for us and exit on errors, so we can do less, here.
2023-04-07umask: rely on the OnDestroy-based call where applicable
This lets us get rid of some awkwardness around the old API and single-use subroutines while saving us some LoC.
2023-04-06watch: use detect_indexlevel for unconfigured inboxes
I favor leaving the publicinbox.<name>.indexlevel parameter out of config files to make it easier to alter and reduce sources of truth. It worked well in most cases, but public-inbox-watch also needs to detect the indexlevel. Moving the sub to InboxWritable (from Admin) probably makes sense since it's a per-inbox attribute and allows -watch to reuse it.
2023-03-25admin: ensure resolved GIT_DIR is absolute
We'll also support the $base arg of File::Spec->rel2abs since it should make codesearch indexing easier.
2021-09-15support -C (chdir) for most non-daemon commands
Because make(1), git(1), tar(1) all support -C in this form, as do our newer commands such as lei, public-inbox-{clone,fetch}.
2021-09-15multi_git: hoist out common epoch/alternates handling
IMHO, this greatly improves code sharing and organization between v2, extindex, and lei/store. Common git-related logic for these is lightly-refactored and easier to reason about. The impetus for this big change was to ensure inboxes created+managed by public-inbox-{clone,fetch} could have alternates and configs setup properly without depending on SQLite (via V2Writable). This change does that while making old code shorter and better factored.
2021-07-28treewide: s/sequential_shard/sequential-shard/g
The underscore variant was never documented and maintaining the difference between the command-line and internal hash is not worth it.
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-25inboxwritable: delay umask_prepare calls
This simplifies all ->with_umask callers and opens the door for further optimizations to delay/elide process spawning.
2020-12-21use rel2abs_collapsed when loading Inbox objects
We need to canonicalize paths for inboxes which do not have a newsgroup defined, otherwise ->eidx_key matches can fail in unexpected ways.
2020-12-20script/public-inbox-*: favor caller-provided pathnames
We'll try to avoid calling Cwd::abs_path and use File::Spec->rel2abs instead, since abs_path will resolve symlinks the user specified on the command-line. Unfortunately, ->rel2abs still leaves "/.." and "/../" uncollapsed, so we still need to fall back to Cwd::abs_path in those cases. While we are at it, we'll also resolve inboxdir from deep inside v2 directories instead of misdetecting them as v1 bare git repos. In any case, stop matching directories by name and instead rely on the unique combination of st_dev + st_ino on stat() as we started doing in the extindex code.
2020-09-02init+convert: create non-existing directory hierarchies
Following "git init" as an example, we'll create every parent path up to the one specified, instead of attempting to continue on when Cwd::abs_path returns `undef'.
2020-09-02script/*: fold $usage into $help, support `-h' instead of -?
`-h' doesn't conflict with anything, and some users (including git users) may be more accustomed to using it rather than the rarely-seen-outside-of-Getopt::Long `-?' switch. We can also rely on the GetOptions() function to emit a proper error message instead of just "bad command-line args".
2020-08-20init+index: support --skip-docdata for Xapian
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-10convert: set No_COW on copied SQLite files
We'll use our existing logic and use sqlite_backup_from_file, which appeared in 1.39 (along with sqlite_backup_to_file).
2020-08-10convert: check ARGV more correctly
Instead of silently ignoring excessive args, don't let a user specify an extra directory. Furthermore, we'll support the odd case where BOFH wants to name an $INBOX_DIR to be `0' :P
2020-08-10convert: speed up --help
Lazy-loading dependencies speeds up --help by several hundred milliseconds and is a huge step towards user-friendliness.
2020-08-10convert: support new -index options
Converting v1 inboxes from v2 can be a painful experience on HDD. Some of the new options in the CLI or config file make it less painful.
2020-05-20convert: describe the release of fast-import pipes
Upon rereading the code, it wasn't immediately obvious to me why we didn't check for errors with `close($w)' instead of relying on `undef'. So add a comment for the benefit of future readers.
2020-05-19favor readline() and print() as functions
In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and lookup documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.
2020-04-20drop needless `eval {}' around Config->new
It hasn't been needed since commit 089cca37fa036411 ("config: ignore missing config files"). And we actually want to propagate errors when we can't start new processes or if git(1) is missing.
2020-02-08convert: preserve indexlevel on conversions
We don't want to blow up users storage too badly when converting v1 to v2 or break because they don't have Xapian bindings installed.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-02convert: fix --no-index switch
The (currently undocumented) "--no-index" flag did not trigger the V2Writable->done call necessary to make the import successful. Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
2020-02-02convert: shift @ARGV explicitly
Relying on implicit "@_" for shift fails with TestCommon::_run_sub iff GetOptions modifies @ARGV.
2020-02-02convert: remove unused variables capturing :from
Looking at git history, they were never used.
2020-01-31convert: preserve highwater mark from v1 msgmap
If we're reusing the msgmap from a v1 inbox, we also need to ensure the highwater mark doesn't get doubled in the v1->v2 conversion by internally triggering the equivalent of "--reindex" on a fresh v2 inbox. This was needed to convert an indexed v1 inbox which featured messages with multiple Message-IDs in it. Fresh, unindexed clones of v1 inboxes would not have been affected by this.
2020-01-27inbox: add ->version method
This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
2020-01-06treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it it gives us extra compile-time checking.
2019-11-14convert: remove duplicated GetOptions() call
We only need to parse the command-line once.
2019-10-16config: support "inboxdir" in addition to "mainrepo"
"mainrepo" ws a bad name and artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-09-09run update-copyrights from gnulib for 2019
2019-06-05tighten up digit matches to ASCII for git output
While I don't expect git to suddenly start spewing non-ASCII digits in places I'd expect ASCII, this would make things easier for future hackers and reviewers.
2018-05-11convert+compact: fix when running without ~/.public-inbox/config
Some users may not have any public-inboxes configured, especially in tests.
2018-04-20convert: copy description and git config from v1 repo
I noticed I lost a $GIT_DIR/description in a conversion, so we should preserve it. While we're at it, we ought to copy any config in the old repo to the new one. We will need to warn about cloneurl since it's unfortunately not an automatic process to update. Oh well..
2018-04-07convert: support converting with altid defined
public-inbox-convert ought to be 100% lossless, now
2018-04-04v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-01v2: one file, really
We need to ensure there is only one file in the top-level tree at any commit so the "add; remove; add;" sequence on the same message is detected properly. Otherwise, git will not detect the second "add" unless a second message is added to history. Deletes are now stored in "d" (and not "D" or "_/D") at the top-level, now. There's no need to have a "_" to reduce churn as "m" and "d" should never co-exist. It's now lowercased to make it easier-to-distinguish from "D" in git-log output.
2018-03-30v2: respect core.sharedRepository in git configs
Ensure -convert and -compact do not make repositories unreadable on live servers.
2018-03-30convert: avoid redundant "done\n" statement for fast-import
This bug was hidden due to timing problems with eatmydata or running with tmpfs for TMPDIR.
2018-03-29public-inbox-convert: tool for converting old to new inboxes
This should make it easier to let users perform comparisons and migrate to v2 if needed.