From 10f31b26e010243ab919dbafeb6f95c6e30640e9 Mon Sep 17 00:00:00 2001 From: Eric Wong Date: Sat, 22 Apr 2023 10:33:42 +0000 Subject: cindex: rewrite prune (again) for speed With my partial git.kernel.org mirror, this brings a full prune down from ~75 minutes to under 5 minutes using git 2.19+. This speedup even applies to users on slow storage (rotational HDD). First off, xapian-delve(1) is nearly 10x faster for dumping boolean terms by prefix than the equivalent Perl code with Xapian bindings. This performance difference is critical since we need to check over 5 million commits for pruning a partial git.kernel.org mirror. We can use sed(1) and sort(1) to massage delve output into something suitable for the first comm(1) input. For the second comm(1) input, the output of `git cat-file --batch-check --batch-all-objects' against all indexed git repos with awk(1) filtering provides the necessary output for generating a list of indexed-but-no-longer accessible commits. sed(1) and awk(1) are POSIX standard tools which can be roughly 2x faster than equivalent Perl for simple filters, while sort(1) is designed to handle larger-than-memory datasets efficiently (unlike the `sort' perlop). With slow storage and git <2.19, the switch to --batch-all-objects actually results in a performance regression since having git perform sorting results in worse disk locality than the previous sequential iteration by Xapian docid. git 2.19+ users with `--unordered' support benefits from improved storage locality; and speedups from storage locality dwarfs the extra overhead of an extra external sort(1) invocation. Even with consumer-grade SATA-II SSDs, the combo of --unordered and sort(1) provides a noticeable speedup since SSD latency remains a factor for --batch-all-objects. git <2.19 users must upgrade git to get acceptable performance on slow storage and giant indexes, but git 2.19 was released nearly 5 years ago so it's probably a reasonable requirement for performance. The only remaining downside of this change for all users the extra temporary disk space for sort(1) and comm(1); but the speedup provided with git 2.19+ is well worth it. --- MANIFEST | 1 + 1 file changed, 1 insertion(+) (limited to 'MANIFEST') diff --git a/MANIFEST b/MANIFEST index c338f0f4..69730623 100644 --- a/MANIFEST +++ b/MANIFEST @@ -160,6 +160,7 @@ lib/PublicInbox/AdminEdit.pm lib/PublicInbox/AltId.pm lib/PublicInbox/AutoReap.pm lib/PublicInbox/Cgit.pm +lib/PublicInbox/CidxComm.pm lib/PublicInbox/CidxLogP.pm lib/PublicInbox/CmdIPC4.pm lib/PublicInbox/CodeSearch.pm -- cgit v1.2.3-24-ge0c7