From: Eric Wong <e@yhbt.net>
To: Leah Neukirchen <leah@vuxu.org>
Cc: meta@public-inbox.org
Subject: Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Date: Wed, 15 Apr 2020 21:21:06 +0000
Message-ID: <20200415212106.GA6284@dcvr>
In-Reply-To: <87zhbcldi4.fsf@vuxu.org>
Leah Neukirchen <leah@vuxu.org> wrote:
> Hi *,
>
> I wrote a small script today (see below), and maybe it is useful for
> you, too. The task was to incrementally add messages from a
> public-inbox V1 and V2 repo on local disk to a Maildir. Note that
> "ssoma sync" only does V1.
>
> impibe encodes the seen blob hashes into Maildir file names, so only
> new messages are added on runs afterwards. Speed is quite good.
> The only dependencies are git and perl.
>
> This allows having local Maildir mirrors of public-inbox clones, for
> example to use with other Maildir indexers or MUA. It is way faster
> than using NNTP to sync up.
Cool. There's some latency introduced by the NNTP client/server,
and maybe parallelizing/pipelining can help a bit; but raw
access to git will always be faster.
> I don't have much Perl experience, so please tell me any problems with
> my code.
No problem. Nothing really Perl-specific; most of my comments
would apply to other languages, too.
>
> #!/usr/bin/perl -w
> # impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
> #
> # To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
> # all copyright and related or neighboring rights to this work.
> #
> # http://creativecommons.org/publicdomain/zero/1.0/
>
> use v5.16;
> use Sys::Hostname;
> use autodie qw(open close fork);
> use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
>
> if (@ARGV < 2) {
>     die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
> }
>
> my $pi = $ARGV[0];
> my $md = $ARGV[1];
>
> my $hostname = hostname;
> my $flags = "S";
>
> my %have;
>
> for my $mail (glob("$md/cur/*:2*")) {
>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>         $have{$1} = 1;
>     }
> }
That glob is going to be a problem with bigger Maildirs; but
then again, giant Maildirs aren't great at all.

GLOB_NOSORT can alleviate some of the slowness:
https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

opendir + readdir may also be used (probably slower with more
ops, but it would use less memory).

Maildir allows $md/$foo_state_file(s), and I'd probably store
the most recent rev-list/log state there instead (more below).
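
To illustrate the opendir + readdir variant mentioned above, here is
a minimal, untuned sketch. The name scan_have is made up for this
example, and the {40,} quantifier is an assumption on my part to
leave room for longer object IDs:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of impibe): collect the blob hashes
# embedded in Maildir file names (the ",B=<hash>" part) one directory
# entry at a time, without building a sorted list of paths first.
sub scan_have {
    my ($dir) = @_;
    my %have;
    opendir(my $dh, $dir) or die "opendir $dir: $!\n";
    while (defined(my $name = readdir($dh))) {
        # "." and ".." simply fail to match
        $have{$1} = 1 if $name =~ /,B=([0-9a-f]{40,})/;
    }
    closedir($dh);
    return \%have;
}
```

Usage would be something like: my $have = scan_have("$md/cur");
followed by "next if $have->{$blob};" in the delivery loops.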
> my $status = 0;
>
> sub deliver {
>     my ($repo, $blob, $time) = @_;
>     state $delivery = 0;
>
>     $delivery++;
>     # Embed the blob hash into the file name so we can check easily
>     # which blobs we have already.
>     my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
>         $time, $$, $delivery, $hostname, $blob, $flags;
>
>     my $mail = "$md/cur/$name";
>
>     my $pid = fork;
>     if ($pid == 0) {
>         no autodie 'sysopen';
>         sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
>             die "$0: couldn't create $md/cur/$name: $!\n";
>         exec "git", "-C", "$repo", "cat-file", "blob", $blob
>     }
>     elsif (defined $pid) {
>         waitpid $pid, 0;
>         if ($? == 0) {
>             say $mail;
>         } else {
>             $status = 1;
>         }
>     }
It's possible for an MUA to see a partially-written file in
$md/cur/$name with the above.

I seem to recall the correct way to write to Maildirs being:

1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
2) only the MUA is allowed to rename() files from /new/ => /cur/

That would require more globbing (new/ as well as cur/), or
capturing the most recent commit hash and storing it in
$foo_state_file.
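
The tmp/ => new/ steps above could be sketched roughly as follows.
This is an illustration under assumptions, not a drop-in patch:
deliver_atomic is a made-up name, and it takes the message as a
string to stay short, whereas the script would keep streaming from
"git cat-file" by opening STDOUT on the tmp/ path in the child.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Hypothetical helper: write $data to tmp/ first, then link it into new/.
# $name is the base name without the ":2,FLAGS" suffix, since the MUA
# adds flags when it moves a message from new/ to cur/.
sub deliver_atomic {
    my ($md, $name, $data) = @_;
    my ($tmp, $new) = ("$md/tmp/$name", "$md/new/$name");
    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "create $tmp: $!\n";
    print $fh $data or die "write $tmp: $!\n";
    close($fh) or die "close $tmp: $!\n";
    # link() fails if $new already exists; rename() would replace it
    link($tmp, $new) or die "link $tmp => $new: $!\n";
    unlink($tmp) or warn "unlink $tmp: $!\n";
    return $new;
}
```

A reader scanning new/ then only ever sees complete files, because
the name appears there only after close() has succeeded.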
> if (-f "$pi/HEAD") { # V1 format
>     open(my $git_fh, "-|", "git", "-C", $pi,
>          qw(log --reverse --raw --format=%H:%at --no-abbrev));
Not sure what --reverse buys here, but this is another
place where things get slower as more messages are added.

Storing the most recent commit hash would allow avoiding the
memory overhead of giant %have and %authored_at tables; that
should also improve fork() performance on Linux, since a smaller
process is cheaper to fork.

Being able to use shorter filenames might save some space
in the kernel dentry cache, too :)
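
A rough sketch of the "store the most recent commit hash" idea,
assuming a one-line state file somewhere under $md. The helper names
and the file format are invented for illustration; for v2, each
N.git would need its own entry.

```perl
use strict;
use warnings;

# Hypothetical: return the commit ID stored in $file, or undef if the
# file is missing or does not look like a hex object ID.
sub read_state {
    my ($file) = @_;
    open(my $fh, '<', $file) or return;  # first run: no state yet
    my $oid = <$fh>;
    close($fh);
    return unless defined $oid;
    chomp $oid;
    return $oid =~ /\A[0-9a-f]{40,}\z/ ? $oid : undef;
}

# Hypothetical: replace the state file via rename() so a crash never
# leaves a half-written state behind.
sub write_state {
    my ($file, $oid) = @_;
    my $tmp = "$file.tmp.$$";
    open(my $fh, '>', $tmp) or die "open $tmp: $!\n";
    print $fh "$oid\n" or die "write $tmp: $!\n";
    close($fh) or die "close $tmp: $!\n";
    rename($tmp, $file) or die "rename $tmp => $file: $!\n";
}

# Arguments for an incremental "git log": only commits after $last,
# oldest first, so the final commit seen is the one to record.
sub log_args {
    my ($last) = @_;
    my @args = qw(log --reverse --raw --format=%H:%at --no-abbrev);
    push @args, defined($last) ? "$last..HEAD" : 'HEAD';
    return @args;
}
```

With this, --reverse does become useful: the last %H line read is the
newest commit and is what write_state() should store. If history is
ever rewritten, the stored commit may no longer exist and a full
rescan would be needed.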
>     my $time;
>     while (<$git_fh>) {
>         chomp;
>         if (/^[0-9a-f]{40}:(\d+)$/) {
>             $time = $1;
>         }
>         elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
>             my $blob = $1;
>
>             next if $have{$blob};
>             deliver($pi, $blob, $time);
>         }
>     }
>     close($git_fh);
> } else { # V2 format
>     my @repos = glob("$pi/[0-9]*.git");
Keep in mind that's lexically sorted, so the order becomes
unexpected when 10.git rolls around (LKML is at 8.git now).
I don't believe order really matters, though...
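
If order ever does matter, sorting by the numeric epoch is cheap.
A small sketch; sort_epochs is a name made up for this example:

```perl
use strict;
use warnings;

# Sort "N.git" paths by their numeric epoch rather than lexically,
# so that 10.git comes after 9.git. Paths that do not end in
# "<digits>.git" are dropped.
sub sort_epochs {
    return map { $_->[1] }
        sort { $a->[0] <=> $b->[0] }
        map { /(\d+)\.git\z/ ? [ $1, $_ ] : () } @_;
}
```

Usage: my @repos = sort_epochs(glob("$pi/[0-9]*.git"));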
>     unless (@repos) {
>         die "$0: no V2 *.git repositories found.\n";
>     }
>
>     for my $repo (@repos) {
>         my %authored_at;
>
>         # We store the commit dates per tree(!), not per commit,
>         # because "git rev-list --objects" will print only the tree.
>         # But trees are as unique as messages.
>         open(my $git_fh, "-|", "git", "-C", $repo,
>              qw(log --pretty=%T:%at @));
>         while (<$git_fh>) {
>             chomp;
>             my ($ref, $time) = split(':');
>             $authored_at{$ref} = $time;
>         }
>         close($git_fh);
That's a bit odd and seems unnecessary. Using
(log --raw --format=%H:%at --no-abbrev), similar to what you do
in the v1 code, should still work for v2.
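
For example, one parser could serve both formats. This is only a
sketch under my assumptions about the --raw output; each_added_blob
and its callback interface are invented here, and skipping the "d"
path is a guess at how to avoid delivering v2 deletion markers.

```perl
use strict;
use warnings;

# Hypothetical parser shared by v1 and v2: reads the output of
# "git log --raw --format=%H:%at --no-abbrev" from $fh and calls
# $cb->($blob, $time) for every added file.
sub each_added_blob {
    my ($fh, $cb) = @_;
    my $time;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /\A[0-9a-f]{40,}:(\d+)\z/) {
            # "%H:%at" line: remember the author time of this commit
            $time = $1;
        } elsif ($line =~ /\A:000000 \d+ 0{40,} ([0-9a-f]{40,}) A\t(.+)\z/) {
            my ($blob, $path) = ($1, $2);
            # v2 stores messages as "m"; "d" is assumed to mark a deletion
            next if $path eq 'd';
            $cb->($blob, $time);
        }
    }
}
```

The callback would do the "next if $have{$blob}" check and call
deliver(), so %authored_at and the second rev-list pass go away.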
>         open($git_fh, "-|", "git", "-C", $repo,
>              qw(rev-list --reverse --objects @));
>         my $time;
>         my $blob;
>         while (<$git_fh>) {
>             chomp;
>             if (/ $/) {
>                 $time = $authored_at{$`};
>             }
>             elsif (/ m$/) {
>                 $blob = $`;
>
>                 next if $have{$blob};
>                 deliver($repo, $blob, $time);
Being an old Perl user myself, $` is one of those things I
quickly learned to avoid (and promptly forgot the exact meaning
of, so I had to check the perlvar manpage). It appears the
performance problems associated with $` are no longer relevant in
newer Perls.

However, using /([0-9a-f]{40})/ and $1 as you do in the v1
code can improve readability a bit.

In general, I think the v2 code path should be more similar to
the v1 path.

And I've been meaning to change {40} to {40,} in places to
deal with the git SHA-256 (and beyond) support.
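
Something along these lines would avoid $` and accept longer hashes.
Just a sketch; parse_object_line is a made-up name:

```perl
use strict;
use warnings;

# Split one line of "git rev-list --objects" output into (oid, path)
# using explicit captures instead of $`. {40,} accepts SHA-1 as well
# as longer hex object IDs such as SHA-256. Lines without a space
# after the ID (commits) return an empty list.
sub parse_object_line {
    my ($line) = @_;
    return $line =~ /\A([0-9a-f]{40,}) (.*)\z/ ? ($1, $2) : ();
}
```

In the loop, an empty path would then stand for the root tree
("if ($path eq '')") and "m" for the message blob.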
>             }
>             # XXX 'd' (= deleted) messages are not respected.
>         }
>
>         close($git_fh);
>     }
> }
>
> exit $status;
>
> Enjoy,
Thanks for sharing!