From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 088451F751;
	Wed, 15 Apr 2020 21:21:07 +0000 (UTC)
Date: Wed, 15 Apr 2020 21:21:06 +0000
From: Eric Wong
To: Leah Neukirchen
Cc: meta@public-inbox.org
Subject: Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Message-ID: <20200415212106.GA6284@dcvr>
References: <87zhbcldi4.fsf@vuxu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <87zhbcldi4.fsf@vuxu.org>

Leah Neukirchen wrote:
> Hi *,
>
> I wrote a small script today (see below), and maybe it is useful for
> you, too. The task was to incrementally add messages from a
> public-inbox V1 and V2 repo on local disk to a Maildir. Note that
> "ssoma sync" only does V1.
>
> impibe encodes the seen blob hashes into Maildir file names, so only
> new messages are added on runs afterwards. Speed is quite good.
> The only dependencies are git and perl.
>
> This allows having local Maildir mirrors of public-inbox clones, for
> example to use with other Maildir indexers or MUA. It is way faster
> than using NNTP to sync up.

Cool. There's some latency introduced by the NNTP client/server
and maybe parallelizing/pipelining can help, a bit; but raw
access to git will always be faster.

> I don't have much Perl experience, so please tell me any problems with
> my code.

No problem. Nothing really Perl-specific, most of my comments
would apply to other languages.
>
> #!/usr/bin/perl -w
> # impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
> #
> # To the extent possible under law, Leah Neukirchen has waived
> # all copyright and related or neighboring rights to this work.
> #
> # http://creativecommons.org/publicdomain/zero/1.0/
>
> use v5.16;
> use Sys::Hostname;
> use autodie qw(open close fork);
> use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
>
> if (@ARGV < 2) {
>     die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
> }
>
> my $pi = $ARGV[0];
> my $md = $ARGV[1];
>
> my $hostname = hostname;
> my $flags = "S";
>
> my %have;
>
> for my $mail (glob("$md/cur/*:2*")) {
>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>         $have{$1} = 1;
>     }
> }

That glob is going to be a problem with bigger Maildirs, but
then again giant Maildirs aren't great at all. GLOB_NOSORT
can alleviate some of the slowness:

  https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

opendir + readdir may also be used (probably slower with more
ops, but it would use less memory).

Maildir allows $md/$foo_state_file(s), and I'd probably store
the most recent rev-list/log state there instead (more below).

> my $status = 0;
>
> sub deliver {
>     my ($repo, $blob, $time) = @_;
>     state $delivery = 0;
>
>     $delivery++;
>     # Embed the blob hash into the file name so we can check easily
>     # which blobs we have already.
>     my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
>         $time, $$, $delivery, $hostname, $blob, $flags;
>
>     my $mail = "$md/cur/$name";
>
>     my $pid = fork;
>     if ($pid == 0) {
>         no autodie 'sysopen';
>         sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
>             die "$0: couldn't create $md/cur/$name: $!\n";
>         exec "git", "-C", "$repo", "cat-file", "blob", $blob
>     }
>     elsif (defined $pid) {
>         waitpid $pid, 0;
>         if ($? == 0) {
>             say $mail;
>         } else {
>             $status = 1;
>         }
>     }

It's possible for an MUA to see a partially-written file
in $md/cur/$name with the above.
I seem to recall the correct way to write to Maildirs being:

1) write to "$md/tmp/" first, then link()/rename() it into
   "$md/new/$name"

2) only the MUA is allowed to rename() files from /new/ => /cur/

That would require more globbing, or capturing the most recent
commit hash and storing it in $foo_state_file.

> if (-f "$pi/HEAD") {  # V1 format
>     open(my $git_fh, "-|", "git", "-C", $pi,
>         qw(log --reverse --raw --format=%H:%at --no-abbrev));

Not sure what --reverse buys here, but this is another place
where things get slower as more messages are added.

Storing the most recent commit hash would allow avoiding the
memory overhead of giant %have and %authored_at tables; that
will improve fork() performance on Linux.

Being able to use shorter filenames might save some space in
the kernel dentry cache, too :)

>     my $time;
>     while (<$git_fh>) {
>         chomp;
>         if (/^[0-9a-f]{40}:(\d+)$/) {
>             $time = $1;
>         }
>         elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
>             my $blob = $1;
>
>             next if $have{$blob};
>             deliver($pi, $blob, $time);
>         }
>     }
>     close($git_fh);
> } else {  # V2 format
>     my @repos = glob("$pi/[0-9]*.git");

Keep in mind that's lexically sorted, so order becomes unexpected
when 10.git rolls around (LKML is at 8.git now).

I don't believe order really matters, though...

>     unless (@repos) {
>         die "$0: no V2 *.git repositories found.\n";
>     }
>
>     for my $repo (@repos) {
>         my %authored_at;
>
>         # We store the commit dates per tree(!), not per commit,
>         # because "git rev-list --objects" will print only the tree.
>         # But trees are as unique as messages.
>         open(my $git_fh, "-|", "git", "-C", $repo,
>             qw(log --pretty=%T:%at @));
>         while (<$git_fh>) {
>             chomp;
>             my ($ref, $time) = split(':');
>             $authored_at{$ref} = $time;
>         }
>         close($git_fh);

That's a bit odd and seems unnecessary. Using
(log --raw --format=%H:%at --no-abbrev) similar to how you do
with the v1 code should still work for v2.
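
Roughly what I mean, as an untested sketch: one parser for the
"log --raw --format=%H:%at --no-abbrev" output that could serve
both the v1 and v2 paths. parse_log_raw() is a made-up name, the
input below is canned rather than a real git pipe, and I'm assuming
v2 messages live at path "m" as your / m$/ match suggests. The
caller would still have to filter on path and consult %have.

```perl
use strict;
use warnings;

# parse_log_raw() is a hypothetical helper, not part of impibe.
# It reads lines as produced by
#   git log --raw --format=%H:%at --no-abbrev
# and returns [ $blob, $time, $path ] for every blob added by a commit.
sub parse_log_raw {
    my ($fh) = @_;
    my ($time, @added);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /\A[0-9a-f]{40,}:(\d+)\z/) {
            # commit line: remember the author time for the raw lines below
            $time = $1;
        }
        elsif ($line =~ /\A:000000 \d+ 0{40,} ([0-9a-f]{40,}) A\t(.+)\z/) {
            # raw diff line for an added file: new blob id and its path
            push @added, [ $1, $time, $2 ];
        }
    }
    return @added;
}

# Canned input standing in for the git pipe (assumption: v2-style path "m")
my $sample = join('',
    ('a' x 40) . ":1586985666\n",
    "\n",
    ':000000 100644 ' . ('0' x 40) . ' ' . ('b' x 40) . " A\tm\n",
);
open(my $fh, '<', \$sample) or die "open: $!\n";
for my $msg (parse_log_raw($fh)) {
    my ($blob, $time, $path) = @$msg;
    print "$blob $time $path\n";
}
```

With that, %authored_at and the separate rev-list pass would not be
needed, since the time and the blob come out of the same stream.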
>         open($git_fh, "-|", "git", "-C", $repo,
>             qw(rev-list --reverse --objects @));
>         my $time;
>         my $blob;
>         while (<$git_fh>) {
>             chomp;
>             if (/ $/) {
>                 $time = $authored_at{$`};
>             }
>             elsif (/ m$/) {
>                 $blob = $`;
>
>                 next if $have{$blob};
>                 deliver($repo, $blob, $time);

Being an old Perl user myself, $` is one of those things I
quickly learned to avoid (and promptly forgot the exact meaning
of, so I had to check the perlvar manpage).

It appears the performance problems associated with $` are no
longer relevant in newer Perls. However, using /([0-9a-f]{40})/
and $1 as you do with the v1 code can improve readability a bit.

In general, I think the v2 code path should be more similar to
the v1 path. And I've been meaning to change {40} to {40,} in
places to deal with the git SHA-256 (and beyond) support.

>             }
>             # XXX 'd' (= deleted) messages are not respected.
>         }
>
>         close($git_fh);
>     }
> }
>
> exit $status;
>
>
> Enjoy,

Thanks for sharing!
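
Going back to the Maildir delivery order mentioned earlier, here
is an untested sketch of the tmp/ => new/ sequence. It assumes the
message is already in memory instead of being streamed from
"git cat-file"; deliver_to_new() and the sample file name are made
up for illustration. Note the name has no ":2,<flags>" suffix,
since that only belongs on files in cur/.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use File::Temp qw(tempdir);

# deliver_to_new() is a hypothetical helper, not part of impibe.
# It writes $content to $md/tmp/$name, then links it into $md/new/$name,
# so a reader of new/ never sees a partially-written file.
sub deliver_to_new {
    my ($md, $name, $content) = @_;
    my ($tmp, $new) = ("$md/tmp/$name", "$md/new/$name");

    # O_EXCL so an existing tmp file is never clobbered
    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "couldn't create $tmp: $!\n";
    print $fh $content or die "write $tmp: $!\n";
    close($fh) or die "close $tmp: $!\n";

    # link() fails if $new already exists, whereas rename() would replace it
    unless (link($tmp, $new)) {
        my $err = $!;
        unlink($tmp);
        die "link $tmp => $new: $err\n";
    }
    unlink($tmp) or die "unlink $tmp: $!\n";
    return $new;
}

# Throwaway Maildir for demonstration only
my $md = tempdir(CLEANUP => 1);
for my $sub (qw(tmp new cur)) {
    mkdir "$md/$sub" or die "mkdir $md/$sub: $!\n";
}
my $path = deliver_to_new($md, "1586985666.P00001Q1.example",
                          "Subject: hi\n\nbody\n");
print "$path\n";
```

Moving files from new/ to cur/ is then left to the MUA.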