user/dev discussion of public-inbox itself
From: Eric Wong <e@yhbt.net>
To: Leah Neukirchen <leah@vuxu.org>
Cc: meta@public-inbox.org
Subject: Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Date: Wed, 15 Apr 2020 21:21:06 +0000
Message-ID: <20200415212106.GA6284@dcvr>
In-Reply-To: <87zhbcldi4.fsf@vuxu.org>

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi *,
> 
> I wrote a small script today (see below), and maybe it is useful for
> you, too.  The task was to incrementally add messages from a
> public-inbox V1 and V2 repo on local disk to a Maildir.  Note that
> "ssoma sync" only does V1.
> 
> impibe encodes the seen blob hashes into Maildir file names, so only
> new messages are added on runs afterwards.  Speed is quite good.
> The only dependencies are git and perl.
> 
> This allows having local Maildir mirrors of public-inbox clones, for
> example to use with other Maildir indexers or MUA.  It is way faster
> than using NNTP to sync up.

Cool.  The NNTP client/server round trips add some latency, and
parallelizing/pipelining may help a bit, but raw access to git
will always be faster.

> I don't have much Perl experience, so please tell me any problems with
> my code.

No problem.  Nothing really Perl-specific; most of my comments
would apply to other languages, too.

> \f
> #!/usr/bin/perl -w
> # impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
> #
> # To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
> # all copyright and related or neighboring rights to this work.
> #
> # http://creativecommons.org/publicdomain/zero/1.0/
> 
> use v5.16;
> use Sys::Hostname;
> use autodie qw(open close fork);
> use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
> 
> if (@ARGV < 2) {
>     die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
> }
> 
> my $pi = $ARGV[0];
> my $md = $ARGV[1];
> 
> my $hostname = hostname;
> my $flags = "S";
> 
> my %have;
> 
> for my $mail (glob("$md/cur/*:2*")) {
>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>         $have{$1} = 1;
>     }
> }

That glob is going to be a problem with bigger Maildirs; but
then again giant Maildirs aren't great at all.

GLOB_NOSORT can alleviate some of the slowness:
  https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

opendir + readdir may also be used (probably slower with more
ops, but it would use less memory).
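A minimal sketch of the opendir + readdir variant, assuming the
",B=<hash>" naming convention from the quoted script.  seen_blobs
is a hypothetical helper name, not something from impibe:

```perl
use strict;
use warnings;

# Collect the blob hashes already present in one Maildir subdirectory.
# Names are read one at a time, so no sorted list of every path is
# built in memory (the %have table itself still is).
sub seen_blobs {
    my ($dir) = @_;
    my %have;
    opendir(my $dh, $dir) or die "opendir $dir: $!\n";
    while (defined(my $name = readdir($dh))) {
        # same ",B=<hash>" convention as the quoted script
        $have{$1} = 1 if $name =~ /,B=([0-9a-f]{40,})/;
    }
    closedir($dh);
    return \%have;
}
```

It would be called as seen_blobs("$md/cur"), and lookups become
$have->{$blob}.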

Maildir allows extra files such as $md/$foo_state_file(s), and I'd
probably store the most recent rev-list/log state there instead
(more below).

> my $status = 0;
> 
> sub deliver {
>     my ($repo, $blob, $time) = @_;
>     state $delivery = 0;
> 
>     $delivery++;
>     # Embed the blob hash into the file name so we can check easily
>     # which blobs we have already.
>     my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
>         $time, $$, $delivery, $hostname, $blob, $flags;
> 
>     my $mail = "$md/cur/$name";
>                 
>     my $pid = fork;
>     if ($pid == 0) {
>         no autodie 'sysopen';
>         sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
>             die "$0: couldn't create $md/cur/$name: $!\n";
>         exec "git", "-C", "$repo", "cat-file", "blob", $blob
>     }
>     elsif (defined $pid) {
>         waitpid $pid, 0;
>         if ($? == 0) {
>             say $mail;
>         } else {
>             $status = 1;
>         }
>     }

It's possible for an MUA to see a partially-written file in
$md/cur/$name with the above.

I seem to recall the correct way to write to Maildirs being:

1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
2) only the MUA is allowed to rename() files from /new/ => /cur/

That would require more globbing (of $md/new/ as well as
$md/cur/), or capturing the most recent commit hash and storing
it in $foo_state_file.
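A rough sketch of the tmp => new sequence, going from my
recollection of the Maildir conventions above.  deliver_new is a
hypothetical name, and the message is passed in as a string for
brevity, whereas the quoted script redirects "git cat-file" output
into the file.  As far as I know, names in new/ carry no ":2,<flags>"
suffix; the MUA adds that when it moves a file to cur/:

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Write $content under $md/tmp, then link it into $md/new so that
# readers never observe a partially-written message.  $uniq is the
# unique part of the file name (without a ":2,<flags>" suffix).
sub deliver_new {
    my ($md, $uniq, $content) = @_;
    my $tmp = "$md/tmp/$uniq";
    my $dst = "$md/new/$uniq";
    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "sysopen $tmp: $!\n";
    print $fh $content or die "print $tmp: $!\n";
    close($fh) or die "close $tmp: $!\n";
    # link() fails if $dst already exists, so nothing is clobbered
    unless (link($tmp, $dst)) {
        my $err = $!;
        unlink($tmp);
        die "link $tmp => $dst: $err\n";
    }
    unlink($tmp) or warn "unlink $tmp: $!\n";
    return $dst;
}
```

Since delivered files then sit in new/ until an MUA moves them,
scanning cur/ alone no longer tells you what has been delivered.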

> if (-f "$pi/HEAD") { # V1 format
>     open(my $git_fh, "-|", "git", "-C", $pi,
>          qw(log --reverse --raw --format=%H:%at --no-abbrev));

Not sure what --reverse buys here, but this is another place
where things get slower as more messages are added, since every
run walks the entire history.

Storing the most recent commit hash would avoid the memory
overhead of giant %have and %authored_at tables; a smaller
process also forks faster on Linux, since fork() has to copy
the page tables.
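Something along these lines is what I have in mind for the state
file; treat it as a sketch.  read_state, write_state and log_args
are hypothetical names, and the state file location (for example
one file per repo somewhere under $md) is an assumption:

```perl
use strict;
use warnings;

# Return the last delivered commit from the state file, or undef
# if the file is missing or does not contain a hex object ID.
sub read_state {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;
    my $line = <$fh>;
    close($fh);
    return unless defined $line;
    return $line =~ /^([0-9a-f]{40,})$/ ? $1 : undef;
}

# Replace the state file atomically: write a temp file, then rename().
sub write_state {
    my ($path, $commit) = @_;
    my $tmp = "$path.tmp.$$";
    open(my $fh, '>', $tmp) or die "open $tmp: $!\n";
    print $fh "$commit\n" or die "print $tmp: $!\n";
    close($fh) or die "close $tmp: $!\n";
    rename($tmp, $path) or die "rename $tmp => $path: $!\n";
}

# "git log" arguments covering only commits after $last
# (or everything reachable from HEAD when $last is undef).
sub log_args {
    my ($last) = @_;
    my @args = qw(log --raw --format=%H:%at --no-abbrev);
    push @args, defined($last) ? "$last..HEAD" : 'HEAD';
    return @args;
}
```

Without --reverse the first commit line printed is the newest one,
so that is the hash to hand to write_state, and only after the
whole run has succeeded.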

Being able to use shorter filenames might save some space
in the kernel dentry cache, too :)

>     my $time;
>     while (<$git_fh>) {
>         chomp;
>         if (/^[0-9a-f]{40}:(\d+)$/) {
>             $time = $1;
>         }
>         elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
>             my $blob = $1;
> 
>             next  if $have{$blob};
>             deliver($pi, $blob, $time);
>         }   
>     }
>     close($git_fh);
> } else { # V2 format
>     my @repos = glob("$pi/[0-9]*.git");

Keep in mind that's lexically sorted, so the order becomes
unexpected when 10.git rolls around (LKML is at 8.git now).
I don't believe order really matters, though...
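If the order ever does matter, sorting on the epoch number is
cheap.  A small sketch (sort_epochs is a hypothetical helper; it
drops anything that is not named <digits>.git):

```perl
use strict;
use warnings;

# Sort V2 epoch directories ("0.git", "1.git", ..., "10.git") by
# their numeric epoch rather than lexically.
sub sort_epochs {
    my @repos = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { m{(?:^|/)(\d+)\.git/?\z} ? [ $1, $_ ] : () } @repos;
}
```

Usage would be: my @repos = sort_epochs(glob("$pi/[0-9]*.git"));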

>     unless (@repos) {
>         die "$0: no V2 *.git repositories found.\n";
>     }
> 
>     for my $repo (@repos) {
>         my %authored_at;
>         
>         # We store the commit dates per tree(!), not per commit,
>         # because "git rev-list --objects" will print only the tree.
>         # But trees are as unique as messages.
>         open(my $git_fh, "-|", "git", "-C", $repo,
>              qw(log --pretty=%T:%at @));
>         while (<$git_fh>) {
>             chomp;
>             my ($ref, $time) = split(':');
>             $authored_at{$ref} = $time;
>         }
>         close($git_fh);

That's a bit odd and seems unnecessary.  Using
(log --raw --format=%H:%at --no-abbrev) similar to how you do
with the v1 code should still work for v2.
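Roughly, one parser could then serve both formats.  This is only a
sketch: each_raw_entry is a hypothetical name, and it hands every
raw diff line to a callback so the caller decides which entries
count as messages:

```perl
use strict;
use warnings;

# Parse "git log --raw --format=%H:%at --no-abbrev" output from $fh
# and call $cb->($blob, $time, $status, $path) for each raw line.
sub each_raw_entry {
    my ($fh, $cb) = @_;
    my $time;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^[0-9a-f]{40,}:(\d+)$/) {
            $time = $1;    # author time for the raw lines that follow
        }
        elsif ($line =~ /^:[0-7]{6} [0-7]{6} [0-9a-f]{40,} ([0-9a-f]{40,}) ([A-Z])\d*\t(.*)$/) {
            # copy the captures before calling out, in case the
            # callback runs regexps of its own
            my ($blob, $status, $path) = ($1, $2, $3);
            $cb->($blob, $time, $status, $path);
        }
    }
}
```

For v1 the callback would keep entries with status "A", as the
quoted code does.  For v2 I'd expect it to keep entries whose path
is "m"; whether those show up as "A" or "M" depends on the previous
commit, so that filter is an assumption to check against a real
epoch.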

>         open($git_fh, "-|", "git", "-C", $repo,
>              qw(rev-list --reverse --objects @));
>         my $time;
>         my $blob;
>         while (<$git_fh>) {
>             chomp;
>             if (/ $/) {
>                 $time = $authored_at{$`};
>             }
>             elsif (/ m$/) {
>                 $blob = $`;
>                 
>                 next  if $have{$blob};
>                 deliver($repo, $blob, $time);

Being an old Perl user myself, $` is one of those things I
quickly learned to avoid (and promptly forgot the exact meaning
of, so I had to check the perlvar manpage).  It appears the
performance problems associated with $` are no longer relevant
in newer Perls.

However, using /([0-9a-f]{40})/ and $1 as you do with the v1
code can improve readability a bit.

In general, I think the v2 code path should be more similar to
the v1 path.

And I've been meaning to change {40} to {40,} in places to
deal with the git SHA-256 (and beyond) support.
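For the rev-list line, a capture group with {40,} could look like
this (blob_for_m is a hypothetical helper, shown only to illustrate
the regexp):

```perl
use strict;
use warnings;

# Return the blob hash from a "git rev-list --objects" line naming
# the file "m", or undef for any other line.  {40,} also accepts
# object IDs longer than SHA-1, such as 64 hex digits for SHA-256.
sub blob_for_m {
    my ($line) = @_;
    return $line =~ /^([0-9a-f]{40,}) m$/ ? $1 : undef;
}
```

The anchored pattern also rejects lines that merely end in " m"
without a leading hash, which the / m$/ + $` combination accepts.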

>             }
>             # XXX 'd' (= deleted) messages are not respected.
>         }
>         
>         close($git_fh);
>     }
> }
> 
> exit $status;
> \f
> 
> Enjoy,

Thanks for sharing!
