user/dev discussion of public-inbox itself
From: Leah Neukirchen <leah@vuxu.org>
To: meta@public-inbox.org
Subject: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Date: Wed, 15 Apr 2020 22:06:59 +0200
Message-ID: <87zhbcldi4.fsf@vuxu.org>

Hi *,

I wrote a small script today (see below); maybe it is useful for
you, too.  The task was to incrementally add messages from a
public-inbox V1 or V2 repo on local disk to a Maildir.  Note that
"ssoma sync" only supports V1.

impibe encodes the hashes of the blobs it has seen into the Maildir
file names, so subsequent runs add only new messages.  Speed is quite
good.  The only dependencies are git and perl.
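
To illustrate the file name scheme, here is a minimal standalone sketch.
The helper names maildir_name and blob_of, and all sample values, are
made up for this example and are not part of impibe; only the sprintf
format and the regex are taken from the script below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a Maildir file name with the blob hash embedded as ",B=<hash>".
# Same format string as in impibe; the arguments are illustrative.
sub maildir_name {
    my ($time, $pid, $seq, $host, $blob, $flags) = @_;
    return sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $pid, $seq, $host, $blob, $flags;
}

# Recover the blob hash from a file name, or undef if none is embedded.
sub blob_of {
    my ($name) = @_;
    return $name =~ /,B=([0-9a-f]{40})/ ? $1 : undef;
}

# Hypothetical sample values.
my $blob = "0123456789abcdef0123456789abcdef01234567";
my $name = maildir_name(1586981219, 4242, 1, "example", $blob, "S");

print "$name\n";
print blob_of($name), "\n";
```

A later run only needs to apply blob_of to every file in cur/ to know
which blobs can be skipped.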

This allows keeping local Maildir mirrors of public-inbox clones, for
example to use with other Maildir indexers or MUAs.  It is much faster
than syncing over NNTP.

I don't have much Perl experience, so please tell me about any
problems with my code.

#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use autodie qw(open close fork);
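use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
# Make glob() not split its pattern on whitespace, so that paths
# containing spaces work.
use File::Glob qw(:bsd_glob);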

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my $pi = $ARGV[0];
my $md = $ARGV[1];

my $hostname = hostname;
my $flags = "S";

my %have;

for my $mail (glob("$md/cur/*:2*")) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

my $status = 0;

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

    my $mail = "$md/cur/$name";
                
    my $pid = fork;
    if ($pid == 0) {
        no autodie 'sysopen';
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
            die "$0: couldn't create $md/cur/$name: $!\n";
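        # If exec fails, die here so the child does not return into
        # the caller's loop.
        exec "git", "-C", $repo, "cat-file", "blob", $blob
            or die "$0: couldn't exec git: $!\n";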
    }
    elsif (defined $pid) {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

if (-f "$pi/HEAD") { # V1 format
    open(my $git_fh, "-|", "git", "-C", $pi,
         qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[0-9a-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
            my $blob = $1;

            next  if $have{$blob};
            deliver($pi, $blob, $time);
        }
    }
    close($git_fh);
} else { # V2 format
    my @repos = glob("$pi/[0-9]*.git");

    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }

    for my $repo (@repos) {
        my %authored_at;
        
        # We store the commit dates per tree(!), not per commit,
        # because "git rev-list --objects" will print only the tree.
        # But trees are as unique as messages.
        open(my $git_fh, "-|", "git", "-C", $repo,
             qw(log --pretty=%T:%at @));
        while (<$git_fh>) {
            chomp;
            my ($ref, $time) = split(':');
            $authored_at{$ref} = $time;
        }
        close($git_fh);
        
        open($git_fh, "-|", "git", "-C", $repo,
             qw(rev-list --reverse --objects @));
        my $time;
        my $blob;
        while (<$git_fh>) {
            chomp;
            if (/^([0-9a-f]{40}) $/) {      # root tree: empty path
                $time = $authored_at{$1};
            }
            elsif (/^([0-9a-f]{40}) m$/) {  # message blob
                $blob = $1;
                
                next  if $have{$blob};
                deliver($repo, $blob, $time);
            }
            # XXX 'd' (= deleted) messages are not respected.
        }
        
        close($git_fh);
    }
}

exit $status;
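
For reference, the V2 branch above tells objects apart by the path that
"git rev-list --objects" prints after each hash.  Here is a minimal
sketch of that classification, assuming 40-digit SHA-1 hashes and that
commits are printed without a path while the root tree is printed with
an empty one (which is what the script relies on).  The function name
classify and its labels are hypothetical, and what to do with 'd'
entries is left open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify one chomped line of "git rev-list --objects" output.
# Returns (kind, hash); kind is one of 'commit', 'tree', 'message',
# 'deleted' or 'other'.  These labels are made up for this sketch.
sub classify {
    my ($line) = @_;
    my ($hash, $path) = $line =~ /^([0-9a-f]{40})(?: (.*))?$/
        or return ('other', undef);
    return ('commit',  $hash) unless defined $path;  # no path at all
    return ('tree',    $hash) if $path eq '';        # root tree
    return ('message', $hash) if $path eq 'm';       # message blob
    return ('deleted', $hash) if $path eq 'd';       # deletion entry
    return ('other',   $hash);
}

# Hypothetical sample lines; the hashes are placeholders.
for my $line ("a" x 40, ("b" x 40) . " ", ("c" x 40) . " m",
              ("d" x 40) . " d") {
    my ($kind, $hash) = classify($line);
    print "$kind $hash\n";
}
```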

Enjoy,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/

Thread overview: 3+ messages
2020-04-15 20:06 Leah Neukirchen [this message]
2020-04-15 21:21 ` impibe: incrementally add messages from public-inbox V1/V2 to a Maildir Eric Wong
2020-04-15 21:48   ` Leah Neukirchen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
