user/dev discussion of public-inbox itself
* impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
@ 2020-04-15 20:06 Leah Neukirchen
  2020-04-15 21:21 ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Leah Neukirchen @ 2020-04-15 20:06 UTC (permalink / raw)
  To: meta

Hi *,

I wrote a small script today (see below); maybe it is useful for
you, too.  The task was to incrementally add messages from a local
public-inbox V1 or V2 repo to a Maildir.  Note that "ssoma sync" only
handles V1.

impibe encodes the blob hashes it has seen into the Maildir file names,
so later runs add only new messages.  Speed is quite good.
The only dependencies are git and perl.
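
As a minimal sketch of that naming scheme, assuming the same sprintf
format and regex as the script below: the helper names maildir_name and
blob_of are made up for illustration, and the hash, pid and hostname are
dummy values.

```perl
use strict;
use warnings;

# Build a Maildir file name that carries the blob hash (hypothetical
# helper; the format string is the one used in the script).
sub maildir_name {
    my ($time, $pid, $seq, $host, $blob, $flags) = @_;
    return sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $pid, $seq, $host, $blob, $flags;
}

# Recover the blob hash from such a name, or undef if there is none.
sub blob_of {
    my ($name) = @_;
    return $name =~ /,B=([0-9a-f]{40})/ ? $1 : undef;
}

my $blob = "0123456789abcdef0123456789abcdef01234567";  # dummy hash
my $name = maildir_name(1586981160, 4242, 1, "example", $blob, "S");

print "$name\n";
print blob_of($name), "\n";
```

Scanning the file names in cur/ with blob_of is then enough to know
which blobs are already delivered; no separate state file is needed.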

This allows keeping local Maildir mirrors of public-inbox clones, for
example to use with other Maildir indexers or MUAs.  It is much faster
than syncing over NNTP.

I don't have much Perl experience, so please tell me about any problems
with my code.

\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use autodie qw(open close fork);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my $pi = $ARGV[0];
my $md = $ARGV[1];

my $hostname = hostname;
my $flags = "S";

my %have;

for my $mail (glob("$md/cur/*:2*")) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

my $status = 0;

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

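    my $mail = "$md/cur/$name";

    my $pid = fork;
    if ($pid == 0) {
        # Child: point STDOUT at the new file, then let git write the blob.
        # O_EXCL ensures an existing file is never overwritten.
        no autodie 'sysopen';
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
            die "$0: couldn't create $mail: $!\n";
        # Without the "or die", a failed exec would let the child fall
        # through and keep running the parent's loop.
        exec "git", "-C", $repo, "cat-file", "blob", $blob
            or die "$0: couldn't exec git: $!\n";
    }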
    elsif (defined $pid) {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

if (-f "$pi/HEAD") { # V1 format
    open(my $git_fh, "-|", "git", "-C", $pi,
         qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[0-9a-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
            my $blob = $1;

            next  if $have{$blob};
            deliver($pi, $blob, $time);
        }   
    }
    close($git_fh);
} else { # V2 format
    my @repos = glob("$pi/[0-9]*.git");

    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }

    for my $repo (@repos) {
        my %authored_at;
        
        # We store the commit dates per tree(!), not per commit,
        # because "git rev-list --objects" will print only the tree.
        # But trees are as unique as messages.
        open(my $git_fh, "-|", "git", "-C", $repo,
             qw(log --pretty=%T:%at @));
        while (<$git_fh>) {
            chomp;
            my ($ref, $time) = split(':');
            $authored_at{$ref} = $time;
        }
        close($git_fh);
        
        open($git_fh, "-|", "git", "-C", $repo,
             qw(rev-list --reverse --objects @));
        my $time;
        my $blob;
        while (<$git_fh>) {
            chomp;
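            # Use captures rather than $`, which slows down every regex
            # in the program on Perl versions before 5.20.
            if (/^([0-9a-f]+) $/) {         # root tree: empty path
                $time = $authored_at{$1};
            }
            elsif (/^([0-9a-f]+) m$/) {     # message blob, stored as "m"
                $blob = $1;

                next  if $have{$blob};
                deliver($repo, $blob, $time);
            }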
            # XXX 'd' (= deleted) messages are not respected.
        }
        
        close($git_fh);
    }
}

exit $status;
\f
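
To make the two parsing loops easier to follow, here is a rough sketch
that runs the same patterns over hand-written sample lines instead of
real git output.  The function names parse_v1 and parse_v2, the path in
the raw diff line, and all hashes are invented for illustration; the
line shapes are the ones the script above expects.

```perl
use strict;
use warnings;

# V1: "git log --reverse --raw --format=%H:%at --no-abbrev" prints a
# "commit:time" line, followed by one raw diff line per changed path.
sub parse_v1 {
    my (@lines) = @_;
    my ($time, @found);
    for (@lines) {
        if (/^[0-9a-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
            push @found, [$1, $time];    # newly added blob
        }
    }
    return @found;
}

# V2: "git rev-list --objects" prints the root tree as "HASH " (empty
# path) and the message blob as "HASH m"; times are looked up by tree.
sub parse_v2 {
    my ($authored_at, @lines) = @_;
    my ($time, @found);
    for (@lines) {
        if (/^([0-9a-f]{40}) $/) {
            $time = $authored_at->{$1};
        }
        elsif (/^([0-9a-f]{40}) m$/) {
            push @found, [$1, $time];
        }
    }
    return @found;
}

# Dummy 40-character hashes, for illustration only.
my ($commit, $tree, $blob) = ("c" x 40, "e" x 40, "b" x 40);
my %authored_at = ($tree => 1586981160);

my @v1 = parse_v1("${commit}:1586981160", "",
    ":000000 100644 " . ("0" x 40) . " ${blob} A\tsome/path");
my @v2 = parse_v2(\%authored_at, $commit, "${tree} ", "${blob} m");

printf "V1: %s %d\n", @{ $v1[0] };
printf "V2: %s %d\n", @{ $v2[0] };
```

Both calls yield the same (blob, time) pair here; in the real script
each such pair is handed to deliver unless the blob is already in %have.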

Enjoy,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/
