user/dev discussion of public-inbox itself
* impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
@ 2020-04-15 20:06 Leah Neukirchen
  2020-04-15 21:21 ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Leah Neukirchen @ 2020-04-15 20:06 UTC (permalink / raw)
  To: meta

Hi *,

I wrote a small script today (see below), and maybe it is useful for
you, too.  The task was to incrementally add messages from a
public-inbox V1 and V2 repo on local disk to a Maildir.  Note that
"ssoma sync" only does V1.

impibe encodes the seen blob hashes into Maildir file names, so only
new messages are added on runs afterwards.  Speed is quite good.
The only dependencies are git and perl.

This allows having local Maildir mirrors of public-inbox clones, for
example to use with other Maildir indexers or MUA.  It is way faster
than using NNTP to sync up.

I don't have much Perl experience, so please tell me any problems with
my code.

\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use autodie qw(open close fork);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my $pi = $ARGV[0];
my $md = $ARGV[1];

my $hostname = hostname;
my $flags = "S";

my %have;

for my $mail (glob("$md/cur/*:2*")) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

my $status = 0;

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

    my $mail = "$md/cur/$name";
                
    my $pid = fork;
    if ($pid == 0) {
        no autodie 'sysopen';
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
            die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob
    }
    elsif (defined $pid) {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

if (-f "$pi/HEAD") { # V1 format
    open(my $git_fh, "-|", "git", "-C", $pi,
         qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[0-9a-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
            my $blob = $1;

            next  if $have{$blob};
            deliver($pi, $blob, $time);
        }   
    }
    close($git_fh);
} else { # V2 format
    my @repos = glob("$pi/[0-9]*.git");

    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }

    for my $repo (@repos) {
        my %authored_at;
        
        # We store the commit dates per tree(!), not per commit,
        # because "git rev-list --objects" will print only the tree.
        # But trees are as unique as messages.
        open(my $git_fh, "-|", "git", "-C", $repo,
             qw(log --pretty=%T:%at @));
        while (<$git_fh>) {
            chomp;
            my ($ref, $time) = split(':');
            $authored_at{$ref} = $time;
        }
        close($git_fh);
        
        open($git_fh, "-|", "git", "-C", $repo,
             qw(rev-list --reverse --objects @));
        my $time;
        my $blob;
        while (<$git_fh>) {
            chomp;
            if (/ $/) {
                $time = $authored_at{$`};
            }
            elsif (/ m$/) {
                $blob = $`;
                
                next  if $have{$blob};
                deliver($repo, $blob, $time);
            }
            # XXX 'd' (= deleted) messages are not respected.
        }
        
        close($git_fh);
    }
}

exit $status;
\f

Enjoy,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/


* Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
  2020-04-15 20:06 impibe: incrementally add messages from public-inbox V1/V2 to a Maildir Leah Neukirchen
@ 2020-04-15 21:21 ` Eric Wong
  2020-04-15 21:48   ` Leah Neukirchen
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2020-04-15 21:21 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi *,
> 
> I wrote a small script today (see below), and maybe it is useful for
> you, too.  The task was to incrementally add messages from a
> public-inbox V1 and V2 repo on local disk to a Maildir.  Note that
> "ssoma sync" only does V1.
> 
> impibe encodes the seen blob hashes into Maildir file names, so only
> new messages are added on runs afterwards.  Speed is quite good.
> The only dependencies are git and perl.
> 
> This allows having local Maildir mirrors of public-inbox clones, for
> example to use with other Maildir indexers or MUA.  It is way faster
> than using NNTP to sync up.

Cool.  There's some latency introduced by the NNTP client/server
and maybe parallelizing/pipelining can help, a bit; but raw
access to git will always be faster.

> I don't have much Perl experience, so please tell me any problems with
> my code.

No problem.  Nothing really Perl-specific, most of my comments
would apply to other languages.

> \f
> #!/usr/bin/perl -w
> # impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
> #
> # To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
> # all copyright and related or neighboring rights to this work.
> #
> # http://creativecommons.org/publicdomain/zero/1.0/
> 
> use v5.16;
> use Sys::Hostname;
> use autodie qw(open close fork);
> use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
> 
> if (@ARGV < 2) {
>     die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
> }
> 
> my $pi = $ARGV[0];
> my $md = $ARGV[1];
> 
> my $hostname = hostname;
> my $flags = "S";
> 
> my %have;
> 
> for my $mail (glob("$md/cur/*:2*")) {
>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>         $have{$1} = 1;
>     }
> }

That glob is going to be a problem with bigger Maildirs; but
then again giant Maildirs aren't great at all.

GLOB_NOSORT can alleviate some of the slowness:
  https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

opendir + readdir may also be used (probably slower with more
ops, but it would use less memory).
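
A minimal sketch of the opendir + readdir variant, assuming the same
",B=<blob>" file name convention impibe uses; the helper name
seen_blobs is made up for illustration and is not part of the script:

```perl
use v5.16;
use warnings;

# Collect the blob hashes that were already delivered by reading the
# directory one entry at a time, instead of building the whole list of
# paths in memory first as glob() does.
sub seen_blobs {
    my ($dir) = @_;
    my %have;
    opendir(my $dh, $dir) or die "$0: couldn't open $dir: $!\n";
    while (defined(my $name = readdir($dh))) {
        $have{$1} = 1  if $name =~ /,B=([0-9a-f]{40})/;
    }
    closedir($dh);
    return \%have;
}
```

It would be called as seen_blobs("$md/cur"); the %have table itself is
still as large as before, only the temporary list of file names goes away.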

Maildir allows $md/$foo_state_file(s), and I'd probably
store the most recent rev-list/log state there, instead, (more
below)

> my $status = 0;
> 
> sub deliver {
>     my ($repo, $blob, $time) = @_;
>     state $delivery = 0;
> 
>     $delivery++;
>     # Embed the blob hash into the file name so we can check easily
>     # which blobs we have already.
>     my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
>         $time, $$, $delivery, $hostname, $blob, $flags;
> 
>     my $mail = "$md/cur/$name";
>                 
>     my $pid = fork;
>     if ($pid == 0) {
>         no autodie 'sysopen';
>         sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
>             die "$0: couldn't create $md/cur/$name: $!\n";
>         exec "git", "-C", "$repo", "cat-file", "blob", $blob
>     }
>     elsif (defined $pid) {
>         waitpid $pid, 0;
>         if ($? == 0) {
>             say $mail;
>         } else {
>             $status = 1;
>         }
>     }

It's possible for an MUA to see a partially-written file in
$md/cur/$name with the above.

I seem to recall the correct way to write to Maildirs being:

1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
2) only the MUA is allowed to rename() files from /new/ => /cur/

That would require more globbing, or capturing the most recent
commit hash and storing it in $foo_state_file
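
The two steps above could be sketched roughly as follows.  This is an
illustration only: the helper deliver_atomically is hypothetical, and it
takes the message as a string, whereas impibe streams it from
"git cat-file".  Files in new/ conventionally carry no ":2," flag suffix,
so $name is assumed to be the unique part plus ",B=<blob>".

```perl
use v5.16;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Write the message under tmp/ first, then rename it into new/, so a
# reader scanning new/ never sees a partially written file.
sub deliver_atomically {
    my ($md, $name, $content) = @_;
    my $tmp = "$md/tmp/$name";
    my $new = "$md/new/$name";

    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "$0: couldn't create $tmp: $!\n";
    print $fh $content or die "$0: couldn't write $tmp: $!\n";
    close($fh) or die "$0: couldn't close $tmp: $!\n";

    rename($tmp, $new) or die "$0: couldn't rename $tmp to $new: $!\n";
    return $new;
}
```

rename() silently replaces an existing file of the same name in new/;
link() followed by unlink() fails instead, which is one reason to prefer
it when names might collide.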

> if (-f "$pi/HEAD") { # V1 format
>     open(my $git_fh, "-|", "git", "-C", $pi,
>          qw(log --reverse --raw --format=%H:%at --no-abbrev));

Not sure what --reverse buys, here; but this is another
place where things get slower as more messages are added.

Storing the most recent commit hash would allow avoiding the
memory overhead of giant %have and %authored_at tables; that
will improve fork() performance on Linux.

Being able to use shorter filenames might save some space
in the kernel dentry cache, too :)
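
A rough sketch of that idea, under stated assumptions: the state file
format (one "REPO<TAB>COMMIT" line per repository) and the helper names
load_state, save_state and log_args are invented here for illustration.
Only the "A..B" revision range syntax of git log is relied upon.

```perl
use v5.16;
use warnings;

# Read the last seen commit per repository; a missing file means that
# no state was saved yet and a full scan is needed.
sub load_state {
    my ($file) = @_;
    my %last;
    open(my $fh, "<", $file) or return \%last;
    while (<$fh>) {
        chomp;
        my ($repo, $commit) = split(/\t/, $_, 2);
        $last{$repo} = $commit  if defined $commit;
    }
    close($fh);
    return \%last;
}

# Write the state to a temporary file and rename it into place, so an
# interrupted run cannot leave a truncated state file behind.
sub save_state {
    my ($file, $last) = @_;
    open(my $fh, ">", "$file.tmp")
        or die "$0: couldn't create $file.tmp: $!\n";
    print $fh "$_\t$last->{$_}\n"  for sort keys %$last;
    close($fh) or die "$0: couldn't close $file.tmp: $!\n";
    rename("$file.tmp", $file)
        or die "$0: couldn't rename $file.tmp: $!\n";
}

# Arguments for "git log" that list only the commits after $commit,
# or the whole history when $commit is undefined.
sub log_args {
    my ($commit) = @_;
    my @args = qw(log --reverse --raw --format=%H:%at --no-abbrev);
    push @args, "$commit..HEAD"  if defined $commit;
    return @args;
}
```

The commit hash would have to be saved only after every blob of that
commit was delivered; otherwise a failed run could skip messages.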

>     my $time;
>     while (<$git_fh>) {
>         chomp;
>         if (/^[0-9a-f]{40}:(\d+)$/) {
>             $time = $1;
>         }
>         elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
>             my $blob = $1;
> 
>             next  if $have{$blob};
>             deliver($pi, $blob, $time);
>         }   
>     }
>     close($git_fh);
> } else { # V2 format
>     my @repos = glob("$pi/[0-9]*.git");

Keep in mind that's lexically sorted, so order becomes
unexpected when 10.git rolls around (LKML is at 8.git, now)
I don't believe order really matters, though...
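
If the order ever does matter, the epochs could be sorted by number
instead.  A small sketch; sort_epochs is a hypothetical helper and it
assumes the directories are named N.git as in the glob above:

```perl
use v5.16;
use warnings;
use File::Basename qw(basename);

# Order V2 epoch repositories by their number rather than lexically,
# so that 10.git comes after 9.git.  Names that do not look like
# N.git are treated as epoch 0.
sub sort_epochs {
    my @repos = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ basename($_) =~ /^(\d+)\.git$/ ? $1 : 0, $_ ] } @repos;
}
```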

>     unless (@repos) {
>         die "$0: no V2 *.git repositories found.\n";
>     }
> 
>     for my $repo (@repos) {
>         my %authored_at;
>         
>         # We store the commit dates per tree(!), not per commit,
>         # because "git rev-list --objects" will print only the tree.
>         # But trees are as unique as messages.
>         open(my $git_fh, "-|", "git", "-C", $repo,
>              qw(log --pretty=%T:%at @));
>         while (<$git_fh>) {
>             chomp;
>             my ($ref, $time) = split(':');
>             $authored_at{$ref} = $time;
>         }
>         close($git_fh);

That's a bit odd and seems unnecessary.  Using
(log --raw --format=%H:%at --no-abbrev) similar to how you do
with the v1 code should still work for v2.

>         open($git_fh, "-|", "git", "-C", $repo,
>              qw(rev-list --reverse --objects @));
>         my $time;
>         my $blob;
>         while (<$git_fh>) {
>             chomp;
>             if (/ $/) {
>                 $time = $authored_at{$`};
>             }
>             elsif (/ m$/) {
>                 $blob = $`;
>                 
>                 next  if $have{$blob};
>                 deliver($repo, $blob, $time);

Being an old Perl user myself, $` is one of those things I
quickly learned to avoid (and promptly forgot the exact meaning
of, so I had to check the perlvar manpage).  It appears the
performance problems associated with $` are no longer relevant in
newer Perls.

However, using /([0-9a-f]{40})/ and $1 as you do with the v1
code can improve readability a bit.

In general, I think the v2 code path should be more similar to
the v1 path.

And I've been meaning to change {40} to {40,} in places to
deal with the git SHA-256 (and beyond) support.
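
Both points could look roughly like this for lines of
"git rev-list --objects" output (after chomp).  The helper name
parse_object_line is made up for illustration:

```perl
use v5.16;
use warnings;

# Classify one line of "git rev-list --objects" output using captures
# instead of $`.  {40,} accepts SHA-1 names (40 hex digits) as well as
# longer ones such as SHA-256 (64 hex digits).
sub parse_object_line {
    my ($line) = @_;
    return (tree => $1)  if $line =~ /^([0-9a-f]{40,}) $/;
    return (blob => $1)  if $line =~ /^([0-9a-f]{40,}) m$/;
    return;    # commit lines and other paths are ignored
}
```

The ",B=" file name pattern would need the same {40,} change for
already delivered SHA-256 blobs to be recognized.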

>             }
>             # XXX 'd' (= deleted) messages are not respected.
>         }
>         
>         close($git_fh);
>     }
> }
> 
> exit $status;
> \f
> 
> Enjoy,

Thanks for sharing!


* Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
  2020-04-15 21:21 ` Eric Wong
@ 2020-04-15 21:48   ` Leah Neukirchen
  0 siblings, 0 replies; 3+ messages in thread
From: Leah Neukirchen @ 2020-04-15 21:48 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@yhbt.net> writes:

> Cool.  There's some latency introduced by the NNTP client/server
> and maybe parallelizing/pipelining can help, a bit; but raw
> access to git will always be faster.

It also transfers quicker due to packing etc.

>> for my $mail (glob("$md/cur/*:2*")) {
>>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>>         $have{$1} = 1;
>>     }
>> }
>
> That glob is going to be a problem with bigger Maildirs; but
> then again giant Maildirs aren't great at all.
>
> GLOB_NOSORT can alleviate some of the slowness:
>   https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

Good point, I implemented this.

> opendir + readdir may also be used (probably slower with more
> ops, but it would use less memory).

For my 200k+ INBOX, Perl needs 64M for the glob; negligible these days.
But GLOB_NOSORT saves almost 30%. 

> Maildir allows $md/$foo_state_file(s), and I'd probably
> store the most recent rev-list/log state there, instead, (more
> below)

Yes, with more explicit state everything gets easier, seemingly. :)
I wanna avoid it now.  (Also, the Git reflog can be used to only
look at changes since last fetch.)

> It's possible for an MUA to see a partially-written file in
> $md/cur/$name with the above.
>
> I seem to recall the correct way to write to Maildirs being:
>
> 1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
> 2) only the MUA is allowed to rename() files from /new/ => /cur/

True, this will need a bit more code.

>> if (-f "$pi/HEAD") { # V1 format
>>     open(my $git_fh, "-|", "git", "-C", $pi,
>>          qw(log --reverse --raw --format=%H:%at --no-abbrev));
>
> Not sure what --reverse buys, here; but this is another
> place where things get slower as more messages are added.

It will write the files in historical order, perhaps that yields some
performance benefits later.

> Keep in mind that's lexically sorted, so order becomes
> unexpected when 10.git rolls around (LKML is at 8.git, now)
> I don't believe order really matters, though...

I don't think it matters.

> That's a bit odd and seems unnecessary.  Using
> (log --raw --format=%H:%at --no-abbrev) similar to how you do
> with the v1 code should still work for v2.

Yes, I wrote V2 first and then realized V1 is easier and works just as well.

Current version:

\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use File::Glob ':bsd_glob';
use autodie qw(open close);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my ($pi, $md) = @ARGV;

my $hostname = hostname;
my $flags = "S";

my $status = 0;

my %have;

for my $mail (bsd_glob("$md/cur/*:2*", GLOB_NOSORT)) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

    my $mail = "$md/cur/$name";
                
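    my $pid = fork;
    # fork is no longer covered by autodie; undef == 0 would be true.
    die "$0: couldn't fork: $!\n"  unless defined $pid;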
    if ($pid == 0) {
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL)
            or die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob
            or die "$0: couldn't exec git: $!\n";
    }
    else {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

my @repos;

if (-f "$pi/HEAD") {            # V1 format
    push @repos, $pi;
} else {                        # V2 format
    push @repos, bsd_glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }
}

for my $repo (@repos) {
    open(my $git_fh, "-|", "git", "-C", $repo,
         qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[\da-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:\d{6} \d+ [\da-f]+ ([\da-f]{40}) ([AM]\tm$|A\t[\da-f]{2}\/)/) {
            my $blob = $1;
            next  if $have{$blob};
            deliver($repo, $blob, $time);
        }
    }
    close($git_fh);
}

exit $status;
\f

cu,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org



Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).