user/dev discussion of public-inbox itself
From: Leah Neukirchen <leah@vuxu.org>
To: Eric Wong <e@yhbt.net>
Cc: meta@public-inbox.org
Subject: Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Date: Wed, 15 Apr 2020 23:48:26 +0200
Message-ID: <87v9m0l8t1.fsf@vuxu.org>
In-Reply-To: <20200415212106.GA6284@dcvr> (Eric Wong's message of "Wed, 15 Apr 2020 21:21:06 +0000")

Eric Wong <e@yhbt.net> writes:

> Cool.  There's some latency introduced by the NNTP client/server
> and maybe parallelizing/pipelining can help, a bit; but raw
> access to git will always be faster.

It also transfers quicker due to packing etc.

>> for my $mail (glob("$md/cur/*:2*")) {
>>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>>         $have{$1} = 1;
>>     }
>> }
>
> That glob is going to be a problem with bigger Maildirs; but
> then again giant Maildirs aren't great at all.
>
> GLOB_NOSORT can alleviate some of the slowness:
>   https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

Good point, I implemented this.

> opendir + readdir may also be used (probably slower with more
> ops, but it would use less memory).

For my 200k+ INBOX, Perl needs 64M for the glob; negligible these days.
But GLOB_NOSORT saves almost 30%. 
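
For comparison, the opendir + readdir variant could look roughly like
this.  It is only a sketch, not what impibe does: the helper name
scan_have is made up, and it assumes the ":2" info marker directly
follows the blob hash, as in the file names impibe generates.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of impibe): collect the blob hashes
# embedded in the file names of $md/cur, reading one directory entry
# at a time instead of building the full file list first.
sub scan_have {
    my ($md) = @_;
    my %have;
    opendir(my $dh, "$md/cur")
        or die "$0: couldn't opendir $md/cur: $!\n";
    while (defined(my $name = readdir $dh)) {
        # ",B=" followed by a 40-hex blob id and the ":2" info marker.
        $have{$1} = 1 if $name =~ /,B=([0-9a-f]{40}):2/;
    }
    closedir $dh;
    return \%have;
}
```

Used as "my $have = scan_have($md);" and then checked with
"$have->{$blob}" in place of the %have lookups.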

> Maildir allows $md/$foo_state_file(s), and I'd probably
> store the most recent rev-list/log state there, instead, (more
> below)

Yes, with more explicit state everything gets easier, seemingly. :)
I want to avoid it for now.  (Also, the Git reflog can be used to only
look at changes since last fetch.)
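
For illustration, explicit state could look roughly like the
following.  This is a sketch and not part of impibe: the file name
".impibe-state", its tab-separated format, and the helpers load_state,
save_state and log_args are all made up here.  It also assumes history
is only ever appended to, so that "$last..HEAD" stays a valid range.

```perl
use strict;
use warnings;

# Read "repo<TAB>commit" lines into a hash; a missing file simply
# means there is no state yet.
sub load_state {
    my ($file) = @_;
    my %last;
    open(my $fh, "<", $file) or return \%last;
    while (my $line = <$fh>) {
        chomp $line;
        my ($repo, $commit) = split /\t/, $line, 2;
        $last{$repo} = $commit
            if defined $commit && $commit =~ /^[0-9a-f]{40}$/;
    }
    close $fh;
    return \%last;
}

# Write the state to a temporary file and rename() it into place, so
# an interrupted run does not leave a half-written state file behind.
sub save_state {
    my ($file, $last) = @_;
    open(my $fh, ">", "$file.tmp")
        or die "$0: couldn't write $file.tmp: $!\n";
    print {$fh} "$_\t$last->{$_}\n" for sort keys %$last;
    close $fh or die "$0: couldn't close $file.tmp: $!\n";
    rename("$file.tmp", $file)
        or die "$0: couldn't rename $file.tmp: $!\n";
}

# Build the git log arguments: only commits after the recorded one,
# or the whole history when nothing is recorded for this repo.
sub log_args {
    my ($last_commit) = @_;
    my @args = qw(log --reverse --raw --format=%H:%at --no-abbrev);
    push @args, "$last_commit..HEAD" if defined $last_commit;
    return @args;
}
```

After a successful run, the output of "git rev-parse HEAD" for each
repository would be stored with save_state("$md/.impibe-state", ...).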

> It's possible for an MUA to see a partially-written file in
> $md/cur/$name with the above.
>
> I seem to recall the correct way to write to Maildirs being:
>
> 1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
> 2) only the MUA is allowed to rename() files from /new/ => /cur/

True, this will need a bit more code.
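
A minimal sketch of the tmp/ then new/ sequence described above, not
what the script below does yet.  The helper name deliver_safely is
made up, and it takes the message as a string for simplicity, whereas
impibe streams it from git cat-file.  Note that names in new/ normally
carry no ":2,flags" suffix, so the scan for existing blobs would have
to look at new/ as well.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Hypothetical helper: write $content under $md/tmp, then link the
# finished file into $md/new.  Readers never see a partial message,
# because the name only appears in new/ once the file is complete.
sub deliver_safely {
    my ($md, $name, $content) = @_;
    my $tmp = "$md/tmp/$name";
    my $new = "$md/new/$name";

    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "$0: couldn't create $tmp: $!\n";
    print {$fh} $content or die "$0: couldn't write $tmp: $!\n";
    close $fh or die "$0: couldn't close $tmp: $!\n";

    # link() fails if $new already exists, so nothing is overwritten.
    link($tmp, $new) or die "$0: couldn't link $tmp to $new: $!\n";
    unlink($tmp) or die "$0: couldn't unlink $tmp: $!\n";
    return $new;
}
```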

>> if (-f "$pi/HEAD") { # V1 format
>>     open(my $git_fh, "-|", "git", "-C", $pi,
>>          qw(log --reverse --raw --format=%H:%at --no-abbrev));
>
> Not sure what --reverse buys, here; but this is another
> place where things get slower as more messages are added.

It writes the files in historical order; perhaps that yields some
performance benefits later.

> Keep in mind that's lexically sorted, so order becomes
> unexpected when 10.git rolls around (LKML is at 8.git, now)
> I don't believe order really matters, though...

I don't think it matters.
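
Should the order ever matter, the epochs could be sorted by their
numeric prefix instead of lexically.  A small sketch; the helper name
sort_epochs is made up and impibe does not do this:

```perl
use strict;
use warnings;

# Sort V2 epoch directories ("0.git", "1.git", ..., "10.git") by their
# numeric prefix.  Paths that do not end in "<digits>.git" are dropped.
sub sort_epochs {
    my @repos = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { /(\d+)\.git\z/ ? [$1, $_] : () } @repos;
}
```

It would wrap the glob result, as in
"push @repos, sort_epochs(bsd_glob("$pi/[0-9]*.git"));".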

> That's a bit odd and seems unnecessary.  Using
> (log --raw --format=%H:%at --no-abbrev) similar to how you do
> with the v1 code should still work for v2.

Yes, I wrote V2 first and then realized V1 is easier and works just as well.

Current version:

\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use File::Glob ':bsd_glob';
use autodie qw(open close);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my ($pi, $md) = @ARGV;

my $hostname = hostname;
my $flags = "S";

my $status = 0;

my %have;

for my $mail (bsd_glob("$md/cur/*:2*", GLOB_NOSORT)) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

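    my $mail = "$md/cur/$name";

    my $pid = fork;
    # fork returns undef on failure; without this check the parent would
    # take the child branch below and exec git itself.
    defined $pid or die "$0: couldn't fork: $!\n";
    if ($pid == 0) {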
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL)
            or die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob
            or die "$0: couldn't exec git: $!\n";
    }
    else {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

my @repos;

if (-f "$pi/HEAD") {            # V1 format
    push @repos, $pi;
} else {                        # V2 format
    push @repos, bsd_glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }
}

for my $repo (@repos) {
    open(my $git_fh, "-|", "git", "-C", $repo,
         qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[\da-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:\d{6} \d+ [\da-f]+ ([\da-f]{40}) ([AM]\tm$|A\t[\da-f]{2}\/)/) {
            my $blob = $1;
            next  if $have{$blob};
            deliver($repo, $blob, $time);
        }
    }
    close($git_fh);
}

exit $status;
\f

cu,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org

Thread overview: 3+ messages
2020-04-15 20:06 Leah Neukirchen
2020-04-15 21:21 ` Eric Wong
2020-04-15 21:48   ` Leah Neukirchen [this message]