From: Leah Neukirchen <leah@vuxu.org>
To: Eric Wong <e@yhbt.net>
Cc: meta@public-inbox.org
Subject: Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Date: Wed, 15 Apr 2020 23:48:26 +0200
Message-ID: <87v9m0l8t1.fsf@vuxu.org>
In-Reply-To: <20200415212106.GA6284@dcvr> (Eric Wong's message of "Wed, 15 Apr 2020 21:21:06 +0000")
Eric Wong <e@yhbt.net> writes:
> Cool. There's some latency introduced by the NNTP client/server
> and maybe parallelizing/pipelining can help, a bit; but raw
> access to git will always be faster.
It also transfers quicker due to packing etc.
>> for my $mail (glob("$md/cur/*:2*")) {
>> if ($mail =~ /,B=([0-9a-f]{40})/) {
>> $have{$1} = 1;
>> }
>> }
>
> That glob is going to be a problem with bigger Maildirs; but
> then again giant Maildirs aren't great at all.
>
> GLOB_NOSORT can alleviate some of the slowness:
> https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/
Good point, I implemented this.
> opendir + readdir may also be used (probably slower with more
> ops, but it would use less memory).
For my 200k+ INBOX, Perl needs 64M for the glob; negligible these days.
But GLOB_NOSORT saves almost 30%.
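The opendir + readdir alternative mentioned above could look roughly like this. It is a minimal sketch, not part of impibe: `scan_maildir` is a hypothetical helper name, and it assumes the same `,B=<blob>` file name convention and `:2` marker that the glob loop relies on.

```perl
use strict;
use warnings;

# Collect the blob hashes already present in $md/cur by reading the
# directory one entry at a time, instead of building the full file list
# in memory first as glob does.
# (scan_maildir is a hypothetical name, not part of impibe.)
sub scan_maildir {
    my ($md) = @_;
    my %have;
    opendir(my $dh, "$md/cur")
        or die "$0: couldn't open $md/cur: $!\n";
    while (defined(my $name = readdir $dh)) {
        # Mirror the "*:2*" glob: only consider names with a ":2" marker.
        next unless $name =~ /:2/;
        $have{$1} = 1 if $name =~ /,B=([0-9a-f]{40})/;
    }
    closedir $dh;
    return \%have;
}
```

The result is a hash reference, so a caller would check `$have->{$blob}` rather than `$have{$blob}`.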
> Maildir allows $md/$foo_state_file(s), and I'd probably
> store the most recent rev-list/log state there, instead, (more
> below)
Yes, with more explicit state everything gets easier, seemingly. :)
I want to avoid it for now. (Also, the Git reflog can be used to look
only at changes since the last fetch.)
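For illustration, the explicit-state approach suggested above might be sketched as follows. Nothing here is part of impibe: the state file name `impibe-state.*` and the helpers `state_file`, `read_state`, `write_state` and `log_args` are all hypothetical, and the sketch assumes one recorded commit per repository is enough.

```perl
use strict;
use warnings;

# Hypothetical location of the per-repository state file inside the Maildir.
sub state_file {
    my ($md, $repo) = @_;
    (my $key = $repo) =~ s{[^A-Za-z0-9._-]}{_}g;
    return "$md/impibe-state.$key";
}

# Return the last recorded commit ID, or nothing if there is none yet.
sub read_state {
    my ($md, $repo) = @_;
    open(my $fh, "<", state_file($md, $repo)) or return;
    my $commit = <$fh>;
    close $fh;
    return unless defined $commit && $commit =~ /^([0-9a-f]{40})$/;
    return $1;
}

# Record $commit, writing a temporary file first and renaming it into place.
sub write_state {
    my ($md, $repo, $commit) = @_;
    my $file = state_file($md, $repo);
    open(my $fh, ">", "$file.tmp")
        or die "$0: couldn't create $file.tmp: $!\n";
    print {$fh} "$commit\n" or die "$0: couldn't write $file.tmp: $!\n";
    close $fh or die "$0: couldn't close $file.tmp: $!\n";
    rename("$file.tmp", $file)
        or die "$0: couldn't rename $file.tmp: $!\n";
}

# Arguments for git: the same log invocation as the script, limited to
# commits after the recorded one when a state exists.
sub log_args {
    my ($last) = @_;
    my @args = qw(log --reverse --raw --format=%H:%at --no-abbrev);
    push @args, "$last..HEAD" if defined $last;
    return @args;
}
```

With `--reverse`, the last commit line read from the log is the newest one, so that would be the ID to pass to `write_state` after a successful run.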
> It's possible for an MUA to see a partially-written file in
> $md/cur/$name with the above.
>
> I seem to recall the correct way to write to Maildirs being:
>
> 1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
> 2) only the MUA is allowed to rename() files from /new/ => /cur/
True, this will need a bit more code.
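The two-step delivery described above could be sketched like this. It is only an illustration under simplifying assumptions: `deliver_atomically` is a hypothetical helper, and it takes the message as a string instead of streaming it from `git cat-file` as the script does.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Write the message to $md/tmp/$name first, then rename it into
# $md/new/$name, so a reader never sees a partially written file.
# (deliver_atomically is a hypothetical name, not part of impibe.)
sub deliver_atomically {
    my ($md, $name, $content) = @_;
    my $tmp = "$md/tmp/$name";
    my $new = "$md/new/$name";
    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "$0: couldn't create $tmp: $!\n";
    binmode $fh;
    print {$fh} $content or die "$0: couldn't write $tmp: $!\n";
    close $fh or die "$0: couldn't close $tmp: $!\n";
    rename($tmp, $new)
        or die "$0: couldn't rename $tmp to $new: $!\n";
    return $new;
}
```

Under the usual Maildir convention, names in new/ carry no `:2,` flag suffix, so the "S" flag would not be set this way, and the scan for existing blobs would have to look in new/ as well as cur/.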
>> if (-f "$pi/HEAD") { # V1 format
>> open(my $git_fh, "-|", "git", "-C", $pi,
>> qw(log --reverse --raw --format=%H:%at --no-abbrev));
>
> Not sure what --reverse buys, here; but this is another
> place where things get slower as more messages are added.
It writes the files in historical order; perhaps that yields some
performance benefits later.
> Keep in mind that's lexically sorted, so order becomes
> unexpected when 10.git rolls around (LKML is at 8.git, now)
> I don't believe order really matters, though...
I don't think it matters.
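Should the order ever matter, the V2 epochs could be sorted numerically rather than lexically. A minimal sketch, with `sort_epochs` as a hypothetical helper that is not part of impibe:

```perl
use strict;
use warnings;

# Order V2 epoch directories (0.git, 1.git, ..., 10.git) by epoch number,
# so that 10.git sorts after 9.git instead of after 1.git.
# (sort_epochs is a hypothetical name, not part of impibe.)
sub sort_epochs {
    my @repos = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ (m{(\d+)\.git/?\z} ? $1 : 0), $_ ] } @repos;
}
```

It would wrap the existing glob, for example `push @repos, sort_epochs(bsd_glob("$pi/[0-9]*.git"));`.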
> That's a bit odd and seems unnecessary. Using
> (log --raw --format=%H:%at --no-abbrev) similar to how you do
> with the v1 code should still work for v2.
Yes, I wrote V2 first and then realized V1 is easier and works just as well.
Current version:
\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/
use v5.16;
use Sys::Hostname;
use File::Glob ':bsd_glob';
use autodie qw(open close);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}
my ($pi, $md) = @ARGV;
my $hostname = hostname;
my $flags = "S";
my $status = 0;
my %have;
for my $mail (bsd_glob("$md/cur/*:2*", GLOB_NOSORT)) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}
sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;
    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;
    my $mail = "$md/cur/$name";
    my $pid = fork;
    # fork returns undef on failure; without this check the parent would
    # fall into the child branch and exec git itself.
    die "$0: couldn't fork: $!\n" unless defined $pid;
    if ($pid == 0) {
        # Child: make the new file stdout, then let git write the blob to it.
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL)
            or die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob
            or die "$0: couldn't exec git: $!\n";
    }
    else {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}
my @repos;
if (-f "$pi/HEAD") { # V1 format
    push @repos, $pi;
} else { # V2 format
    push @repos, bsd_glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }
}
for my $repo (@repos) {
    open(my $git_fh, "-|", "git", "-C", $repo,
        qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[\da-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:\d{6} \d+ [\da-f]+ ([\da-f]{40}) ([AM]\tm$|A\t[\da-f]{2}\/)/) {
            my $blob = $1;
            next if $have{$blob};
            deliver($repo, $blob, $time);
        }
    }
    close($git_fh);
}
exit $status;
\f
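The `--raw` line match in the loop above is dense, so here it is isolated as a small sketch. `message_blob` is a hypothetical wrapper around the same regular expression. The reading of the two alternatives is an assumption about the on-disk formats: V2 stores each message as a file named `m`, and V1 adds each message under a two-hex-digit directory.

```perl
use strict;
use warnings;

# A --raw line has the shape
#   :<old mode> <new mode> <old blob> <new blob> <status><TAB><path>
# Return the new blob ID if the line adds or modifies "m" (assumed V2
# layout) or adds a file under a two-hex-digit directory (assumed V1
# layout); otherwise return nothing.
# (message_blob is a hypothetical name, not part of impibe.)
sub message_blob {
    my ($line) = @_;
    return $1
        if $line =~ /^:\d{6} \d+ [\da-f]+ ([\da-f]{40}) ([AM]\tm$|A\t[\da-f]{2}\/)/;
    return;
}
```

Deletions (status `D`) and any other paths fall through, which is why the script never delivers them.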
cu,
--
Leah Neukirchen <leah@vuxu.org> https://leahneukirchen.org