* impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
@ 2020-04-15 20:06 Leah Neukirchen
  2020-04-15 21:21 ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Leah Neukirchen @ 2020-04-15 20:06 UTC
  To: meta

Hi *,

I wrote a small script today (see below), and maybe it is useful for
you, too. The task was to incrementally add messages from a
public-inbox V1 and V2 repo on local disk to a Maildir. Note that
"ssoma sync" only does V1.

impibe encodes the seen blob hashes into Maildir file names, so only
new messages are added on runs afterwards. Speed is quite good.
The only dependencies are git and perl.

This allows having local Maildir mirrors of public-inbox clones, for
example to use with other Maildir indexers or MUA. It is way faster
than using NNTP to sync up.

I don't have much Perl experience, so please tell me any problems with
my code.

\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use autodie qw(open close fork);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my $pi = $ARGV[0];
my $md = $ARGV[1];

my $hostname = hostname;
my $flags = "S";

my %have;

for my $mail (glob("$md/cur/*:2*")) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

my $status = 0;

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

    my $mail = "$md/cur/$name";

    my $pid = fork;
    if ($pid == 0) {
        no autodie 'sysopen';
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
            die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob
    }
    elsif (defined $pid) {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

if (-f "$pi/HEAD") {            # V1 format
    open(my $git_fh, "-|", "git", "-C", $pi,
        qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[0-9a-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
            my $blob = $1;

            next if $have{$blob};
            deliver($pi, $blob, $time);
        }
    }
    close($git_fh);
} else {                        # V2 format
    my @repos = glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }

    for my $repo (@repos) {
        my %authored_at;

        # We store the commit dates per tree(!), not per commit,
        # because "git rev-list --objects" will print only the tree.
        # But trees are as unique as messages.
        open(my $git_fh, "-|", "git", "-C", $repo,
            qw(log --pretty=%T:%at @));
        while (<$git_fh>) {
            chomp;
            my ($ref, $time) = split(':');
            $authored_at{$ref} = $time;
        }
        close($git_fh);

        open($git_fh, "-|", "git", "-C", $repo,
            qw(rev-list --reverse --objects @));
        my $time;
        my $blob;
        while (<$git_fh>) {
            chomp;
            if (/ $/) {
                $time = $authored_at{$`};
            }
            elsif (/ m$/) {
                $blob = $`;

                next if $have{$blob};
                deliver($repo, $blob, $time);
            }
            # XXX 'd' (= deleted) messages are not respected.
        }

        close($git_fh);
    }
}

exit $status;
\f

Enjoy,
-- 
Leah Neukirchen <leah@vuxu.org> https://leahneukirchen.org/
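
The file-name scheme described in the message above can be tried in isolation. What follows is only a minimal sketch, not part of impibe: the helper names maildir_name and blob_of, the host name and the blob hash are made up for illustration, while the sprintf format and the regex are copied from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a Maildir file name with the same sprintf format as deliver().
# (maildir_name is a hypothetical helper, not part of impibe.)
sub maildir_name {
    my ($time, $pid, $seq, $host, $blob, $flags) = @_;
    return sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $pid, $seq, $host, $blob, $flags;
}

# Recover the blob hash from a file name with the same regex the
# script uses to fill %have; returns undef if no hash is embedded.
sub blob_of {
    my ($name) = @_;
    return $name =~ /,B=([0-9a-f]{40})/ ? $1 : undef;
}

# Made-up values, for illustration only.
my $blob = "0123456789abcdef0123456789abcdef01234567";
my $name = maildir_name(1586981160, 4242, 1, "example.host", $blob, "S");

print "$name\n";
print blob_of($name), "\n";
```

Because the hash sits before the ":2," info part, it should survive an MUA changing the flags, as long as the MUA keeps the base name when it renames the file.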
* Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
  2020-04-15 20:06 impibe: incrementally add messages from public-inbox V1/V2 to a Maildir Leah Neukirchen
@ 2020-04-15 21:21 ` Eric Wong
  2020-04-15 21:48   ` Leah Neukirchen
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2020-04-15 21:21 UTC
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi *,
>
> I wrote a small script today (see below), and maybe it is useful for
> you, too. The task was to incrementally add messages from a
> public-inbox V1 and V2 repo on local disk to a Maildir. Note that
> "ssoma sync" only does V1.
>
> impibe encodes the seen blob hashes into Maildir file names, so only
> new messages are added on runs afterwards. Speed is quite good.
> The only dependencies are git and perl.
>
> This allows having local Maildir mirrors of public-inbox clones, for
> example to use with other Maildir indexers or MUA. It is way faster
> than using NNTP to sync up.

Cool. There's some latency introduced by the NNTP client/server
and maybe parallelizing/pipelining can help, a bit; but raw
access to git will always be faster.

> I don't have much Perl experience, so please tell me any problems with
> my code.

No problem. Nothing really Perl-specific, most of my comments
would apply to other languages.

> \f
> #!/usr/bin/perl -w
> # impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
> #
> # To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
> # all copyright and related or neighboring rights to this work.
> #
> # http://creativecommons.org/publicdomain/zero/1.0/
>
> use v5.16;
> use Sys::Hostname;
> use autodie qw(open close fork);
> use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
>
> if (@ARGV < 2) {
>     die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
> }
>
> my $pi = $ARGV[0];
> my $md = $ARGV[1];
>
> my $hostname = hostname;
> my $flags = "S";
>
> my %have;
>
> for my $mail (glob("$md/cur/*:2*")) {
>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>         $have{$1} = 1;
>     }
> }

That glob is going to be a problem with bigger Maildirs; but
then again giant Maildirs aren't great at all.

GLOB_NOSORT can alleviate some of the slowness:
https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

opendir + readdir may also be used (probably slower with more
ops, but it would use less memory).

Maildir allows $md/$foo_state_file(s), and I'd probably
store the most recent rev-list/log state there, instead, (more
below)

> my $status = 0;
>
> sub deliver {
>     my ($repo, $blob, $time) = @_;
>     state $delivery = 0;
>
>     $delivery++;
>     # Embed the blob hash into the file name so we can check easily
>     # which blobs we have already.
>     my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
>         $time, $$, $delivery, $hostname, $blob, $flags;
>
>     my $mail = "$md/cur/$name";
>
>     my $pid = fork;
>     if ($pid == 0) {
>         no autodie 'sysopen';
>         sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
>             die "$0: couldn't create $md/cur/$name: $!\n";
>         exec "git", "-C", "$repo", "cat-file", "blob", $blob
>     }
>     elsif (defined $pid) {
>         waitpid $pid, 0;
>         if ($? == 0) {
>             say $mail;
>         } else {
>             $status = 1;
>         }
>     }

It's possible for an MUA to see a partially-written file in
$md/cur/$name with the above.

I seem to recall the correct way to write to Maildirs being:

1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
2) only the MUA is allowed to rename() files from /new/ => /cur/

That would require more globbing, or capturing the most recent
commit hash and storing it in $foo_state_file

> if (-f "$pi/HEAD") {            # V1 format
>     open(my $git_fh, "-|", "git", "-C", $pi,
>         qw(log --reverse --raw --format=%H:%at --no-abbrev));

Not sure what --reverse buys, here; but this is another
place where things get slower as more messages are added.

Storing the most recent commit hash would allow avoiding the
memory overhead of giant %have and %authored_at tables; that
will improve fork() performance on Linux.

Being able to use shorter filenames might save some space in the
kernel dentry cache, too :)

>     my $time;
>     while (<$git_fh>) {
>         chomp;
>         if (/^[0-9a-f]{40}:(\d+)$/) {
>             $time = $1;
>         }
>         elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
>             my $blob = $1;
>
>             next if $have{$blob};
>             deliver($pi, $blob, $time);
>         }
>     }
>     close($git_fh);
> } else {                        # V2 format
>     my @repos = glob("$pi/[0-9]*.git");

Keep in mind that's lexically sorted, so order becomes
unexpected when 10.git rolls around (LKML is at 8.git, now)
I don't believe order really matters, though...

>     unless (@repos) {
>         die "$0: no V2 *.git repositories found.\n";
>     }
>
>     for my $repo (@repos) {
>         my %authored_at;
>
>         # We store the commit dates per tree(!), not per commit,
>         # because "git rev-list --objects" will print only the tree.
>         # But trees are as unique as messages.
>         open(my $git_fh, "-|", "git", "-C", $repo,
>             qw(log --pretty=%T:%at @));
>         while (<$git_fh>) {
>             chomp;
>             my ($ref, $time) = split(':');
>             $authored_at{$ref} = $time;
>         }
>         close($git_fh);

That's a bit odd and seems unnecessary. Using
(log --raw --format=%H:%at --no-abbrev) similar to how you do
with the v1 code should still work for v2.

>         open($git_fh, "-|", "git", "-C", $repo,
>             qw(rev-list --reverse --objects @));
>         my $time;
>         my $blob;
>         while (<$git_fh>) {
>             chomp;
>             if (/ $/) {
>                 $time = $authored_at{$`};
>             }
>             elsif (/ m$/) {
>                 $blob = $`;
>
>                 next if $have{$blob};
>                 deliver($repo, $blob, $time);

Being an old Perl user myself, $` is one of those things I
quickly learned to avoid (and promptly forgot the exact meaning
of, so I had to check the perlvar manpage). It appears the
performance problems associated with $` is no longer relevant in
newer Perls.

However, using /([0-9a-f]{40})/ and $1 as you do with the v1
code can improve readability a bit. In general, I think the v2
code path should be more similar to the v1 path.

And I've been meaning to change {40} to {40,} in places to deal
with the git SHA-256 (and beyond) support.

>             }
>             # XXX 'd' (= deleted) messages are not respected.
>         }
>
>         close($git_fh);
>     }
> }
>
> exit $status;
> \f
>
> Enjoy,

Thanks for sharing!
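
The tmp/ then new/ delivery order recalled in the reply above could look roughly like the following. This is a sketch under assumptions, not a patch to impibe: deliver_atomically is a hypothetical helper, the message content is passed in as a string instead of being piped from "git cat-file", and the demo runs in a throwaway directory created with File::Temp.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use File::Temp qw(tempdir);

# Write $content completely into tmp/, then rename() it into new/,
# so a reader of new/ never sees a partially written file.
# (deliver_atomically is a hypothetical helper, not part of impibe.)
sub deliver_atomically {
    my ($md, $name, $content) = @_;
    my $tmp = "$md/tmp/$name";
    my $new = "$md/new/$name";

    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "couldn't create $tmp: $!\n";
    print {$fh} $content or die "couldn't write $tmp: $!\n";
    close($fh) or die "couldn't close $tmp: $!\n";

    # rename() replaces its target atomically on the same filesystem.
    rename($tmp, $new) or die "couldn't rename $tmp to $new: $!\n";
    return $new;
}

# Demo in a temporary Maildir; the file name is made up.
my $md = tempdir(CLEANUP => 1);
for my $sub (qw(tmp new cur)) {
    mkdir "$md/$sub" or die "couldn't mkdir $md/$sub: $!\n";
}
my $name = "1586981160.P04242Q1.example.host";
my $path = deliver_atomically($md, $name, "Subject: test\n\nhello\n");
print "$path\n";
```

Files in new/ carry no ":2," flag suffix, so a script that keeps the hash-in-file-name approach would have to scan new/ as well as cur/, which is the extra globbing mentioned above. Note also that rename() silently replaces an existing target, whereas link() followed by unlink() would fail instead.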
* Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
  2020-04-15 21:21 ` Eric Wong
@ 2020-04-15 21:48   ` Leah Neukirchen
  0 siblings, 0 replies; 3+ messages in thread
From: Leah Neukirchen @ 2020-04-15 21:48 UTC
  To: Eric Wong; +Cc: meta

Eric Wong <e@yhbt.net> writes:

> Cool. There's some latency introduced by the NNTP client/server
> and maybe parallelizing/pipelining can help, a bit; but raw
> access to git will always be faster.

It also transfers quicker due to packing etc.

>> for my $mail (glob("$md/cur/*:2*")) {
>>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>>         $have{$1} = 1;
>>     }
>> }
>
> That glob is going to be a problem with bigger Maildirs; but
> then again giant Maildirs aren't great at all.
>
> GLOB_NOSORT can alleviate some of the slowness:
> https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

Good point, I implemented this.

> opendir + readdir may also be used (probably slower with more
> ops, but it would use less memory).

For my 200k+ INBOX, Perl needs 64M for the glob; negligible these days.
But GLOB_NOSORT saves almost 30%.

> Maildir allows $md/$foo_state_file(s), and I'd probably
> store the most recent rev-list/log state there, instead, (more
> below)

Yes, with more explicit state everything gets easier, seemingly. :)
I wanna avoid it now. (Also, the Git reflog can be used to only look
at changes since last fetch.)

> It's possible for an MUA to see a partially-written file in
> $md/cur/$name with the above.
>
> I seem to recall the correct way to write to Maildirs being:
>
> 1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
> 2) only the MUA is allowed to rename() files from /new/ => /cur/

True, this will need a bit more code.

>> if (-f "$pi/HEAD") {            # V1 format
>>     open(my $git_fh, "-|", "git", "-C", $pi,
>>         qw(log --reverse --raw --format=%H:%at --no-abbrev));
>
> Not sure what --reverse buys, here; but this is another
> place where things get slower as more messages are added.

It will write the files in historical order, perhaps that yields some
performance benefits later.

> Keep in mind that's lexically sorted, so order becomes
> unexpected when 10.git rolls around (LKML is at 8.git, now)
> I don't believe order really matters, though...

I don't think it matters.

> That's a bit odd and seems unnecessary. Using
> (log --raw --format=%H:%at --no-abbrev) similar to how you do
> with the v1 code should still work for v2.

Yes, I wrote V2 first and then realized V1 is easier and works just as
well.

Current version:

\f
#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen <leah@vuxu.org> has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use File::Glob ':bsd_glob';
use autodie qw(open close);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my ($pi, $md) = @ARGV;

my $hostname = hostname;
my $flags = "S";
my $status = 0;

my %have;
for my $mail (bsd_glob("$md/cur/*:2*", GLOB_NOSORT)) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

sub deliver {
    my ($repo, $blob, $time) = @_;
    state $delivery = 0;

    $delivery++;
    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;

    my $mail = "$md/cur/$name";

    my $pid = fork;
    if ($pid == 0) {
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) or
            die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob or
            die "$0: couldn't exec git: $!\n";
    } else {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

my @repos;
if (-f "$pi/HEAD") {            # V1 format
    push @repos, $pi;
} else {                        # V2 format
    push @repos, bsd_glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }
}

for my $repo (@repos) {
    open(my $git_fh, "-|", "git", "-C", $repo,
        qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[\da-f]{40}:(\d+)$/) {
            $time = $1;
        }
        elsif (/^:\d{6} \d+ [\da-f]+ ([\da-f]{40}) ([AM]\tm$|A\t[\da-f]{2}\/)/) {
            my $blob = $1;

            next if $have{$blob};
            deliver($repo, $blob, $time);
        }
    }
    close($git_fh);
}

exit $status;
\f

cu,
-- 
Leah Neukirchen <leah@vuxu.org> https://leahneukirchen.org
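
The unified parsing loop of the current version can be exercised without a repository. The sketch below makes several assumptions: parse_log is a hypothetical helper, the input lines are hand-built with made-up hashes in the shape that "git log --raw --format=%H:%at --no-abbrev" prints, and the regexes use {40,} instead of {40}, following the SHA-256 remark earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return [blob, time] pairs for every added or modified message in
# "git log --raw --format=%H:%at --no-abbrev" style input lines.
# (parse_log is a hypothetical helper, not part of impibe.)
sub parse_log {
    my @lines = @_;
    my ($time, @found);
    for my $line (@lines) {
        if ($line =~ /^[\da-f]{40,}:(\d+)$/) {
            $time = $1;              # commit line: hash:author-time
        }
        elsif ($line =~ /^:\d{6} \d+ [\da-f]+ ([\da-f]{40,}) (?:[AM]\tm$|A\t[\da-f]{2}\/)/) {
            push @found, [$1, $time];    # blob with its commit's time
        }
    }
    return @found;
}

# Made-up sample input; all hashes are placeholders.
my $zero = "0" x 40;
my @sample = (
    ("c" x 40) . ":1586980000",
    "",
    ":000000 100644 $zero " . ("a" x 40) . " A\tm",    # V2: file named "m"
    ("d" x 40) . ":1586981160",
    "",
    ":000000 100644 $zero " . ("b" x 40) . " A\tab/" . ("c" x 38),  # V1-style path
    ":100644 000000 " . ("a" x 40) . " $zero D\tm",    # status D: not matched
);

my @found = parse_log(@sample);
print "$_->[0] $_->[1]\n" for @found;
```

Each blob is paired with the author time of the most recent commit line seen before it, which is the same bookkeeping the script does with its $time variable.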