From: Leah Neukirchen
To: Eric Wong
Cc: meta@public-inbox.org
Subject: Re: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
References: <87zhbcldi4.fsf@vuxu.org> <20200415212106.GA6284@dcvr>
Date: Wed, 15 Apr 2020 23:48:26 +0200
In-Reply-To: <20200415212106.GA6284@dcvr> (Eric Wong's message of
    "Wed, 15 Apr 2020 21:21:06 +0000")
Message-ID: <87v9m0l8t1.fsf@vuxu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain

Eric Wong writes:

> Cool.  There's some latency introduced by the NNTP client/server
> and maybe parallelizing/pipelining can help, a bit; but raw
> access to git will always be faster.

It also transfers quicker due to packing etc.

>> for my $mail (glob("$md/cur/*:2*")) {
>>     if ($mail =~ /,B=([0-9a-f]{40})/) {
>>         $have{$1} = 1;
>>     }
>> }
>
> That glob is going to be a problem with bigger Maildirs; but
> then again giant Maildirs aren't great at all.
>
> GLOB_NOSORT can alleviate some of the slowness:
> https://public-inbox.org/meta/20200111223503.24473-4-e@yhbt.net/

Good point, I implemented this.

> opendir + readdir may also be used (probably slower with more
> ops, but it would use less memory).

For my 200k+ INBOX, Perl needs 64M for the glob; negligible these days.
But GLOB_NOSORT saves almost 30%.

> Maildir allows $md/$foo_state_file(s), and I'd probably
> store the most recent rev-list/log state there, instead, (more
> below)

Yes, with more explicit state everything gets easier, seemingly. :)
I wanna avoid it now.  (Also, the Git reflog can be used to only
look at changes since last fetch.)

> It's possible for an MUA to see a partially-written file in
> $md/cur/$name with the above.
>
> I seem to recall the correct way to write to Maildirs being:
>
> 1) write to "$md/tmp/" first, then link()/rename() it into "$md/new/$name"
> 2) only the MUA is allowed to rename() files from /new/ => /cur/

True, this will need a bit more code.

>> if (-f "$pi/HEAD") { # V1 format
>>     open(my $git_fh, "-|", "git", "-C", $pi,
>>         qw(log --reverse --raw --format=%H:%at --no-abbrev));
>
> Not sure what --reverse buys, here; but this is another
> place where things get slower as more messages are added.

It will write the files in historical order, perhaps that yields
some performance benefits later.

> Keep in mind that's lexically sorted, so order becomes
> unexpected when 10.git rolls around (LKML is at 8.git, now)
> I don't believe order really matters, though...

I don't think it matters.

> That's a bit odd and seems unnecessary.  Using
> (log --raw --format=%H:%at --no-abbrev) similar to how you do
> with the v1 code should still work for v2.

Yes, I wrote V2 first and then realized V1 is easier and works just as
well.

Current version:

#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;

use Sys::Hostname;
use File::Glob ':bsd_glob';
use autodie qw(open close);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my ($pi, $md) = @ARGV;

my $hostname = hostname;
my $flags = "S";

my $status = 0;

my %have;
for my $mail (bsd_glob("$md/cur/*:2*", GLOB_NOSORT)) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

sub deliver {
    my ($repo, $blob, $time) = @_;

    state $delivery = 0;
    $delivery++;

    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;
    my $mail = "$md/cur/$name";

    my $pid = fork;
    # fork returns undef on failure; without this check the parent
    # would fall into the child branch below and exec git itself.
    die "$0: couldn't fork: $!\n" unless defined $pid;
    if ($pid == 0) {
        # Child: point STDOUT at the new file and let git write the blob.
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) or
            die "$0: couldn't create $md/cur/$name: $!\n";
        exec "git", "-C", "$repo", "cat-file", "blob", $blob or
            die "$0: couldn't exec git: $!\n";
    } else {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

my @repos;
if (-f "$pi/HEAD") {            # V1 format
    push @repos, $pi;
} else {                        # V2 format
    push @repos, bsd_glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }
}

for my $repo (@repos) {
    open(my $git_fh, "-|", "git", "-C", $repo,
        qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[\da-f]{40}:(\d+)$/) {
            # Commit line: remember the author timestamp.
            $time = $1;
        } elsif (/^:\d{6} \d+ [\da-f]+ ([\da-f]{40}) ([AM]\tm$|A\t[\da-f]{2}\/)/) {
            # Raw diff line: V1 adds files under xx/, V2 adds or modifies "m".
            my $blob = $1;
            next if $have{$blob};
            deliver($repo, $blob, $time);
        }
    }
    close($git_fh);
}

exit $status;

cu,
-- 
Leah Neukirchen https://leahneukirchen.org
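
The tmp/ then new/ delivery order discussed above could look roughly
like the following.  This is only a sketch, not what impibe does:
deliver_safely() and its parameters are hypothetical names, the message
is passed in as a string instead of being streamed from git cat-file,
and the file name is assumed to be unique and to carry no ":2," info
suffix, since that suffix belongs to files in cur/.

```perl
use v5.16;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Hypothetical helper: write a message to $md/tmp/ first and only then
# move it to $md/new/, so an MUA never sees a partially written file.
sub deliver_safely {
    my ($md, $name, $content) = @_;
    my ($tmp, $new) = ("$md/tmp/$name", "$md/new/$name");

    # O_EXCL makes this fail instead of clobbering an existing tmp file.
    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL)
        or die "couldn't create $tmp: $!\n";
    print {$fh} $content or die "couldn't write $tmp: $!\n";
    close($fh) or die "couldn't close $tmp: $!\n";

    # rename() within one filesystem is atomic: readers of new/ see
    # either no file or the complete one.  Unlike link(), it would
    # replace an existing file of the same name.
    rename($tmp, $new) or die "couldn't rename $tmp to $new: $!\n";
    return $new;
}
```

Since the %have scan only globs $md/cur/, a variant like this would
presumably also have to look in $md/new/ for messages the MUA has not
picked up yet.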