From: Leah Neukirchen
To: meta@public-inbox.org
Subject: impibe: incrementally add messages from public-inbox V1/V2 to a Maildir
Date: Wed, 15 Apr 2020 22:06:59 +0200
Message-ID: <87zhbcldi4.fsf@vuxu.org>

Hi *,

I wrote a small script today (see below), and maybe it is useful for
you, too.

The task was to incrementally add messages from a public-inbox V1 or
V2 repo on local disk to a Maildir.  Note that "ssoma sync" only does V1.

impibe encodes each message's blob hash into its Maildir file name, so
later runs add only new messages.  Speed is quite good.  The only
dependencies are git and perl.

This allows having local Maildir mirrors of public-inbox clones, for
example to use with other Maildir indexers or MUAs.  It is way faster
than using NNTP to sync up.

I don't have much Perl experience, so please tell me about any problems
with my code.

#!/usr/bin/perl -w
# impibe - incrementally add messages from public-inbox V1/V2 to a Maildir
#
# To the extent possible under law, Leah Neukirchen has waived
# all copyright and related or neighboring rights to this work.
#
# http://creativecommons.org/publicdomain/zero/1.0/

use v5.16;
use Sys::Hostname;
use autodie qw(open close fork);
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

if (@ARGV < 2) {
    die "Usage: $0 PUBLIC-INBOX MAILDIR\n";
}

my $pi = $ARGV[0];
my $md = $ARGV[1];

my $hostname = hostname;
my $flags = "S";

# Collect the blob hashes of the messages already in the Maildir.
my %have;
for my $mail (glob("$md/cur/*:2*")) {
    if ($mail =~ /,B=([0-9a-f]{40})/) {
        $have{$1} = 1;
    }
}

my $status = 0;

sub deliver {
    my ($repo, $blob, $time) = @_;

    state $delivery = 0;
    $delivery++;

    # Embed the blob hash into the file name so we can check easily
    # which blobs we have already.
    my $name = sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $$, $delivery, $hostname, $blob, $flags;
    my $mail = "$md/cur/$name";

    my $pid = fork;
    if ($pid == 0) {
        no autodie 'sysopen';
        sysopen(STDOUT, $mail, O_WRONLY | O_CREAT | O_EXCL) ||
            die "$0: couldn't create $md/cur/$name: $!\n";
        # Die if exec fails, so the child never falls back into the
        # parent's delivery loop.
        exec "git", "-C", "$repo", "cat-file", "blob", $blob
            or die "$0: couldn't exec git: $!\n";
    } elsif (defined $pid) {
        waitpid $pid, 0;
        if ($? == 0) {
            say $mail;
        } else {
            $status = 1;
        }
    }
}

if (-f "$pi/HEAD") {   # V1 format
    open(my $git_fh, "-|", "git", "-C", $pi,
         qw(log --reverse --raw --format=%H:%at --no-abbrev));
    my $time;
    while (<$git_fh>) {
        chomp;
        if (/^[0-9a-f]{40}:(\d+)$/) {
            $time = $1;
        } elsif (/^:000000 \d+ 0{40} ([0-9a-f]{40}) A/) {
            my $blob = $1;
            next if $have{$blob};
            deliver($pi, $blob, $time);
        }
    }
    close($git_fh);
} else {   # V2 format
    my @repos = glob("$pi/[0-9]*.git");
    unless (@repos) {
        die "$0: no V2 *.git repositories found.\n";
    }

    for my $repo (@repos) {
        my %authored_at;

        # We store the commit dates per tree(!), not per commit,
        # because "git rev-list --objects" will print only the tree.
        # But trees are as unique as messages.
        open(my $git_fh, "-|", "git", "-C", $repo,
             qw(log --pretty=%T:%at @));
        while (<$git_fh>) {
            chomp;
            my ($ref, $time) = split(':');
            $authored_at{$ref} = $time;
        }
        close($git_fh);

        open($git_fh, "-|", "git", "-C", $repo,
             qw(rev-list --reverse --objects @));
        my $time;
        my $blob;
        while (<$git_fh>) {
            chomp;
            if (/ $/) {
                $time = $authored_at{$`};
            } elsif (/ m$/) {
                $blob = $`;
                next if $have{$blob};
                deliver($repo, $blob, $time);
            }
            # XXX 'd' (= deleted) messages are not respected.
        }
        close($git_fh);
    }
}

exit $status;

Enjoy,
-- 
Leah Neukirchen  https://leahneukirchen.org/
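
A minimal sketch of the file-name scheme that deliver() and the %have
scan rely on, pulled out into two helpers.  The helper names
maildir_name and blob_from_name, and the sample hash, PID and host
name, are made up for illustration and are not part of impibe.  It
shows that the blob hash survives a flag change made by a MUA, which is
why a later run can still skip the message.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Maildir file name that carries the blob
# hash, using the same sprintf format as deliver() in the script.
sub maildir_name {
    my ($time, $pid, $seq, $host, $blob, $flags) = @_;
    return sprintf "%d.P%05dQ%d.%s,B=%s:2,%s",
        $time, $pid, $seq, $host, $blob, $flags;
}

# Hypothetical helper: recover the blob hash from a file name, or
# return undef if the name does not carry one.
sub blob_from_name {
    my ($name) = @_;
    return $name =~ /,B=([0-9a-f]{40})/ ? $1 : undef;
}

my $blob = "0123456789abcdef0123456789abcdef01234567";   # made-up hash
my $name = maildir_name(1586981219, 4242, 1, "rhea", $blob, "S");

# Simulate a MUA changing the flags after ":2,".
(my $renamed = $name) =~ s/:2,S$/:2,RS/;

print "$name\n";
print blob_from_name($renamed), "\n";
```

Because the hash sits in the unique part of the name, before ":2,",
only renaming that part (or moving the file out of cur/) would make
the message look new again.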