user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* [PATCH 0/3] initial importer work
@ 2016-04-09  9:14 Eric Wong
  2016-04-09  9:14 ` [PATCH 1/3] git: add support for qx wrapper Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Eric Wong @ 2016-04-09  9:14 UTC (permalink / raw)
  To: meta

There's more in the 'import' branch, but I think the following
three are appropriate for 'master'.  There'll be a command-line
tool of some sort to handle various flavors of mbox, Maildir,
and perhaps PublicInbox::Import can be a documented module for
writing importers for other data formats.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] git: add support for qx wrapper
  2016-04-09  9:14 [PATCH 0/3] initial importer work Eric Wong
@ 2016-04-09  9:14 ` Eric Wong
  2016-04-09  9:14 ` [PATCH 2/3] import: initial module + test case Eric Wong
  2016-04-09  9:14 ` [PATCH 3/3] import: set binmode before printing author names Eric Wong
  2 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2016-04-09  9:14 UTC (permalink / raw)
  To: meta

This lets us one-line git commands easily like ``, but without
having to remember --git-dir or escape arguments.
---
 lib/PublicInbox/Git.pm | 8 ++++++++
 t/git.t                | 8 ++++++++
 2 files changed, 16 insertions(+)

diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index 0f92dd9..c406c03 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -117,6 +117,14 @@ sub popen {
 	popen_rd(\@cmd);
 }
 
+sub qx {
+	my ($self, @cmd) = @_;
+	my $fh = $self->popen(@cmd);
+	return <$fh> if wantarray;
+	local $/;
+	<$fh>
+}
+
 sub cleanup {
 	my ($self) = @_;
 	_destroy($self, qw(in out pid));
diff --git a/t/git.t b/t/git.t
index 0f3dbae..d7b20d0 100644
--- a/t/git.t
+++ b/t/git.t
@@ -129,6 +129,14 @@ if (1) {
 		local $/;
 		is($all, <$fh>, 'entire read matches');
 	};
+
+	my $ref = $gcf->qx(qw(cat-file blob), $buf);
+	is($all, $ref, 'qx read giant single string');
+
+	my @ref = $gcf->qx(qw(cat-file blob), $buf);
+	is($all, join('', @ref), 'qx returned array when wanted');
+	my $nl = scalar @ref;
+	ok($nl > 1, "qx returned array length of $nl");
 }
 
 done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/3] import: initial module + test case
  2016-04-09  9:14 [PATCH 0/3] initial importer work Eric Wong
  2016-04-09  9:14 ` [PATCH 1/3] git: add support for qx wrapper Eric Wong
@ 2016-04-09  9:14 ` Eric Wong
  2016-04-09  9:14 ` [PATCH 3/3] import: set binmode before printing author names Eric Wong
  2 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2016-04-09  9:14 UTC (permalink / raw)
  To: meta

This will allow us to write fast importers for existing
archives as well as eventually removing the ssoma dependency
for performance and ease-of-installation.
---
 lib/PublicInbox/Import.pm | 190 ++++++++++++++++++++++++++++++++++++++++++++++
 t/import.t                |  65 ++++++++++++++++
 2 files changed, 255 insertions(+)
 create mode 100644 lib/PublicInbox/Import.pm
 create mode 100644 t/import.t

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
new file mode 100644
index 0000000..e2156f1
--- /dev/null
+++ b/lib/PublicInbox/Import.pm
@@ -0,0 +1,190 @@
+# Copyright (C) 2016 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+#
+# git fast-import-based ssoma-mda MDA replacement
+# This is only ever run by public-inbox-mda and public-inbox-learn,
+# not the WWW or NNTP code which only requires read-only access.
+package PublicInbox::Import;
+use strict;
+use warnings;
+use Fcntl qw(:flock :DEFAULT);
+use Email::Address;
+use PublicInbox::Spawn qw(spawn);
+use PublicInbox::MID qw(mid_mime mid2path);
+
+sub new {
+	my ($class, $git, $name, $email) = @_;
+	bless {
+		git => $git,
+		ident => "$name <$email>",
+		mark => 1,
+		ref => 'refs/heads/master',
+	}, $class
+}
+
+# idempotent start function
+sub gfi_start {
+	my ($self) = @_;
+
+	return ($self->{in}, $self->{out}) if $self->{pid};
+
+	my ($in_r, $in_w, $out_r, $out_w);
+	pipe($in_r, $in_w) or die "pipe failed: $!";
+	pipe($out_r, $out_w) or die "pipe failed: $!";
+	my $git = $self->{git};
+	my $git_dir = $git->{git_dir};
+	my $lockpath = "$git_dir/ssoma.lock";
+	sysopen(my $lockfh, $lockpath, O_WRONLY|O_CREAT) or
+		die "failed to open lock $lockpath: $!";
+
+	# wait for other processes to be done
+	flock($lockfh, LOCK_EX) or die "lock failed: $!\n";
+	chomp($self->{tip} = $git->qx(qw(rev-parse --revs-only), $self->{ref}));
+
+	my @cmd = ('git', "--git-dir=$git_dir", qw(fast-import
+			--quiet --done --date-format=rfc2822));
+	my $rdr = { 0 => fileno($out_r), 1 => fileno($in_w) };
+	my $pid = spawn(\@cmd, undef, $rdr);
+	die "spawn failed: $!" unless defined $pid;
+	$out_w->autoflush(1);
+	$self->{in} = $in_r;
+	$self->{out} = $out_w;
+	$self->{lockfh} = $lockfh;
+	$self->{pid} = $pid;
+	($in_r, $out_w);
+}
+
+sub wfail () { die "write to fast-import failed: $!" }
+
+sub now2822 () {
+	my @t = gmtime(time);
+	my $day = qw(Sun Mon Tue Wed Thu Fri Sat)[$t[6]];
+	my $mon = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)[$t[4]];
+
+	sprintf('%s, %2d %s %d %02d:%02d:%02d +0000',
+		$day, $t[3], $mon, $t[5] + 1900, $t[2], $t[1], $t[0]);
+}
+
+# returns undef on non-existent
+# (-1, msg) on mismatch
+# (:MARK, msg) on success
+sub remove {
+	my ($self, $mime) = @_; # mime = Email::MIME
+
+	my $mid = mid_mime($mime);
+	my $path = mid2path($mid);
+
+	my ($r, $w) = $self->gfi_start;
+	my $tip = $self->{tip};
+	return if $tip eq '';
+
+	print $w "ls $tip $path\n" or wfail;
+	local $/ = "\n";
+	my $check = <$r>;
+	defined $check or die "EOF from fast-import / ls: $!";
+	return if $check =~ /\Amissing /;
+	$check =~ m!\A100644 blob ([a-f0-9]{40})\t!s or die "not blob: $check";
+	my $blob = $1;
+	print $w "cat-blob $blob\n" or wfail;
+	$check = <$r>;
+	defined $check or die "EOF from fast-import / cat-blob: $!";
+	$check =~ /\A[a-f0-9]{40} blob (\d+)\n\z/ or
+				die "unexpected cat-blob response: $check";
+	my $left = $1;
+	my $offset = 0;
+	my $buf = '';
+	while ($left > 0) {
+		my $n = read($r, $buf, $left, $offset);
+		defined($n) or die "read cat-blob failed: $!";
+		$n == 0 and die 'fast-export (cat-blob) died';
+		$left -= $n;
+		$offset += $n;
+	}
+	read($r, my $lf, 1);
+	die "bad read on final byte: <$lf>" if $lf ne "\n";
+	my $cur = Email::MIME->new($buf);
+	if ($cur->header('Subject') ne $mime->header('Subject') ||
+			$cur->body ne $mime->body) {
+		return (-1, $cur);
+	}
+
+	my $ref = $self->{ref};
+	my $commit = $self->{mark}++;
+	my $parent = $tip =~ /\A:/ ? $tip : undef;
+	unless ($parent) {
+		print $w "reset $ref\n" or wfail;
+	}
+	my $ident = $self->{ident};
+	my $now = now2822();
+	print $w "commit $ref\nmark :$commit\n",
+		"author $ident $now\n",
+		"committer $ident $now\n",
+		"data 3\nrm\n\n",
+		'from ', ($parent ? $parent : $tip), "\n" or wfail;
+	print $w "D $path\n\n" or wfail;
+	(($self->{tip} = ":$commit"), $cur);
+}
+
+# returns undef on duplicate
+sub add {
+	my ($self, $mime) = @_; # mime = Email::MIME
+
+	my $from = $mime->header('From');
+	my @from = Email::Address->parse($from);
+	my $name = $from[0]->name;
+	my $email = $from[0]->address;
+	my $date = $mime->header('Date');
+	my $subject = $mime->header('Subject');
+	$subject = '(no subject)' unless defined $subject;
+	my $mid = mid_mime($mime);
+	my $path = mid2path($mid);
+
+	my ($r, $w) = $self->gfi_start;
+	my $tip = $self->{tip};
+	if ($tip ne '') {
+		print $w "ls $tip $path\n" or wfail;
+		local $/ = "\n";
+		my $check = <$r>;
+		defined $check or die "EOF from fast-import: $!";
+		return unless $check =~ /\Amissing /;
+	}
+
+	# kill potentially confusing/misleading headers
+	$mime->header_set($_) for qw(bytes lines content-length status);
+	$mime = $mime->as_string;
+	my $blob = $self->{mark}++;
+	print $w "blob\nmark :$blob\ndata ", length($mime), "\n" or wfail;
+	print $w $mime, "\n" or wfail;
+	my $ref = $self->{ref};
+	my $commit = $self->{mark}++;
+	my $parent = $tip =~ /\A:/ ? $tip : undef;
+
+	unless ($parent) {
+		print $w "reset $ref\n" or wfail;
+	}
+	print $w "commit $ref\nmark :$commit\n",
+		"author $name <$email> $date\n",
+		"committer $self->{ident} ", now2822(), "\n",
+		"data ", (length($subject) + 1), "\n",
+		$subject, "\n\n" or wfail;
+	if ($tip ne '') {
+		print $w 'from ', ($parent ? $parent : $tip), "\n" or wfail;
+	}
+	print $w "M 100644 :$blob $path\n\n" or wfail;
+	$self->{tip} = ":$commit";
+}
+
+sub done {
+	my ($self) = @_;
+	my $w = delete $self->{out} or return;
+	my $r = delete $self->{in} or die 'BUG: missing {in} when done';
+	print $w "done\n" or wfail;
+	my $pid = delete $self->{pid} or die 'BUG: missing {pid} when done';
+	waitpid($pid, 0) == $pid or die 'fast-import did not finish';
+	$? == 0 or die "fast-import failed: $?";
+	my $lockfh = delete $self->{lockfh} or die "BUG: not locked: $!";
+	flock($lockfh, LOCK_UN) or die "unlock failed: $!";
+	close $lockfh or die "close lock failed: $!";
+}
+
+1;
diff --git a/t/import.t b/t/import.t
new file mode 100644
index 0000000..6918484
--- /dev/null
+++ b/t/import.t
@@ -0,0 +1,65 @@
+# Copyright (C) 2016 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use Email::MIME;
+use PublicInbox::Git;
+use PublicInbox::Import;
+use File::Temp qw/tempdir/;
+my $dir = tempdir('pi-import-XXXXXX', TMPDIR => 1, CLEANUP => 1);
+
+is(system(qw(git init -q --bare), $dir), 0, 'git init successful');
+my $git = PublicInbox::Git->new($dir);
+
+my $im = PublicInbox::Import->new($git, 'testbox', 'test@example');
+my $mime = Email::MIME->create(
+	header => [
+		From => 'a@example.com',
+		To => 'b@example.com',
+		'Content-Type' => 'text/plain',
+		Subject => 'this is a subject',
+		'Message-ID' => '<a@example.com>',
+	],
+	body => "hello world\n",
+);
+like($im->add($mime), qr/\A:\d+\z/, 'added one message');
+$im->done;
+my @revs = $git->qx(qw(rev-list HEAD));
+is(scalar @revs, 1, 'one revision created');
+
+$mime->header_set('Message-ID', '<b@example.com>');
+$mime->header_set('Subject', 'msg2');
+like($im->add($mime), qr/\A:\d+\z/, 'added 2nd message');
+$im->done;
+@revs = $git->qx(qw(rev-list HEAD));
+is(scalar @revs, 2, '2 revisions exist');
+
+is($im->add($mime), undef, 'message only inserted once');
+$im->done;
+@revs = $git->qx(qw(rev-list HEAD));
+is(scalar @revs, 2, '2 revisions exist');
+
+foreach my $c ('c'..'z') {
+	$mime->header_set('Message-ID', "<$c\@example.com>");
+	$mime->header_set('Subject', "msg - $c");
+	like($im->add($mime), qr/\A:\d+\z/, "added $c message");
+}
+$im->done;
+@revs = $git->qx(qw(rev-list HEAD));
+is(scalar @revs, 26, '26 revisions exist after mass import');
+my ($mark, $msg) = $im->remove($mime);
+like($mark, qr/\A:\d+\z/, 'got mark');
+is(ref($msg), 'Email::MIME', 'got old message deleted');
+
+is(undef, $im->remove($mime), 'remove is idempotent');
+
+# mismatch on identical Message-ID
+$mime->header_set('Message-ID', '<a@example.com>');
+($mark, $msg) = $im->remove($mime);
+is($mark, -1, 'mark == -1 on mismatch');
+is($msg->header('Message-ID'), '<a@example.com>', 'Message-ID matches');
+isnt($msg->header('Subject'), $mime->header('Subject'), 'subject mismatch');
+
+$im->done;
+done_testing();
-- 
EW


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/3] import: set binmode before printing author names
  2016-04-09  9:14 [PATCH 0/3] initial importer work Eric Wong
  2016-04-09  9:14 ` [PATCH 1/3] git: add support for qx wrapper Eric Wong
  2016-04-09  9:14 ` [PATCH 2/3] import: initial module + test case Eric Wong
@ 2016-04-09  9:14 ` Eric Wong
  2016-04-11  5:20   ` [PATCH 4/3] import: use bytes::length for true data length in bytes Eric Wong
  2 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2016-04-09  9:14 UTC (permalink / raw)
  To: meta

Author names may have wide characters in them, so avoid warnings
as git favors UTF-8 for names and fast-import even requires them
for commit messages
---
 lib/PublicInbox/Import.pm | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index e2156f1..4c7bbf1 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -162,11 +162,16 @@ sub add {
 	unless ($parent) {
 		print $w "reset $ref\n" or wfail;
 	}
+
+	# quiet down wide character warnings:
+	binmode $w, ':utf8' or die "binmode :utf8 failed: $!";
 	print $w "commit $ref\nmark :$commit\n",
 		"author $name <$email> $date\n",
 		"committer $self->{ident} ", now2822(), "\n",
 		"data ", (length($subject) + 1), "\n",
 		$subject, "\n\n" or wfail;
+	binmode $w, ':raw' or die "binmode :raw failed: $!";
+
 	if ($tip ne '') {
 		print $w 'from ', ($parent ? $parent : $tip), "\n" or wfail;
 	}
-- 
EW


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 4/3] import: use bytes::length for true data length in bytes
  2016-04-09  9:14 ` [PATCH 3/3] import: set binmode before printing author names Eric Wong
@ 2016-04-11  5:20   ` Eric Wong
  2016-04-12 21:18     ` [PATCH 5/3] import: filter out [<>] from user names Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2016-04-11  5:20 UTC (permalink / raw)
  To: meta

git is byte-oriented and fast-import will not tolerate
miscalculations.  This is necessary for wide characters
in commit messages (email Subjects).
---
 lib/PublicInbox/Import.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 4c7bbf1..5bae69d 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -168,7 +168,7 @@ sub add {
 	print $w "commit $ref\nmark :$commit\n",
 		"author $name <$email> $date\n",
 		"committer $self->{ident} ", now2822(), "\n",
-		"data ", (length($subject) + 1), "\n",
+		"data ", (bytes::length($subject) + 1), "\n",
 		$subject, "\n\n" or wfail;
 	binmode $w, ':raw' or die "binmode :raw failed: $!";
 
-- 
EW

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 5/3] import: filter out [<>] from user names
  2016-04-11  5:20   ` [PATCH 4/3] import: use bytes::length for true data length in bytes Eric Wong
@ 2016-04-12 21:18     ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2016-04-12 21:18 UTC (permalink / raw)
  To: meta

It confuses the git ident parser and may not be a great
idea to fix in git since it could break interopability
with older versions.
---
 lib/PublicInbox/Import.pm | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 5bae69d..7073c3d 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -139,6 +139,12 @@ sub add {
 	my $mid = mid_mime($mime);
 	my $path = mid2path($mid);
 
+	# git gets confused with:
+	#  "'A U Thor <u@example.com>' via foo" <foo@example.com>
+	# ref:
+	# <CAD0k6qSUYANxbjjbE4jTW4EeVwOYgBD=bXkSu=akiYC_CB7Ffw@mail.gmail.com>
+	$name =~ s/<([^>]+)>/($1)/g;
+
 	my ($r, $w) = $self->gfi_start;
 	my $tip = $self->{tip};
 	if ($tip ne '') {
-- 
EW

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-04-12 21:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-09  9:14 [PATCH 0/3] initial importer work Eric Wong
2016-04-09  9:14 ` [PATCH 1/3] git: add support for qx wrapper Eric Wong
2016-04-09  9:14 ` [PATCH 2/3] import: initial module + test case Eric Wong
2016-04-09  9:14 ` [PATCH 3/3] import: set binmode before printing author names Eric Wong
2016-04-11  5:20   ` [PATCH 4/3] import: use bytes::length for true data length in bytes Eric Wong
2016-04-12 21:18     ` [PATCH 5/3] import: filter out [<>] from user names Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).