user/dev discussion of public-inbox itself
 help / color / mirror / Atom feed
* [PATCH 0/7] gcf2: libgit2-based cat-file alternative
@ 2020-09-19  9:37 Eric Wong
  2020-09-19  9:37 ` [PATCH 1/7] gcf2: libgit2-based git " Eric Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta

This allows a single cat-file-like process to handle multiple
inboxes; instead of having a "git cat-file --batch" process for
every inbox; saving pipes and process table space.

The preliminary code was done months ago, but I struggled to put
all the pieces together in a coherent way.  My brain has been
scattered :x

I finally decided to make the gcf2 process a global singleton
(per-worker) to avoid complexity elsewhere in the config...

It doesn't detect or release unlinked packs + indices, yet,
so "git gc" may not free disk space until restarted.

Otherwise it does detect new epochs and seems mostly working
otherwise...

Eric Wong (7):
  gcf2: libgit2-based git cat-file alternative
  t/gcf2: test changes to alternates
  add gcf2 client and executable script
  gcf2: transparently retry on missing OID
  gcf2*: more descriptive package descriptions
  gcf2: require git dir with OID
  gcf2: wire up read-only daemons and rm -gcf2 script

 MANIFEST                       |   5 +
 lib/PublicInbox/Daemon.pm      |  11 +++
 lib/PublicInbox/Gcf2.pm        |  89 ++++++++++++++++++
 lib/PublicInbox/Gcf2Client.pm  |  62 +++++++++++++
 lib/PublicInbox/Git.pm         |  41 ++++++---
 lib/PublicInbox/GitAsyncCat.pm |  73 +++++++++++++--
 lib/PublicInbox/IMAP.pm        |   2 +-
 lib/PublicInbox/gcf2_libgit2.h | 142 +++++++++++++++++++++++++++++
 script/public-inbox-httpd      |   1 +
 t/gcf2.t                       | 162 +++++++++++++++++++++++++++++++++
 t/gcf2_client.t                |  90 ++++++++++++++++++
 11 files changed, 652 insertions(+), 26 deletions(-)
 create mode 100644 lib/PublicInbox/Gcf2.pm
 create mode 100644 lib/PublicInbox/Gcf2Client.pm
 create mode 100644 lib/PublicInbox/gcf2_libgit2.h
 create mode 100644 t/gcf2.t
 create mode 100644 t/gcf2_client.t

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/7] gcf2: libgit2-based git cat-file alternative
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  2020-09-19  9:37 ` [PATCH 2/7] t/gcf2: test changes to alternates Eric Wong
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong

From: Eric Wong <e@yhbt.net>

Having tens of thousands of inboxes and associated git processes
won't work well, so we'll use libgit2 to access the object DB
directly.  We only care about OID lookups and won't need to rely
on per-repo revision names or paths.

The Git::Raw XS package won't be used since its manpages don't
promise a stable API.  Since we already use Inline::C and have
experience with I::C when it comes to compatibility, this only
introduces libgit2 itself as a source of new incompatibilities.

This also provides an excuse for me to writev(2) to reduce
syscalls, but liburing is on the horizon for next year.
---
 MANIFEST                       |   3 +
 lib/PublicInbox/Gcf2.pm        |  56 +++++++++++++
 lib/PublicInbox/gcf2_libgit2.h | 139 +++++++++++++++++++++++++++++++++
 t/gcf2.t                       | 112 ++++++++++++++++++++++++++
 4 files changed, 310 insertions(+)
 create mode 100644 lib/PublicInbox/Gcf2.pm
 create mode 100644 lib/PublicInbox/gcf2_libgit2.h
 create mode 100644 t/gcf2.t

diff --git a/MANIFEST b/MANIFEST
index 04a3744f..0d3a7073 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -128,6 +128,7 @@ lib/PublicInbox/Filter/Mirror.pm
 lib/PublicInbox/Filter/RubyLang.pm
 lib/PublicInbox/Filter/SubjectTag.pm
 lib/PublicInbox/Filter/Vger.pm
+lib/PublicInbox/Gcf2.pm
 lib/PublicInbox/GetlineBody.pm
 lib/PublicInbox/Git.pm
 lib/PublicInbox/GitAsyncCat.pm
@@ -212,6 +213,7 @@ lib/PublicInbox/WwwStatic.pm
 lib/PublicInbox/WwwStream.pm
 lib/PublicInbox/WwwText.pm
 lib/PublicInbox/Xapcmd.pm
+lib/PublicInbox/gcf2_libgit2.h
 sa_config/Makefile
 sa_config/README
 sa_config/root/etc/spamassassin/public-inbox.pre
@@ -275,6 +277,7 @@ t/filter_mirror.t
 t/filter_rubylang.t
 t/filter_subjecttag.t
 t/filter_vger.t
+t/gcf2.t
 t/git-http-backend.psgi
 t/git.fast-import-data
 t/git.t
diff --git a/lib/PublicInbox/Gcf2.pm b/lib/PublicInbox/Gcf2.pm
new file mode 100644
index 00000000..6ac3aa18
--- /dev/null
+++ b/lib/PublicInbox/Gcf2.pm
@@ -0,0 +1,56 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+#
+# git-cat-file based on libgit2
+package PublicInbox::Gcf2;
+use strict;
+use PublicInbox::Spawn qw(which popen_rd);
+use Fcntl qw(LOCK_EX);
+my (%CFG, $c_src, $lockfh);
+BEGIN {
+	# PublicInbox::Spawn will set PERL_INLINE_DIRECTORY
+	# to ~/.cache/public-inbox/inline-c if it exists
+	my $inline_dir = $ENV{PERL_INLINE_DIRECTORY} //
+		die 'PERL_INLINE_DIRECTORY not defined';
+	my $f = "$inline_dir/.public-inbox.lock";
+	open $lockfh, '>', $f or die "failed to open $f: $!\n";
+	my $pc = which($ENV{PKG_CONFIG} // 'pkg-config');
+	my ($dir) = (__FILE__ =~ m!\A(.+?)/[^/]+\z!);
+	my $rdr = {};
+	open $rdr->{2}, '>', '/dev/null' or die "open /dev/null: $!";
+	for my $x (qw(libgit2)) {
+		my $l = popen_rd([$pc, '--libs', $x], undef, $rdr);
+		$l = do { local $/; <$l> };
+		next if $?;
+		my $c = popen_rd([$pc, '--cflags', $x], undef, $rdr);
+		$c = do { local $/; <$c> };
+		next if $?;
+
+		# note: we name C source files .h to prevent
+		# ExtUtils::MakeMaker from automatically trying to
+		# build them.
+		my $f = "$dir/gcf2_$x.h";
+		if (open(my $fh, '<', $f)) {
+			chomp($l, $c);
+			local $/;
+			$c_src = <$fh>;
+			$CFG{LIBS} = $l;
+			$CFG{CCFLAGSEX} = $c;
+			last;
+		} else {
+			die "E: $f: $!\n";
+		}
+	}
+	die "E: libgit2 not installed\n" unless $c_src;
+
+	# CentOS 7.x ships Inline 0.53, 0.64+ has built-in locking
+	flock($lockfh, LOCK_EX) or die "LOCK_EX failed on $f: $!\n";
+}
+
+# we use Capitalized and ALLCAPS for compatibility with old Inline::C
+use Inline C => Config => %CFG, BOOT => 'git_libgit2_init();';
+use Inline C => $c_src;
+undef $c_src;
+undef %CFG;
+undef $lockfh;
+1;
diff --git a/lib/PublicInbox/gcf2_libgit2.h b/lib/PublicInbox/gcf2_libgit2.h
new file mode 100644
index 00000000..d9c79cf9
--- /dev/null
+++ b/lib/PublicInbox/gcf2_libgit2.h
@@ -0,0 +1,139 @@
+/*
+ * Copyright (C) 2020 all contributors <meta@public-inbox.org>
+ * License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+ *
+ * libgit2 for Inline::C
+ * Avoiding Git::Raw since it doesn't guarantee a stable API,
+ * while libgit2 itself seems reasonably stable.
+ */
+#include <git2.h>
+#include <sys/uio.h>
+#include <errno.h>
+#include <poll.h>
+
+static void croak_if_err(int rc, const char *msg)
+{
+	if (rc != GIT_OK) {
+		const git_error *e = giterr_last();
+
+		croak("%d %s (%s)", rc, msg, e ? e->message : "unknown");
+	}
+}
+
+SV *new()
+{
+	git_odb *odb;
+	SV *ref, *self;
+	int rc = git_odb_new(&odb);
+	croak_if_err(rc, "git_odb_new");
+
+	ref = newSViv((IV)odb);
+	self = newRV_noinc(ref);
+	sv_bless(self, gv_stashpv("PublicInbox::Gcf2", GV_ADD));
+	SvREADONLY_on(ref);
+
+	return self;
+}
+
+static git_odb *odb_ptr(SV *self)
+{
+	return (git_odb *)SvIV(SvRV(self));
+}
+
+void DESTROY(SV *self)
+{
+	git_odb_free(odb_ptr(self));
+}
+
+/* needs "$GIT_DIR/objects", not $GIT_DIR */
+void add_alternate(SV *self, const char *objects_path)
+{
+	int rc = git_odb_add_disk_alternate(odb_ptr(self), objects_path);
+	croak_if_err(rc, "git_odb_add_disk_alternate");
+}
+
+/* this requires an unabbreviated git OID */
+#define CAPA(v) (sizeof(v) / sizeof((v)[0]))
+void cat_oid(SV *self, int fd, SV *oidsv)
+{
+	/*
+	 * adjust when libgit2 gets SHA-256 support, we return the
+	 * same header as git-cat-file --batch "$OID $TYPE $SIZE\n"
+	 */
+	char hdr[GIT_OID_HEXSZ + sizeof(" commit 18446744073709551615")];
+	struct iovec vec[3];
+	size_t nvec = CAPA(vec);
+	git_oid oid;
+	git_odb_object *object = NULL;
+	int rc, err = 0;
+	STRLEN oidlen;
+	char *oidptr = SvPV(oidsv, oidlen);
+
+	/* same trailer as git-cat-file --batch */
+	vec[2].iov_len = 1;
+	vec[2].iov_base = "\n";
+
+	rc = git_oid_fromstrn(&oid, oidptr, oidlen);
+	if (rc == GIT_OK)
+		rc = git_odb_read(&object, odb_ptr(self), &oid);
+	if (rc == GIT_OK) {
+		vec[0].iov_base = hdr;
+		vec[1].iov_base = (void *)git_odb_object_data(object);
+		vec[1].iov_len = git_odb_object_size(object);
+
+		git_oid_nfmt(hdr, GIT_OID_HEXSZ, git_odb_object_id(object));
+		vec[0].iov_len = GIT_OID_HEXSZ +
+				snprintf(hdr + GIT_OID_HEXSZ,
+					sizeof(hdr) - GIT_OID_HEXSZ,
+					" %s %zu\n",
+					git_object_type2string(
+						git_odb_object_type(object)),
+					vec[1].iov_len);
+	} else {
+		vec[0].iov_base = oidptr;
+		vec[0].iov_len = oidlen;
+		vec[1].iov_base = " missing";
+		vec[1].iov_len = strlen(vec[1].iov_base);
+	}
+	while (nvec && !err) {
+		ssize_t w = writev(fd, vec + CAPA(vec) - nvec, nvec);
+
+		if (w > 0) {
+			size_t done = 0;
+			size_t i;
+
+			for (i = CAPA(vec) - nvec; i < CAPA(vec); i++) {
+				if (w >= vec[i].iov_len) {
+					/* fully written vec */
+					w -= vec[i].iov_len;
+					done++;
+				} else { /* partially written vec */
+					char *p = vec[i].iov_base;
+					vec[i].iov_base = p + w;
+					vec[i].iov_len -= w;
+					break;
+				}
+			}
+			nvec -= done;
+		} else if (w < 0) {
+			err = errno;
+			switch (err) {
+			case EAGAIN: {
+				struct pollfd pfd;
+				pfd.events = POLLOUT;
+				pfd.fd = fd;
+				poll(&pfd, 1, -1);
+			}
+				/* fall-through */
+			case EINTR:
+				err = 0;
+			}
+		} else { /* w == 0 */
+			err = ENOSPC;
+		}
+	}
+	if (object)
+		git_odb_object_free(object);
+	if (err)
+		croak("writev error: %s", strerror(err));
+}
diff --git a/t/gcf2.t b/t/gcf2.t
new file mode 100644
index 00000000..c67efb6c
--- /dev/null
+++ b/t/gcf2.t
@@ -0,0 +1,112 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use PublicInbox::TestCommon;
+use Test::More;
+use Fcntl qw(:seek);
+use IO::Handle ();
+use POSIX qw(_exit);
+require_mods('PublicInbox::Gcf2');
+use_ok 'PublicInbox::Gcf2';
+my $gcf2 = PublicInbox::Gcf2::new();
+is(ref($gcf2), 'PublicInbox::Gcf2', '::new works');
+chomp(my $objdir = xqx([qw(git rev-parse --git-path objects)]));
+if ($objdir =~ /\A--git-path\n/) { # git <2.5
+	chomp($objdir = xqx([qw(git rev-parse --git-dir)]));
+	$objdir .= '/objects';
+	$objdir = undef unless -d $objdir;
+}
+
+my $COPYING = 'dba13ed2ddf783ee8118c6a581dbf75305f816a3';
+open my $agpl, '<', 'COPYING' or BAIL_OUT "AGPL-3 missing: $!";
+$agpl = do { local $/; <$agpl> };
+
+SKIP: {
+	skip 'not in git worktree', 15 unless defined($objdir);
+	$gcf2->add_alternate($objdir);
+	open my $fh, '+>', undef or BAIL_OUT "open: $!";
+	my $fd = fileno($fh);
+	$fh->autoflush(1);
+
+	$gcf2->cat_oid($fd, 'invalid');
+	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
+	is(do { local $/; <$fh> }, "invalid missing\n", 'got missing message');
+
+	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
+	$gcf2->cat_oid($fd, '0'x40);
+	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
+	is(do { local $/; <$fh> }, ('0'x40)." missing\n",
+		'got missing message for 0x40');
+
+	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
+	$gcf2->cat_oid($fd, $COPYING);
+	my $buf;
+	my $ck_copying = sub {
+		my ($desc) = @_;
+		seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
+		is(<$fh>, "$COPYING blob 34520\n", 'got expected header');
+		$buf = do { local $/; <$fh> };
+		is(chop($buf), "\n", 'got trailing \\n');
+		is($buf, $agpl, "AGPL matches ($desc)");
+	};
+	$ck_copying->('regular file');
+
+	$^O eq 'linux' or skip('pipe tests are Linux-only', 12);
+	my $size = -s $fh;
+	for my $blk (1, 0) {
+		my ($r, $w);
+		pipe($r, $w) or BAIL_OUT $!;
+		fcntl($w, 1031, 4096) or
+			skip('Linux too old for F_SETPIPE_SZ', 12);
+		$w->blocking($blk);
+		seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
+		truncate($fh, 0) or BAIL_OUT "truncate: $!";
+		defined(my $pid = fork) or BAIL_OUT "fork: $!";
+		if ($pid == 0) {
+			close $w;
+			tick; # wait for parent to block on writev
+			$buf = do { local $/; <$r> };
+			print $fh $buf or _exit(1);
+			_exit(0);
+		}
+		$gcf2->cat_oid(fileno($w), $COPYING);
+		close $w or BAIL_OUT "close: $!";
+		is(waitpid($pid, 0), $pid, 'child exited');
+		is($?, 0, 'no error in child');
+		$ck_copying->("pipe blocking($blk)");
+
+		pipe($r, $w) or BAIL_OUT $!;
+		fcntl($w, 1031, 4096) or BAIL_OUT $!;
+		$w->blocking($blk);
+		close $r;
+		local $SIG{PIPE} = 'IGNORE';
+		eval { $gcf2->cat_oid(fileno($w), $COPYING) };
+		like($@, qr/writev error:/, 'got writev error');
+	}
+}
+
+if (my $nr = $ENV{TEST_LEAK_NR}) {
+	open my $null, '>', '/dev/null' or BAIL_OUT "open /dev/null: $!";
+	my $fd = fileno($null);
+	my $cat = $ENV{TEST_LEAK_CAT} // 10;
+	diag "checking for leaks... (TEST_LEAK_NR=$nr TEST_LEAK_CAT=$cat)";
+	local $SIG{PIPE} = 'IGNORE';
+	my ($r, $w);
+	pipe($r, $w);
+	close $r;
+	my $broken = fileno($w);
+	for (1..$nr) {
+		my $obj = PublicInbox::Gcf2::new();
+		if (defined($objdir)) {
+			$obj->add_alternate($objdir);
+			for (1..$cat) {
+				$obj->cat_oid($fd, $COPYING);
+				eval { $obj->cat_oid($broken, $COPYING) };
+				$obj->cat_oid($fd, '0'x40);
+				$obj->cat_oid($fd, 'invalid');
+			}
+		}
+	}
+}
+done_testing;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/7] t/gcf2: test changes to alternates
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
  2020-09-19  9:37 ` [PATCH 1/7] gcf2: libgit2-based git " Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  2020-09-19  9:37 ` [PATCH 3/7] add gcf2 client and executable script Eric Wong
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong

From: Eric Wong <e@yhbt.net>

Calling ->add_alternate won't pick up new additions to
$OBJDIR/info/alternates, unfornately.  Thus v2 inboxes will
need to do something to invalidate Gcf2 objects.
---
 t/gcf2.t | 68 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/t/gcf2.t b/t/gcf2.t
index c67efb6c..9056b340 100644
--- a/t/gcf2.t
+++ b/t/gcf2.t
@@ -7,24 +7,74 @@ use Test::More;
 use Fcntl qw(:seek);
 use IO::Handle ();
 use POSIX qw(_exit);
+use Cwd qw(abs_path);
 require_mods('PublicInbox::Gcf2');
 use_ok 'PublicInbox::Gcf2';
+use PublicInbox::Import;
+my ($tmpdir, $for_destroy) = tmpdir();
+
 my $gcf2 = PublicInbox::Gcf2::new();
 is(ref($gcf2), 'PublicInbox::Gcf2', '::new works');
+my $COPYING = 'dba13ed2ddf783ee8118c6a581dbf75305f816a3';
+open my $agpl, '<', 'COPYING' or BAIL_OUT "AGPL-3 missing: $!";
+$agpl = do { local $/; <$agpl> };
+
+PublicInbox::Import::init_bare($tmpdir);
+my $fi_data = './t/git.fast-import-data';
+my $rdr = {};
+open $rdr->{0}, '<', $fi_data or BAIL_OUT $!;
+xsys([qw(git fast-import --quiet)], { GIT_DIR => $tmpdir }, $rdr);
+is($?, 0, 'fast-import succeeded');
+$gcf2->add_alternate("$tmpdir/objects");
+
+{
+	my ($r, $w);
+	pipe($r, $w) or BAIL_OUT $!;
+	my $tree = 'fdbc43725f21f485051c17463b50185f4c3cf88c';
+	$gcf2->cat_oid(fileno($w), $tree);
+	close $w;
+	is("$tree tree 30\n", <$r>, 'tree header ok');
+	$r = do { local $/; <$r> };
+	is(chop($r), "\n", 'got trailing newline');
+	is(length($r), 30, 'tree length matches');
+}
+
 chomp(my $objdir = xqx([qw(git rev-parse --git-path objects)]));
 if ($objdir =~ /\A--git-path\n/) { # git <2.5
 	chomp($objdir = xqx([qw(git rev-parse --git-dir)]));
 	$objdir .= '/objects';
-	$objdir = undef unless -d $objdir;
 }
+if ($objdir && -d $objdir) {
+	$objdir = abs_path($objdir);
+	open my $alt, '>>', "$tmpdir/objects/info/alternates" or
+							BAIL_OUT $!;
+	print $alt $objdir, "\n" or BAIL_OUT $!;
+	close $alt or BAIL_OUT $!;
 
-my $COPYING = 'dba13ed2ddf783ee8118c6a581dbf75305f816a3';
-open my $agpl, '<', 'COPYING' or BAIL_OUT "AGPL-3 missing: $!";
-$agpl = do { local $/; <$agpl> };
+	# calling gcf2->add_alternate on an already-added path won't
+	# cause alternates to be reloaded, so we do
+	# $gcf2->add_alternate($objdir) later on instead of
+	# $gcf2->add_alternate("$tmpdir/objects");
+	# $objdir = "$tmpdir/objects";
+} else {
+	$objdir = undef
+}
+
+my $nr = $ENV{TEST_LEAK_NR};
+my $cat = $ENV{TEST_LEAK_CAT} // 10;
+diag "checking for leaks... (TEST_LEAK_NR=$nr TEST_LEAK_CAT=$cat)" if $nr;
 
 SKIP: {
-	skip 'not in git worktree', 15 unless defined($objdir);
+	skip 'not in git worktree', 21 unless defined($objdir);
 	$gcf2->add_alternate($objdir);
+	eval { $gcf2->add_alternate($objdir) };
+	ok(!$@, 'no error adding alternate redundantly');
+	if ($nr) {
+		diag "adding alternate $nr times redundantly";
+		$gcf2->add_alternate($objdir) for (1..$nr);
+		diag 'done adding redundant alternates';
+	}
+
 	open my $fh, '+>', undef or BAIL_OUT "open: $!";
 	my $fd = fileno($fh);
 	$fh->autoflush(1);
@@ -52,6 +102,10 @@ SKIP: {
 	};
 	$ck_copying->('regular file');
 
+	$gcf2 = PublicInbox::Gcf2::new();
+	$gcf2->add_alternate("$tmpdir/objects");
+	$ck_copying->('alternates respected');
+
 	$^O eq 'linux' or skip('pipe tests are Linux-only', 12);
 	my $size = -s $fh;
 	for my $blk (1, 0) {
@@ -86,11 +140,9 @@ SKIP: {
 	}
 }
 
-if (my $nr = $ENV{TEST_LEAK_NR}) {
+if ($nr) {
 	open my $null, '>', '/dev/null' or BAIL_OUT "open /dev/null: $!";
 	my $fd = fileno($null);
-	my $cat = $ENV{TEST_LEAK_CAT} // 10;
-	diag "checking for leaks... (TEST_LEAK_NR=$nr TEST_LEAK_CAT=$cat)";
 	local $SIG{PIPE} = 'IGNORE';
 	my ($r, $w);
 	pipe($r, $w);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 3/7] add gcf2 client and executable script
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
  2020-09-19  9:37 ` [PATCH 1/7] gcf2: libgit2-based git " Eric Wong
  2020-09-19  9:37 ` [PATCH 2/7] t/gcf2: test changes to alternates Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  2020-09-19  9:37 ` [PATCH 4/7] gcf2: transparently retry on missing OID Eric Wong
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong

From: Eric Wong <e@yhbt.net>

This should be able to replace multiple `git cat-file' for blob
retrieval, but adjustments may be needed.
---
 Documentation/public-inbox-gcf2.pod | 63 +++++++++++++++++++++++++++++
 MANIFEST                            |  4 ++
 Makefile.PL                         |  5 +++
 lib/PublicInbox/Gcf2Client.pm       | 35 ++++++++++++++++
 script/public-inbox-gcf2            | 14 +++++++
 t/gcf2_client.t                     | 47 +++++++++++++++++++++
 6 files changed, 168 insertions(+)
 create mode 100644 Documentation/public-inbox-gcf2.pod
 create mode 100644 lib/PublicInbox/Gcf2Client.pm
 create mode 100755 script/public-inbox-gcf2
 create mode 100644 t/gcf2_client.t

diff --git a/Documentation/public-inbox-gcf2.pod b/Documentation/public-inbox-gcf2.pod
new file mode 100644
index 00000000..813fbe7f
--- /dev/null
+++ b/Documentation/public-inbox-gcf2.pod
@@ -0,0 +1,63 @@
+=head1 NAME
+
+public-inbox-gcf2 - internal libgit2-based blob retriever
+
+=head1 SYNOPSIS
+
+	This is an internal command used by public-inbox.
+	It may change unrecognizably or cease to exist at some point
+
+=head1 DESCRIPTION
+
+public-inbox-gcf2 is an optional internal process used by
+public-inbox daemons for read-only access to underlying git
+repositories.
+
+Users are NOT expected to run public-inbox-gcf2 on their own.
+It replaces multiple C<git cat-file --batch> processes by treating
+any git repos it knows about as alternates.
+
+None of its behaviors are stable and it is ALL subject to change
+at any time.
+
+Any lines written to its standard input prefixed with a C</>
+are interpreted as a git directory.  That git directory
+will be suffixed with "/objects" and treated as an alternate.
+It writes nothing to stdout in this case.
+
+Otherwise it behaves like C<git cat-file --batch>, but only accepts
+unabbreviated hexadecimal object IDs in its standard input.
+Its output format is identical to C<git cat-file --batch>.  It
+only works for L<public-inbox-v2-format(5)> inboxes and v1
+inboxes indexed by L<public-inbox-index(1)>.
+
+=head1 OPTIONS
+
+=head1 ENVIRONMENT
+
+=over 8
+
+=item PERL_INLINE_DIRECTORY
+
+This must be set unless C<~/.cache/public-inbox/inline-c>
+exists.  C<public-inbox-gcf2> uses L<Inline::C> and libgit2
+and compiles a small shim on its first run.
+
+=back
+
+=head1 CONTACT
+
+Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
+
+The mail archives are hosted at L<https://public-inbox.org/meta/>
+and L<http://hjrcffqmbrq6wope.onion/meta/>
+
+=head1 COPYRIGHT
+
+Copyright 2020 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<git-cat-file(1)>
diff --git a/MANIFEST b/MANIFEST
index 0d3a7073..91457dab 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -26,6 +26,7 @@ Documentation/public-inbox-config.pod
 Documentation/public-inbox-convert.pod
 Documentation/public-inbox-daemon.pod
 Documentation/public-inbox-edit.pod
+Documentation/public-inbox-gcf2.pod
 Documentation/public-inbox-httpd.pod
 Documentation/public-inbox-imapd.pod
 Documentation/public-inbox-index.pod
@@ -129,6 +130,7 @@ lib/PublicInbox/Filter/RubyLang.pm
 lib/PublicInbox/Filter/SubjectTag.pm
 lib/PublicInbox/Filter/Vger.pm
 lib/PublicInbox/Gcf2.pm
+lib/PublicInbox/Gcf2Client.pm
 lib/PublicInbox/GetlineBody.pm
 lib/PublicInbox/Git.pm
 lib/PublicInbox/GitAsyncCat.pm
@@ -221,6 +223,7 @@ sa_config/user/.spamassassin/user_prefs
 script/public-inbox-compact
 script/public-inbox-convert
 script/public-inbox-edit
+script/public-inbox-gcf2
 script/public-inbox-httpd
 script/public-inbox-imapd
 script/public-inbox-index
@@ -278,6 +281,7 @@ t/filter_rubylang.t
 t/filter_subjecttag.t
 t/filter_vger.t
 t/gcf2.t
+t/gcf2_client.t
 t/git-http-backend.psgi
 t/git.fast-import-data
 t/git.t
diff --git a/Makefile.PL b/Makefile.PL
index 3fe9acf8..5a268362 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -71,6 +71,11 @@ $v->{gz_docs} = [ map { "$_.gz" } (@{$v->{docs}},@{$v->{docs_html}}) ];
 $v->{rsync_docs} = [ @{$v->{gz_docs}}, @{$v->{docs}},
 	@{$v->{docs_html}}, qw(NEWS.atom NEWS.atom.gz)];
 
+# filter out public-inbox-gcf2 from the website, it's an internal command
+for my $var (qw(gz_docs rsync_docs)) {
+	@{$v->{$var}} = grep(!/-gcf2/, @{$v->{$var}});
+}
+
 # external manpages which we host ourselves, since some packages
 # (currently just Xapian) doesn't host manpages themselves.
 my @xman = qw(copydatabase.1 xapian-compact.1);
diff --git a/lib/PublicInbox/Gcf2Client.pm b/lib/PublicInbox/Gcf2Client.pm
new file mode 100644
index 00000000..71fbb1d1
--- /dev/null
+++ b/lib/PublicInbox/Gcf2Client.pm
@@ -0,0 +1,35 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+package PublicInbox::Gcf2Client;
+use strict;
+use parent 'PublicInbox::Git';
+use PublicInbox::Spawn qw(popen_rd);
+use IO::Handle ();
+
+sub new {
+	my $self = shift->SUPER::new('/nonexistent');
+	my ($out_r, $out_w);
+	pipe($out_r, $out_w) or $self->fail("pipe failed: $!");
+	my $cmd = [ 'public-inbox-gcf2' ];
+	@$self{qw(in pid)} = popen_rd($cmd, undef, { 0 => $out_r });
+	$self->{inflight} = [];
+	$self->{out} = $out_w;
+	fcntl($out_w, 1031, 4096) if $^O eq 'linux'; # 1031: F_SETPIPE_SZ
+	$out_w->autoflush(1);
+	$self;
+}
+
+sub add_git_dir {
+	my ($self, $git_dir) = @_;
+
+	# ensure buffers are drained, length($git_dir) may exceed
+	# PIPE_BUF on platforms where PIPE_BUF is only 512 bytes
+	my $inflight = $self->{inflight};
+	while (scalar(@$inflight)) {
+		$self->cat_async_step($inflight);
+	}
+	print { $self->{out} } $git_dir, "\n" or
+				$self->fail("write error: $!");
+}
+
+1;
diff --git a/script/public-inbox-gcf2 b/script/public-inbox-gcf2
new file mode 100755
index 00000000..51811698
--- /dev/null
+++ b/script/public-inbox-gcf2
@@ -0,0 +1,14 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+eval { require PublicInbox::Gcf2 };
+die "libgit2 development package or Inline::C missing for $0: $@\n" if $@;
+my $gcf2 = PublicInbox::Gcf2::new();
+while (<STDIN>) {
+	chomp;
+	if (m!\A/!) { # +/path/to/git-dir
+		$gcf2->add_alternate("$_/objects");
+	} else {
+		$gcf2->cat_oid(1, $_);
+	}
+}
diff --git a/t/gcf2_client.t b/t/gcf2_client.t
new file mode 100644
index 00000000..39f9f296
--- /dev/null
+++ b/t/gcf2_client.t
@@ -0,0 +1,47 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use PublicInbox::TestCommon;
+use Test::More;
+use Cwd qw(getcwd);
+use PublicInbox::Import;
+
+require_mods('PublicInbox::Gcf2');
+use_ok 'PublicInbox::Gcf2Client';
+my ($tmpdir, $for_destroy) = tmpdir();
+PublicInbox::Import::init_bare($tmpdir);
+my $fi_data = './t/git.fast-import-data';
+my $rdr = {};
+open $rdr->{0}, '<', $fi_data or BAIL_OUT $!;
+xsys([qw(git fast-import --quiet)], { GIT_DIR => $tmpdir }, $rdr);
+is($?, 0, 'fast-import succeeded');
+
+my $tree = 'fdbc43725f21f485051c17463b50185f4c3cf88c';
+my $called = 0;
+{
+	local $ENV{PATH} = getcwd()."/blib/script:$ENV{PATH}";
+	my $gcf2c = PublicInbox::Gcf2Client->new;
+	$gcf2c->add_git_dir($tmpdir);
+	$gcf2c->cat_async($tree, sub {
+		my ($bref, $oid, $type, $size, $arg) = @_;
+		is($oid, $tree, 'got expected OID');
+		is($size, 30, 'got expected length');
+		is($type, 'tree', 'got tree type');
+		is(length($$bref), 30, 'got a tree');
+		is($arg, 'hi', 'arg passed');
+		$called++;
+	}, 'hi');
+	my $trunc = substr($tree, 0, 39);
+	$gcf2c->cat_async($trunc, sub {
+		my ($bref, $oid, $type, $size, $arg) = @_;
+		is(undef, $bref, 'missing bref is undef');
+		is($oid, $trunc, 'truncated OID printed');
+		is($type, 'missing', 'type is "missing"');
+		is($size, undef, 'size is undef');
+		is($arg, 'bye', 'arg passed when missing');
+		$called++;
+	}, 'bye');
+}
+is($called, 2, 'cat_async callbacks hit');
+done_testing;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 4/7] gcf2: transparently retry on missing OID
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
                   ` (2 preceding siblings ...)
  2020-09-19  9:37 ` [PATCH 3/7] add gcf2 client and executable script Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  2020-09-19  9:37 ` [PATCH 5/7] gcf2*: more descriptive package descriptions Eric Wong
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta

Since we only get OIDs from trusted local data sources
(over.sqlite3), we can safely retry within the -gcf2 process
without worry about clients spamming us with requests for
invalid OIDs and triggering reopens.
---
 lib/PublicInbox/Gcf2Client.pm  | 11 +++++--
 lib/PublicInbox/Git.pm         |  5 ++--
 lib/PublicInbox/gcf2_libgit2.h | 17 ++++++-----
 script/public-inbox-gcf2       | 25 ++++++++++++++--
 t/gcf2.t                       | 34 ++++++++++-----------
 t/gcf2_client.t                | 54 ++++++++++++++++++++++++++++++----
 6 files changed, 109 insertions(+), 37 deletions(-)

diff --git a/lib/PublicInbox/Gcf2Client.pm b/lib/PublicInbox/Gcf2Client.pm
index 71fbb1d1..5120698f 100644
--- a/lib/PublicInbox/Gcf2Client.pm
+++ b/lib/PublicInbox/Gcf2Client.pm
@@ -7,11 +7,13 @@ use PublicInbox::Spawn qw(popen_rd);
 use IO::Handle ();
 
 sub new {
-	my $self = shift->SUPER::new('/nonexistent');
+	my ($rdr) = @_;
+	my $self = bless {}, __PACKAGE__;
 	my ($out_r, $out_w);
 	pipe($out_r, $out_w) or $self->fail("pipe failed: $!");
-	my $cmd = [ 'public-inbox-gcf2' ];
-	@$self{qw(in pid)} = popen_rd($cmd, undef, { 0 => $out_r });
+	$rdr //= {};
+	$rdr->{0} = $out_r;
+	@$self{qw(in pid)} = popen_rd(['public-inbox-gcf2'], undef, $rdr);
 	$self->{inflight} = [];
 	$self->{out} = $out_w;
 	fcntl($out_w, 1031, 4096) if $^O eq 'linux'; # 1031: F_SETPIPE_SZ
@@ -32,4 +34,7 @@ sub add_git_dir {
 				$self->fail("write error: $!");
 }
 
+# always false, since -gcf2 retries internally
+sub alternates_changed {}
+
 1;
diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index a7ba57f9..b49b5bd3 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -192,7 +192,8 @@ sub cat_async_step ($$) {
 		chop($$bref) eq "\n" or fail($self, 'LF missing after blob');
 	} elsif ($head =~ / missing$/) {
 		# ref($req) indicates it's already been retried
-		if (!ref($req) && !$in_cleanup && alternates_changed($self)) {
+		# -gcf2 retries internally, so it never hits this path:
+		if (!ref($req) && !$in_cleanup && $self->alternates_changed) {
 			return cat_async_retry($self, $inflight,
 						$req, $cb, $arg);
 		}
@@ -394,7 +395,7 @@ sub pub_urls {
 
 sub cat_async_begin {
 	my ($self) = @_;
-	cleanup($self) if alternates_changed($self);
+	cleanup($self) if $self->alternates_changed;
 	batch_prepare($self);
 	die 'BUG: already in async' if $self->{inflight};
 	$self->{inflight} = [];
diff --git a/lib/PublicInbox/gcf2_libgit2.h b/lib/PublicInbox/gcf2_libgit2.h
index d9c79cf9..800c6bad 100644
--- a/lib/PublicInbox/gcf2_libgit2.h
+++ b/lib/PublicInbox/gcf2_libgit2.h
@@ -52,9 +52,13 @@ void add_alternate(SV *self, const char *objects_path)
 	croak_if_err(rc, "git_odb_add_disk_alternate");
 }
 
-/* this requires an unabbreviated git OID */
 #define CAPA(v) (sizeof(v) / sizeof((v)[0]))
-void cat_oid(SV *self, int fd, SV *oidsv)
+
+/*
+ * returns true on success, false on failure
+ * this requires an unabbreviated git OID
+ */
+int cat_oid(SV *self, int fd, SV *oidsv)
 {
 	/*
 	 * adjust when libgit2 gets SHA-256 support, we return the
@@ -89,11 +93,8 @@ void cat_oid(SV *self, int fd, SV *oidsv)
 					git_object_type2string(
 						git_odb_object_type(object)),
 					vec[1].iov_len);
-	} else {
-		vec[0].iov_base = oidptr;
-		vec[0].iov_len = oidlen;
-		vec[1].iov_base = " missing";
-		vec[1].iov_len = strlen(vec[1].iov_base);
+	} else { /* caller retries */
+		nvec = 0;
 	}
 	while (nvec && !err) {
 		ssize_t w = writev(fd, vec + CAPA(vec) - nvec, nvec);
@@ -136,4 +137,6 @@ void cat_oid(SV *self, int fd, SV *oidsv)
 		git_odb_object_free(object);
 	if (err)
 		croak("writev error: %s", strerror(err));
+
+	return rc == GIT_OK;
 }
diff --git a/script/public-inbox-gcf2 b/script/public-inbox-gcf2
index 51811698..d2d2ac8b 100755
--- a/script/public-inbox-gcf2
+++ b/script/public-inbox-gcf2
@@ -3,12 +3,33 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 eval { require PublicInbox::Gcf2 };
 die "libgit2 development package or Inline::C missing for $0: $@\n" if $@;
+my @dirs; # may get big (30K-100K)
 my $gcf2 = PublicInbox::Gcf2::new();
+use IO::Handle; # autoflush
+STDERR->autoflush(1);
+STDOUT->autoflush(1);
+
 while (<STDIN>) {
 	chomp;
 	if (m!\A/!) { # +/path/to/git-dir
+		push @dirs, $_;
 		$gcf2->add_alternate("$_/objects");
-	} else {
-		$gcf2->cat_oid(1, $_);
+	} elsif (!$gcf2->cat_oid(1, $_)) {
+		# retry once if missing.  We only get unabbreviated OIDs
+		# from SQLite or Xapian DBs, here, so malicious clients
+		# can't trigger excessive retries:
+		my $oid = $_;
+		warn "I: $$ $oid missing, retrying...\n";
+
+		# clients may need to wait a bit for this:
+		$gcf2 = PublicInbox::Gcf2::new();
+		$gcf2->add_alternate("$_/objects") for @dirs;
+
+		if ($gcf2->cat_oid(1, $oid)) {
+			warn "I: $$ $oid found after retry\n";
+		} else {
+			warn "W: $$ $oid missing after retry\n";
+			print "$oid missing\n"; # mimic git-cat-file
+		}
 	}
 }
diff --git a/t/gcf2.t b/t/gcf2.t
index 9056b340..35b2f113 100644
--- a/t/gcf2.t
+++ b/t/gcf2.t
@@ -76,43 +76,41 @@ SKIP: {
 	}
 
 	open my $fh, '+>', undef or BAIL_OUT "open: $!";
-	my $fd = fileno($fh);
 	$fh->autoflush(1);
 
-	$gcf2->cat_oid($fd, 'invalid');
+	ok(!$gcf2->cat_oid(fileno($fh), 'invalid'), 'invalid fails');
 	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
-	is(do { local $/; <$fh> }, "invalid missing\n", 'got missing message');
+	is(do { local $/; <$fh> }, '', 'nothing written');
 
+	open $fh, '+>', undef or BAIL_OUT "open: $!";
+	ok(!$gcf2->cat_oid(fileno($fh), '0'x40), 'z40 fails');
 	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
-	$gcf2->cat_oid($fd, '0'x40);
-	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
-	is(do { local $/; <$fh> }, ('0'x40)." missing\n",
-		'got missing message for 0x40');
+	is(do { local $/; <$fh> }, '', 'nothing written for z40');
 
-	seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
-	$gcf2->cat_oid($fd, $COPYING);
-	my $buf;
+	open $fh, '+>', undef or BAIL_OUT "open: $!";
 	my $ck_copying = sub {
 		my ($desc) = @_;
 		seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
-		is(<$fh>, "$COPYING blob 34520\n", 'got expected header');
-		$buf = do { local $/; <$fh> };
+		is(<$fh>, "$COPYING blob 34520\n", "got expected header $desc");
+		my $buf = do { local $/; <$fh> };
 		is(chop($buf), "\n", 'got trailing \\n');
 		is($buf, $agpl, "AGPL matches ($desc)");
 	};
+	ok($gcf2->cat_oid(fileno($fh), $COPYING), 'cat_oid normal');
 	$ck_copying->('regular file');
 
 	$gcf2 = PublicInbox::Gcf2::new();
 	$gcf2->add_alternate("$tmpdir/objects");
-	$ck_copying->('alternates respected');
+	open $fh, '+>', undef or BAIL_OUT "open: $!";
+	ok($gcf2->cat_oid(fileno($fh), $COPYING), 'cat_oid alternate');
+	$ck_copying->('alternates after reopen');
 
-	$^O eq 'linux' or skip('pipe tests are Linux-only', 12);
-	my $size = -s $fh;
+	$^O eq 'linux' or skip('pipe tests are Linux-only', 14);
 	for my $blk (1, 0) {
 		my ($r, $w);
 		pipe($r, $w) or BAIL_OUT $!;
 		fcntl($w, 1031, 4096) or
-			skip('Linux too old for F_SETPIPE_SZ', 12);
+			skip('Linux too old for F_SETPIPE_SZ', 14);
 		$w->blocking($blk);
 		seek($fh, 0, SEEK_SET) or BAIL_OUT "seek: $!";
 		truncate($fh, 0) or BAIL_OUT "truncate: $!";
@@ -120,11 +118,11 @@ SKIP: {
 		if ($pid == 0) {
 			close $w;
 			tick; # wait for parent to block on writev
-			$buf = do { local $/; <$r> };
+			my $buf = do { local $/; <$r> };
 			print $fh $buf or _exit(1);
 			_exit(0);
 		}
-		$gcf2->cat_oid(fileno($w), $COPYING);
+		ok($gcf2->cat_oid(fileno($w), $COPYING), "cat blocking=$blk");
 		close $w or BAIL_OUT "close: $!";
 		is(waitpid($pid, 0), $pid, 'child exited');
 		is($?, 0, 'no error in child');
diff --git a/t/gcf2_client.t b/t/gcf2_client.t
index 39f9f296..0f7e7203 100644
--- a/t/gcf2_client.t
+++ b/t/gcf2_client.t
@@ -10,19 +10,25 @@ use PublicInbox::Import;
 require_mods('PublicInbox::Gcf2');
 use_ok 'PublicInbox::Gcf2Client';
 my ($tmpdir, $for_destroy) = tmpdir();
-PublicInbox::Import::init_bare($tmpdir);
+my $git_a = "$tmpdir/a.git";
+my $git_b = "$tmpdir/b.git";
+PublicInbox::Import::init_bare($git_a);
+PublicInbox::Import::init_bare($git_b);
 my $fi_data = './t/git.fast-import-data';
 my $rdr = {};
 open $rdr->{0}, '<', $fi_data or BAIL_OUT $!;
-xsys([qw(git fast-import --quiet)], { GIT_DIR => $tmpdir }, $rdr);
+xsys([qw(git fast-import --quiet)], { GIT_DIR => $git_a }, $rdr);
 is($?, 0, 'fast-import succeeded');
 
 my $tree = 'fdbc43725f21f485051c17463b50185f4c3cf88c';
 my $called = 0;
+my $err_f = "$tmpdir/err";
 {
 	local $ENV{PATH} = getcwd()."/blib/script:$ENV{PATH}";
-	my $gcf2c = PublicInbox::Gcf2Client->new;
-	$gcf2c->add_git_dir($tmpdir);
+	open my $err, '>', $err_f or BAIL_OUT $!;
+	my $gcf2c = PublicInbox::Gcf2Client::new({ 2 => $err });
+	$gcf2c->add_git_dir($git_a);
+
 	$gcf2c->cat_async($tree, sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
 		is($oid, $tree, 'got expected OID');
@@ -32,6 +38,12 @@ my $called = 0;
 		is($arg, 'hi', 'arg passed');
 		$called++;
 	}, 'hi');
+	$gcf2c->cat_async_wait;
+
+	open $err, '<', $err_f or BAIL_OUT $!;
+	my $estr = do { local $/; <$err> };
+	is($estr, '', 'nothing in stderr');
+
 	my $trunc = substr($tree, 0, 39);
 	$gcf2c->cat_async($trunc, sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
@@ -42,6 +54,38 @@ my $called = 0;
 		is($arg, 'bye', 'arg passed when missing');
 		$called++;
 	}, 'bye');
+	$gcf2c->cat_async_wait;
+
+	open $err, '<', $err_f or BAIL_OUT $!;
+	$estr = do { local $/; <$err> };
+	like($estr, qr/retrying/, 'warned about retry');
+
+	# try failed alternates lookup
+	open $err, '>', $err_f or BAIL_OUT $!;
+	$gcf2c = PublicInbox::Gcf2Client::new({ 2 => $err });
+	$gcf2c->add_git_dir($git_b);
+	$gcf2c->cat_async($tree, sub {
+		my ($bref, $oid, $type, $size, $arg) = @_;
+		is(undef, $bref, 'missing bref from alt is undef');
+		$called++;
+	});
+	$gcf2c->cat_async_wait;
+	open $err, '<', $err_f or BAIL_OUT $!;
+	$estr = do { local $/; <$err> };
+	like($estr, qr/retrying/, 'warned about retry before alt update');
+
+	# now try successful alternates lookup
+	open my $alt, '>>', "$git_b/objects/info/alternates" or BAIL_OUT $!;
+	print $alt "$git_a/objects\n" or BAIL_OUT $!;
+	close $alt or BAIL_OUT;
+	my $expect = xqx(['git', "--git-dir=$git_a", qw(cat-file tree), $tree]);
+	$gcf2c->cat_async($tree, sub {
+		my ($bref, $oid, $type, $size, $arg) = @_;
+		is($oid, $tree, 'oid match on alternates retry');
+		is($$bref, $expect, 'tree content matched');
+		$called++;
+	});
+	$gcf2c->cat_async_wait;
 }
-is($called, 2, 'cat_async callbacks hit');
+is($called, 4, 'cat_async callbacks hit');
 done_testing;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 5/7] gcf2*: more descriptive package descriptions
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
                   ` (3 preceding siblings ...)
  2020-09-19  9:37 ` [PATCH 4/7] gcf2: transparently retry on missing OID Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  2020-09-19  9:37 ` [PATCH 6/7] gcf2: require git dir with OID Eric Wong
  2020-09-19  9:37 ` [PATCH 7/7] gcf2: wire up read-only daemons and rm -gcf2 script Eric Wong
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta

Hopefully this allows others to more quickly figure out what's
going on.
---
 lib/PublicInbox/Gcf2.pm       | 5 +++--
 lib/PublicInbox/Gcf2Client.pm | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Gcf2.pm b/lib/PublicInbox/Gcf2.pm
index 6ac3aa18..fe76b1fd 100644
--- a/lib/PublicInbox/Gcf2.pm
+++ b/lib/PublicInbox/Gcf2.pm
@@ -1,7 +1,8 @@
 # Copyright (C) 2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-#
-# git-cat-file based on libgit2
+
+# backend for public-inbox-gcf2(1) (git-cat-file based on libgit2,
+# other libgit2 stuff may go here, too)
 package PublicInbox::Gcf2;
 use strict;
 use PublicInbox::Spawn qw(which popen_rd);
diff --git a/lib/PublicInbox/Gcf2Client.pm b/lib/PublicInbox/Gcf2Client.pm
index 5120698f..a048cd1a 100644
--- a/lib/PublicInbox/Gcf2Client.pm
+++ b/lib/PublicInbox/Gcf2Client.pm
@@ -1,5 +1,7 @@
 # Copyright (C) 2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# connects public-inbox processes to public-inbox-gcf2(1)
 package PublicInbox::Gcf2Client;
 use strict;
 use parent 'PublicInbox::Git';

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 6/7] gcf2: require git dir with OID
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
                   ` (4 preceding siblings ...)
  2020-09-19  9:37 ` [PATCH 5/7] gcf2*: more descriptive package descriptions Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  2020-09-19  9:37 ` [PATCH 7/7] gcf2: wire up read-only daemons and rm -gcf2 script Eric Wong
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta

This amortizes the cost of recreating PublicInbox::Gcf2 objects
when alternates change in v2 all.git.
---
 lib/PublicInbox/Gcf2Client.pm  | 13 -------------
 lib/PublicInbox/Git.pm         |  5 +++--
 lib/PublicInbox/GitAsyncCat.pm | 24 +++++++++++++++---------
 script/public-inbox-gcf2       | 14 +++++---------
 t/gcf2_client.t                | 11 ++++-------
 5 files changed, 27 insertions(+), 40 deletions(-)

diff --git a/lib/PublicInbox/Gcf2Client.pm b/lib/PublicInbox/Gcf2Client.pm
index a048cd1a..30f85c71 100644
--- a/lib/PublicInbox/Gcf2Client.pm
+++ b/lib/PublicInbox/Gcf2Client.pm
@@ -23,19 +23,6 @@ sub new {
 	$self;
 }
 
-sub add_git_dir {
-	my ($self, $git_dir) = @_;
-
-	# ensure buffers are drained, length($git_dir) may exceed
-	# PIPE_BUF on platforms where PIPE_BUF is only 512 bytes
-	my $inflight = $self->{inflight};
-	while (scalar(@$inflight)) {
-		$self->cat_async_step($inflight);
-	}
-	print { $self->{out} } $git_dir, "\n" or
-				$self->fail("write error: $!");
-}
-
 # always false, since -gcf2 retries internally
 sub alternates_changed {}
 
diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index b49b5bd3..6bb82b6b 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -190,7 +190,8 @@ sub cat_async_step ($$) {
 		$bref = my_read($self->{in}, $rbuf, $size + 1) or
 			fail($self, defined($bref) ? 'read EOF' : "read: $!");
 		chop($$bref) eq "\n" or fail($self, 'LF missing after blob');
-	} elsif ($head =~ / missing$/) {
+	} elsif ($head =~ s/ missing\n//s) {
+		$oid = $head;
 		# ref($req) indicates it's already been retried
 		# -gcf2 retries internally, so it never hits this path:
 		if (!ref($req) && !$in_cleanup && $self->alternates_changed) {
@@ -198,7 +199,7 @@ sub cat_async_step ($$) {
 						$req, $cb, $arg);
 		}
 		$type = 'missing';
-		$oid = ref($req) ? $$req : $req;
+		$oid = ref($req) ? $$req : $req if $oid eq '';
 	} else {
 		fail($self, "Unexpected result from async git cat-file: $head");
 	}
diff --git a/lib/PublicInbox/GitAsyncCat.pm b/lib/PublicInbox/GitAsyncCat.pm
index db1a7f94..8a54c608 100644
--- a/lib/PublicInbox/GitAsyncCat.pm
+++ b/lib/PublicInbox/GitAsyncCat.pm
@@ -16,21 +16,27 @@ our @EXPORT = qw(git_async_cat);
 
 sub event_step {
 	my ($self) = @_;
-	my $git = $self->{git};
-	return $self->close if ($git->{in} // 0) != ($self->{sock} // 1);
-	my $inflight = $git->{inflight};
+	my $gitish = $self->{gitish};
+	return $self->close if ($gitish->{in} // 0) != ($self->{sock} // 1);
+	my $inflight = $gitish->{inflight};
 	if ($inflight && @$inflight) {
-		$git->cat_async_step($inflight);
-		$self->requeue if @$inflight || exists $git->{cat_rbuf};
+		$gitish->cat_async_step($inflight);
+		$self->requeue if @$inflight || exists $gitish->{cat_rbuf};
 	}
 }
 
 sub git_async_cat ($$$$) {
 	my ($git, $oid, $cb, $arg) = @_;
-	$git->cat_async($oid, $cb, $arg);
-	$git->{async_cat} //= do {
-		my $self = bless { git => $git }, __PACKAGE__;
-		$self->SUPER::new($git->{in}, EPOLLIN|EPOLLET);
+	my $gitish = $git->{gcf2c}; # PublicInbox::Gcf2Client
+	if ($gitish) {
+		$oid .= " $git->{git_dir}";
+	} else {
+		$gitish = $git;
+	}
+	$gitish->cat_async($oid, $cb, $arg);
+	$gitish->{async_cat} //= do {
+		my $self = bless { gitish => $gitish }, __PACKAGE__;
+		$self->SUPER::new($gitish->{in}, EPOLLIN|EPOLLET);
 		\undef; # this is a true ref()
 	};
 }
diff --git a/script/public-inbox-gcf2 b/script/public-inbox-gcf2
index d2d2ac8b..4a44b654 100755
--- a/script/public-inbox-gcf2
+++ b/script/public-inbox-gcf2
@@ -3,7 +3,6 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 eval { require PublicInbox::Gcf2 };
 die "libgit2 development package or Inline::C missing for $0: $@\n" if $@;
-my @dirs; # may get big (30K-100K)
 my $gcf2 = PublicInbox::Gcf2::new();
 use IO::Handle; # autoflush
 STDERR->autoflush(1);
@@ -11,19 +10,16 @@ STDOUT->autoflush(1);
 
 while (<STDIN>) {
 	chomp;
-	if (m!\A/!) { # +/path/to/git-dir
-		push @dirs, $_;
-		$gcf2->add_alternate("$_/objects");
-	} elsif (!$gcf2->cat_oid(1, $_)) {
+	my ($oid, $git_dir) = split(/ /, $_, 2);
+	$gcf2->add_alternate("$git_dir/objects");
+	if (!$gcf2->cat_oid(1, $oid)) {
 		# retry once if missing.  We only get unabbreviated OIDs
 		# from SQLite or Xapian DBs, here, so malicious clients
 		# can't trigger excessive retries:
-		my $oid = $_;
-		warn "I: $$ $oid missing, retrying...\n";
+		warn "I: $$ $oid missing, retrying in $git_dir...\n";
 
-		# clients may need to wait a bit for this:
 		$gcf2 = PublicInbox::Gcf2::new();
-		$gcf2->add_alternate("$_/objects") for @dirs;
+		$gcf2->add_alternate("$git_dir/objects");
 
 		if ($gcf2->cat_oid(1, $oid)) {
 			warn "I: $$ $oid found after retry\n";
diff --git a/t/gcf2_client.t b/t/gcf2_client.t
index 0f7e7203..19462379 100644
--- a/t/gcf2_client.t
+++ b/t/gcf2_client.t
@@ -27,9 +27,7 @@ my $err_f = "$tmpdir/err";
 	local $ENV{PATH} = getcwd()."/blib/script:$ENV{PATH}";
 	open my $err, '>', $err_f or BAIL_OUT $!;
 	my $gcf2c = PublicInbox::Gcf2Client::new({ 2 => $err });
-	$gcf2c->add_git_dir($git_a);
-
-	$gcf2c->cat_async($tree, sub {
+	$gcf2c->cat_async("$tree $git_a", sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
 		is($oid, $tree, 'got expected OID');
 		is($size, 30, 'got expected length');
@@ -45,7 +43,7 @@ my $err_f = "$tmpdir/err";
 	is($estr, '', 'nothing in stderr');
 
 	my $trunc = substr($tree, 0, 39);
-	$gcf2c->cat_async($trunc, sub {
+	$gcf2c->cat_async("$trunc $git_a", sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
 		is(undef, $bref, 'missing bref is undef');
 		is($oid, $trunc, 'truncated OID printed');
@@ -63,8 +61,7 @@ my $err_f = "$tmpdir/err";
 	# try failed alternates lookup
 	open $err, '>', $err_f or BAIL_OUT $!;
 	$gcf2c = PublicInbox::Gcf2Client::new({ 2 => $err });
-	$gcf2c->add_git_dir($git_b);
-	$gcf2c->cat_async($tree, sub {
+	$gcf2c->cat_async("$tree $git_b", sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
 		is(undef, $bref, 'missing bref from alt is undef');
 		$called++;
@@ -79,7 +76,7 @@ my $err_f = "$tmpdir/err";
 	print $alt "$git_a/objects\n" or BAIL_OUT $!;
 	close $alt or BAIL_OUT;
 	my $expect = xqx(['git', "--git-dir=$git_a", qw(cat-file tree), $tree]);
-	$gcf2c->cat_async($tree, sub {
+	$gcf2c->cat_async("$tree $git_a", sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
 		is($oid, $tree, 'oid match on alternates retry');
 		is($$bref, $expect, 'tree content matched');

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 7/7] gcf2: wire up read-only daemons and rm -gcf2 script
  2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
                   ` (5 preceding siblings ...)
  2020-09-19  9:37 ` [PATCH 6/7] gcf2: require git dir with OID Eric Wong
@ 2020-09-19  9:37 ` Eric Wong
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-09-19  9:37 UTC (permalink / raw)
  To: meta

It seems easiest to have a singleton Gcf2Client client object
per daemon worker for all inboxes to use.  This reduces overall
FD usage from pipes.

The `public-inbox-gcf2' command + manpage are gone and a `$^X'
one-liner is used, instead.  This saves inodes for internal
commands and hopefully makes it easier to avoid mismatched
PERL5LIB include paths (as noticed during development :x).

We'll also make the existing cat-file process management
infrastructure more resilient to BOFHs on process killing
sprees (or in case our libgit2-based code fails on us).

(Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd
won't automatically benefit from this change, and extra
configuration will be required (to be documented later).
---
 Documentation/public-inbox-gcf2.pod | 63 -----------------------------
 MANIFEST                            |  2 -
 Makefile.PL                         |  5 ---
 lib/PublicInbox/Daemon.pm           | 11 +++++
 lib/PublicInbox/Gcf2.pm             | 36 ++++++++++++++++-
 lib/PublicInbox/Gcf2Client.pm       | 59 +++++++++++++++++++++------
 lib/PublicInbox/Git.pm              | 31 +++++++++-----
 lib/PublicInbox/GitAsyncCat.pm      | 55 +++++++++++++++++++++++--
 lib/PublicInbox/IMAP.pm             |  2 +-
 script/public-inbox-gcf2            | 31 --------------
 script/public-inbox-httpd           |  1 +
 t/gcf2_client.t                     | 14 ++++---
 12 files changed, 172 insertions(+), 138 deletions(-)
 delete mode 100644 Documentation/public-inbox-gcf2.pod
 delete mode 100755 script/public-inbox-gcf2

diff --git a/Documentation/public-inbox-gcf2.pod b/Documentation/public-inbox-gcf2.pod
deleted file mode 100644
index 813fbe7f..00000000
--- a/Documentation/public-inbox-gcf2.pod
+++ /dev/null
@@ -1,63 +0,0 @@
-=head1 NAME
-
-public-inbox-gcf2 - internal libgit2-based blob retriever
-
-=head1 SYNOPSIS
-
-	This is an internal command used by public-inbox.
-	It may change unrecognizably or cease to exist at some point
-
-=head1 DESCRIPTION
-
-public-inbox-gcf2 is an optional internal process used by
-public-inbox daemons for read-only access to underlying git
-repositories.
-
-Users are NOT expected to run public-inbox-gcf2 on their own.
-It replaces multiple C<git cat-file --batch> processes by treating
-any git repos it knows about as alternates.
-
-None of its behaviors are stable and it is ALL subject to change
-at any time.
-
-Any lines written to its standard input prefixed with a C</>
-are interpreted as a git directory.  That git directory
-will be suffixed with "/objects" and treated as an alternate.
-It writes nothing to stdout in this case.
-
-Otherwise it behaves like C<git cat-file --batch>, but only accepts
-unabbreviated hexadecimal object IDs in its standard input.
-Its output format is identical to C<git cat-file --batch>.  It
-only works for L<public-inbox-v2-format(5)> inboxes and v1
-inboxes indexed by L<public-inbox-index(1)>.
-
-=head1 OPTIONS
-
-=head1 ENVIRONMENT
-
-=over 8
-
-=item PERL_INLINE_DIRECTORY
-
-This must be set unless C<~/.cache/public-inbox/inline-c>
-exists.  C<public-inbox-gcf2> uses L<Inline::C> and libgit2
-and compiles a small shim on its first run.
-
-=back
-
-=head1 CONTACT
-
-Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
-
-The mail archives are hosted at L<https://public-inbox.org/meta/>
-and L<http://hjrcffqmbrq6wope.onion/meta/>
-
-=head1 COPYRIGHT
-
-Copyright 2020 all contributors L<mailto:meta@public-inbox.org>
-
-License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
-
-=head1 SEE ALSO
-
-L<git-cat-file(1)>
diff --git a/MANIFEST b/MANIFEST
index 91457dab..67615d4d 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -26,7 +26,6 @@ Documentation/public-inbox-config.pod
 Documentation/public-inbox-convert.pod
 Documentation/public-inbox-daemon.pod
 Documentation/public-inbox-edit.pod
-Documentation/public-inbox-gcf2.pod
 Documentation/public-inbox-httpd.pod
 Documentation/public-inbox-imapd.pod
 Documentation/public-inbox-index.pod
@@ -223,7 +222,6 @@ sa_config/user/.spamassassin/user_prefs
 script/public-inbox-compact
 script/public-inbox-convert
 script/public-inbox-edit
-script/public-inbox-gcf2
 script/public-inbox-httpd
 script/public-inbox-imapd
 script/public-inbox-index
diff --git a/Makefile.PL b/Makefile.PL
index 5a268362..3fe9acf8 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -71,11 +71,6 @@ $v->{gz_docs} = [ map { "$_.gz" } (@{$v->{docs}},@{$v->{docs_html}}) ];
 $v->{rsync_docs} = [ @{$v->{gz_docs}}, @{$v->{docs}},
 	@{$v->{docs_html}}, qw(NEWS.atom NEWS.atom.gz)];
 
-# filter out public-inbox-gcf2 from the website, it's an internal command
-for my $var (qw(gz_docs rsync_docs)) {
-	@{$v->{$var}} = grep(!/-gcf2/, @{$v->{$var}});
-}
-
 # external manpages which we host ourselves, since some packages
 # (currently just Xapian) doesn't host manpages themselves.
 my @xman = qw(copydatabase.1 xapian-compact.1);
diff --git a/lib/PublicInbox/Daemon.pm b/lib/PublicInbox/Daemon.pm
index b929ec2a..1520f8f2 100644
--- a/lib/PublicInbox/Daemon.pm
+++ b/lib/PublicInbox/Daemon.pm
@@ -19,6 +19,7 @@ use PublicInbox::Syscall qw($SFD_NONBLOCK);
 require PublicInbox::Listener;
 use PublicInbox::EOFpipe;
 use PublicInbox::Sigfd;
+use PublicInbox::GitAsyncCat;
 my @CMD;
 my ($set_user, $oldset);
 my (@cfg_listen, $stdout, $stderr, $group, $user, $pid_file, $daemonize);
@@ -652,6 +653,16 @@ sub run ($$$;$) {
 	daemon_prepare($default);
 	my $af_default = $default =~ /:8080\z/ ? 'httpready' : undef;
 	my $for_destroy = daemonize();
+
+	# this wastes a bit of memory for non-PublicInbox::WWW -httpd users
+	# oh well...
+	eval {
+		require PublicInbox::Gcf2;
+		require PublicInbox::Gcf2Client;
+	};
+	local $PublicInbox::GitAsyncCat::GCF2C =
+				PublicInbox::Gcf2Client::new() if !$@;
+
 	daemon_loop($refresh, $post_accept, $tlsd, $af_default);
 	PublicInbox::DS->Reset;
 	# ->DESTROY runs when $for_destroy goes out-of-scope
diff --git a/lib/PublicInbox/Gcf2.pm b/lib/PublicInbox/Gcf2.pm
index fe76b1fd..7983c841 100644
--- a/lib/PublicInbox/Gcf2.pm
+++ b/lib/PublicInbox/Gcf2.pm
@@ -1,12 +1,13 @@
 # Copyright (C) 2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
-# backend for public-inbox-gcf2(1) (git-cat-file based on libgit2,
-# other libgit2 stuff may go here, too)
+# backend for a git-cat-file-workalike based on libgit2,
+# other libgit2 stuff may go here, too.
 package PublicInbox::Gcf2;
 use strict;
 use PublicInbox::Spawn qw(which popen_rd);
 use Fcntl qw(LOCK_EX);
+use IO::Handle; # autoflush
 my (%CFG, $c_src, $lockfh);
 BEGIN {
 	# PublicInbox::Spawn will set PERL_INLINE_DIRECTORY
@@ -54,4 +55,35 @@ use Inline C => $c_src;
 undef $c_src;
 undef %CFG;
 undef $lockfh;
+
+# Usage: $^X -MPublicInbox::Gcf2 -e 'PublicInbox::Gcf2::loop()'
+# (see lib/PublicInbox/Gcf2Client.pm)
+sub loop {
+	my $gcf2 = new();
+	STDERR->autoflush(1);
+	STDOUT->autoflush(1);
+
+	while (<STDIN>) {
+		chomp;
+		my ($oid, $git_dir) = split(/ /, $_, 2);
+		$gcf2->add_alternate("$git_dir/objects");
+		if (!$gcf2->cat_oid(1, $oid)) {
+			# retry once if missing.  We only get unabbreviated OIDs
+			# from SQLite or Xapian DBs, here, so malicious clients
+			# can't trigger excessive retries:
+			warn "I: $$ $oid missing, retrying in $git_dir\n";
+
+			$gcf2 = new();
+			$gcf2->add_alternate("$git_dir/objects");
+
+			if ($gcf2->cat_oid(1, $oid)) {
+				warn "I: $$ $oid found after retry\n";
+			} else {
+				warn "W: $$ $oid missing after retry\n";
+				print "$oid missing\n"; # mimic git-cat-file
+			}
+		}
+	}
+}
+
 1;
diff --git a/lib/PublicInbox/Gcf2Client.pm b/lib/PublicInbox/Gcf2Client.pm
index 30f85c71..42ff1bf3 100644
--- a/lib/PublicInbox/Gcf2Client.pm
+++ b/lib/PublicInbox/Gcf2Client.pm
@@ -1,29 +1,62 @@
 # Copyright (C) 2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
-# connects public-inbox processes to public-inbox-gcf2(1)
+# connects public-inbox processes to PublicInbox::Gcf2::loop()
 package PublicInbox::Gcf2Client;
 use strict;
-use parent 'PublicInbox::Git';
+use parent qw(PublicInbox::DS);
+use PublicInbox::Git;
 use PublicInbox::Spawn qw(popen_rd);
 use IO::Handle ();
+use PublicInbox::Syscall qw(EPOLLONESHOT EPOLLOUT);
+# fields:
+#	async_cat => GitAsyncCat ref (read-only pipe)
+#	sock => writable pipe to Gcf2::loop
 
-sub new {
-	my ($rdr) = @_;
-	my $self = bless {}, __PACKAGE__;
+sub new { bless($_[0] // {}, __PACKAGE__) }
+
+sub gcf2c_begin ($) {
+	my ($self) = @_;
+	# ensure the child process has the same @INC we do:
+	my $env = { PERL5LIB => join(':', @INC) };
 	my ($out_r, $out_w);
-	pipe($out_r, $out_w) or $self->fail("pipe failed: $!");
-	$rdr //= {};
-	$rdr->{0} = $out_r;
-	@$self{qw(in pid)} = popen_rd(['public-inbox-gcf2'], undef, $rdr);
-	$self->{inflight} = [];
-	$self->{out} = $out_w;
+	pipe($out_r, $out_w) or die "pipe failed: $!";
+	my $rdr = { 0 => $out_r, 2 => $self->{2} };
+	my $cmd = [$^X, qw[-MPublicInbox::Gcf2 -e PublicInbox::Gcf2::loop()]];
+	@$self{qw(in pid)} = popen_rd($cmd, $env, $rdr);
 	fcntl($out_w, 1031, 4096) if $^O eq 'linux'; # 1031: F_SETPIPE_SZ
 	$out_w->autoflush(1);
-	$self;
+	$out_w->blocking(0);
+	$self->SUPER::new($out_w, 0); # EPOLL_CTL_ADD (a bit wasteful :x)
+	$self->{inflight} = [];
+}
+
+sub fail {
+	my $self = shift;
+	$self->close; # PublicInbox::DS::close
+	PublicInbox::Git::fail($self, @_);
+}
+
+sub cat_async ($$$;$) {
+	my ($self, $req, $cb, $arg) = @_;
+	my $inflight = $self->{inflight} // gcf2c_begin($self);
+
+	# rare, I hope:
+	cat_async_step($self, $inflight) if $self->{wbuf};
+
+	$self->write(\"$req\n") or $self->fail("gcf2c write: $!");
+	push @$inflight, $req, $cb, $arg;
 }
 
-# always false, since -gcf2 retries internally
+# ensure PublicInbox::Git::cat_async_step never calls cat_async_retry
 sub alternates_changed {}
 
+no warnings 'once';
+
+# this is the write-only end of a pipe, DS->EventLoop will call this
+*event_step = \&PublicInbox::DS::flush_write;
+
+# used by GitAsyncCat
+*cat_async_step = \&PublicInbox::Git::cat_async_step;
+
 1;
diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index 6bb82b6b..2323cecc 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -185,11 +185,12 @@ sub cat_async_step ($$) {
 	my $rbuf = delete($self->{cat_rbuf}) // \(my $new = '');
 	my ($bref, $oid, $type, $size);
 	my $head = my_readline($self->{in}, $rbuf);
+	# ->fail may be called via Gcf2Client.pm
 	if ($head =~ /^([0-9a-f]{40,}) (\S+) ([0-9]+)$/) {
 		($oid, $type, $size) = ($1, $2, $3 + 0);
 		$bref = my_read($self->{in}, $rbuf, $size + 1) or
-			fail($self, defined($bref) ? 'read EOF' : "read: $!");
-		chop($$bref) eq "\n" or fail($self, 'LF missing after blob');
+			$self->fail(defined($bref) ? 'read EOF' : "read: $!");
+		chop($$bref) eq "\n" or $self->fail('LF missing after blob');
 	} elsif ($head =~ s/ missing\n//s) {
 		$oid = $head;
 		# ref($req) indicates it's already been retried
@@ -201,7 +202,7 @@ sub cat_async_step ($$) {
 		$type = 'missing';
 		$oid = ref($req) ? $$req : $req if $oid eq '';
 	} else {
-		fail($self, "Unexpected result from async git cat-file: $head");
+		$self->fail("Unexpected result from async git cat-file: $head");
 	}
 	eval { $cb->($bref, $oid, $type, $size, $arg) };
 	$self->{cat_rbuf} = $rbuf if $$rbuf ne '';
@@ -304,10 +305,12 @@ sub check {
 
 sub _destroy {
 	my ($self, $rbuf, $in, $out, $pid, $err) = @_;
-	my $p = delete $self->{$pid} or return;
 	delete @$self{($rbuf, $in, $out)};
 	delete $self->{$err} if $err; # `err_c'
 
+	# GitAsyncCat::event_step may delete {pid}
+	my $p = delete $self->{$pid} or return;
+
 	# PublicInbox::DS may not be loaded
 	eval { PublicInbox::DS::dwaitpid($p, undef, undef) };
 	waitpid($p, 0) if $@; # wait synchronously if not in event loop
@@ -315,14 +318,21 @@ sub _destroy {
 
 sub cat_async_abort ($) {
 	my ($self) = @_;
-	my $inflight = delete $self->{inflight} or die 'BUG: not in async';
+	if (my $inflight = delete $self->{inflight}) {
+		while (@$inflight) {
+			my ($req, $cb, $arg) = splice(@$inflight, 0, 3);
+			$req =~ s/ .*//; # drop git_dir for Gcf2Client
+			eval { $cb->(undef, $req, undef, undef, $arg) };
+			warn "E: $req: $@ (in abort)\n" if $@;
+		}
+	}
 	cleanup($self);
 }
 
 sub fail {
 	my ($self, $msg) = @_;
-	$self->{inflight} ? cat_async_abort($self) : cleanup($self);
-	croak("git $self->{git_dir}: $msg");
+	cat_async_abort($self);
+	croak(ref($self) . ' ' . ($self->{git_dir} // '') . ": $msg");
 }
 
 sub popen {
@@ -352,6 +362,7 @@ sub cleanup {
 	!!($self->{pid} || $self->{pid_c});
 }
 
+
 # assuming a well-maintained repo, this should be a somewhat
 # accurate estimation of its size
 # TODO: show this in the WWW UI as a hint to potential cloners
@@ -397,7 +408,7 @@ sub pub_urls {
 sub cat_async_begin {
 	my ($self) = @_;
 	cleanup($self) if $self->alternates_changed;
-	batch_prepare($self);
+	$self->batch_prepare;
 	die 'BUG: already in async' if $self->{inflight};
 	$self->{inflight} = [];
 }
@@ -413,11 +424,9 @@ sub cat_async ($$$;$) {
 	push(@$inflight, $oid, $cb, $arg);
 }
 
-# this is safe to call inside $cb, but not guaranteed to enqueue
-# returns true if successful, undef if not.
 sub async_prefetch {
 	my ($self, $oid, $cb, $arg) = @_;
-	if (defined($self->{async_cat}) && (my $inflight = $self->{inflight})) {
+	if (my $inflight = $self->{inflight}) {
 		# we could use MAX_INFLIGHT here w/o the halving,
 		# but lets not allow one client to monopolize a git process
 		if (scalar(@$inflight) < int(MAX_INFLIGHT/2)) {
diff --git a/lib/PublicInbox/GitAsyncCat.pm b/lib/PublicInbox/GitAsyncCat.pm
index 8a54c608..b9dbe0cc 100644
--- a/lib/PublicInbox/GitAsyncCat.pm
+++ b/lib/PublicInbox/GitAsyncCat.pm
@@ -11,23 +11,49 @@
 package PublicInbox::GitAsyncCat;
 use strict;
 use parent qw(PublicInbox::DS Exporter);
+use POSIX qw(WNOHANG);
 use PublicInbox::Syscall qw(EPOLLIN EPOLLET);
-our @EXPORT = qw(git_async_cat);
+our @EXPORT = qw(git_async_cat git_async_prefetch);
+use PublicInbox::Git ();
+
+our $GCF2C; # singleton PublicInbox::Gcf2Client
+
+sub close {
+	my ($self) = @_;
+
+	if (my $gitish = delete $self->{gitish}) {
+		PublicInbox::Git::cat_async_abort($gitish);
+	}
+	$self->SUPER::close; # PublicInbox::DS::close
+}
 
 sub event_step {
 	my ($self) = @_;
-	my $gitish = $self->{gitish};
+	my $gitish = $self->{gitish} or return;
 	return $self->close if ($gitish->{in} // 0) != ($self->{sock} // 1);
 	my $inflight = $gitish->{inflight};
 	if ($inflight && @$inflight) {
 		$gitish->cat_async_step($inflight);
-		$self->requeue if @$inflight || exists $gitish->{cat_rbuf};
+
+		# child death?
+		if (($gitish->{in} // 0) != ($self->{sock} // 1)) {
+			$self->close;
+		} elsif (@$inflight || exists $gitish->{cat_rbuf}) {
+			# ok, more to do, requeue for fairness
+			$self->requeue;
+		}
+	} elsif ((my $pid = waitpid($gitish->{pid}, WNOHANG)) > 0) {
+		# May happen if the child process is killed by a BOFH
+		# (or segfaults)
+		delete $gitish->{pid};
+		warn "E: gitish $pid exited with \$?=$?\n";
+		$self->close;
 	}
 }
 
 sub git_async_cat ($$$$) {
 	my ($git, $oid, $cb, $arg) = @_;
-	my $gitish = $git->{gcf2c}; # PublicInbox::Gcf2Client
+	my $gitish = $GCF2C;
 	if ($gitish) {
 		$oid .= " $git->{git_dir}";
 	} else {
@@ -41,4 +67,25 @@ sub git_async_cat ($$$$) {
 	};
 }
 
+# this is safe to call inside $cb, but not guaranteed to enqueue
+# returns true if successful, undef if not.
+sub git_async_prefetch {
+	my ($git, $oid, $cb, $arg) = @_;
+	if ($GCF2C) {
+		if ($GCF2C->{async_cat} && !$GCF2C->{wbuf}) {
+			$oid .= " $git->{git_dir}";
+			return $GCF2C->cat_async($oid, $cb, $arg);
+		}
+	} elsif ($git->{async_cat} && (my $inflight = $git->{inflight})) {
+		# we could use MAX_INFLIGHT here w/o the halving,
+		# but lets not allow one client to monopolize a git process
+		if (@$inflight < int(PublicInbox::Git::MAX_INFLIGHT/2)) {
+			print { $git->{out} } $oid, "\n" or
+						$git->fail("write error: $!");
+			return push(@$inflight, $oid, $cb, $arg);
+		}
+	}
+	undef;
+}
+
 1;
diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 47c08aea..a861282f 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -626,7 +626,7 @@ sub fetch_blob_cb { # called by git->cat_async via git_async_cat
 	}
 	my $pre;
 	if (!$self->{wbuf} && (my $nxt = $msgs->[0])) {
-		$pre = $self->{ibx}->git->async_prefetch($nxt->{blob},
+		$pre = git_async_prefetch($self->{ibx}->git, $nxt->{blob},
 						\&fetch_blob_cb, $fetch_arg);
 	}
 	fetch_run_ops($self, $smsg, $bref, $ops, $partial);
diff --git a/script/public-inbox-gcf2 b/script/public-inbox-gcf2
deleted file mode 100755
index 4a44b654..00000000
--- a/script/public-inbox-gcf2
+++ /dev/null
@@ -1,31 +0,0 @@
-#!perl -w
-# Copyright (C) 2020 all contributors <meta@public-inbox.org>
-# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-eval { require PublicInbox::Gcf2 };
-die "libgit2 development package or Inline::C missing for $0: $@\n" if $@;
-my $gcf2 = PublicInbox::Gcf2::new();
-use IO::Handle; # autoflush
-STDERR->autoflush(1);
-STDOUT->autoflush(1);
-
-while (<STDIN>) {
-	chomp;
-	my ($oid, $git_dir) = split(/ /, $_, 2);
-	$gcf2->add_alternate("$git_dir/objects");
-	if (!$gcf2->cat_oid(1, $oid)) {
-		# retry once if missing.  We only get unabbreviated OIDs
-		# from SQLite or Xapian DBs, here, so malicious clients
-		# can't trigger excessive retries:
-		warn "I: $$ $oid missing, retrying in $git_dir...\n";
-
-		$gcf2 = PublicInbox::Gcf2::new();
-		$gcf2->add_alternate("$git_dir/objects");
-
-		if ($gcf2->cat_oid(1, $oid)) {
-			warn "I: $$ $oid found after retry\n";
-		} else {
-			warn "W: $$ $oid missing after retry\n";
-			print "$oid missing\n"; # mimic git-cat-file
-		}
-	}
-}
diff --git a/script/public-inbox-httpd b/script/public-inbox-httpd
index b8159f3a..3befdab8 100755
--- a/script/public-inbox-httpd
+++ b/script/public-inbox-httpd
@@ -13,6 +13,7 @@ BEGIN {
 	require PublicInbox::HTTP;
 	require PublicInbox::HTTPD;
 }
+
 my %httpds;
 my $app;
 my $refresh = sub {
diff --git a/t/gcf2_client.t b/t/gcf2_client.t
index 19462379..f1302a54 100644
--- a/t/gcf2_client.t
+++ b/t/gcf2_client.t
@@ -6,6 +6,7 @@ use PublicInbox::TestCommon;
 use Test::More;
 use Cwd qw(getcwd);
 use PublicInbox::Import;
+use PublicInbox::DS;
 
 require_mods('PublicInbox::Gcf2');
 use_ok 'PublicInbox::Gcf2Client';
@@ -24,8 +25,8 @@ my $tree = 'fdbc43725f21f485051c17463b50185f4c3cf88c';
 my $called = 0;
 my $err_f = "$tmpdir/err";
 {
-	local $ENV{PATH} = getcwd()."/blib/script:$ENV{PATH}";
-	open my $err, '>', $err_f or BAIL_OUT $!;
+	PublicInbox::DS->Reset;
+	open my $err, '>>', $err_f or BAIL_OUT $!;
 	my $gcf2c = PublicInbox::Gcf2Client::new({ 2 => $err });
 	$gcf2c->cat_async("$tree $git_a", sub {
 		my ($bref, $oid, $type, $size, $arg) = @_;
@@ -36,7 +37,7 @@ my $err_f = "$tmpdir/err";
 		is($arg, 'hi', 'arg passed');
 		$called++;
 	}, 'hi');
-	$gcf2c->cat_async_wait;
+	$gcf2c->cat_async_step($gcf2c->{inflight});
 
 	open $err, '<', $err_f or BAIL_OUT $!;
 	my $estr = do { local $/; <$err> };
@@ -52,13 +53,14 @@ my $err_f = "$tmpdir/err";
 		is($arg, 'bye', 'arg passed when missing');
 		$called++;
 	}, 'bye');
-	$gcf2c->cat_async_wait;
+	$gcf2c->cat_async_step($gcf2c->{inflight});
 
 	open $err, '<', $err_f or BAIL_OUT $!;
 	$estr = do { local $/; <$err> };
 	like($estr, qr/retrying/, 'warned about retry');
 
 	# try failed alternates lookup
+	PublicInbox::DS->Reset;
 	open $err, '>', $err_f or BAIL_OUT $!;
 	$gcf2c = PublicInbox::Gcf2Client::new({ 2 => $err });
 	$gcf2c->cat_async("$tree $git_b", sub {
@@ -66,7 +68,7 @@ my $err_f = "$tmpdir/err";
 		is(undef, $bref, 'missing bref from alt is undef');
 		$called++;
 	});
-	$gcf2c->cat_async_wait;
+	$gcf2c->cat_async_step($gcf2c->{inflight});
 	open $err, '<', $err_f or BAIL_OUT $!;
 	$estr = do { local $/; <$err> };
 	like($estr, qr/retrying/, 'warned about retry before alt update');
@@ -82,7 +84,7 @@ my $err_f = "$tmpdir/err";
 		is($$bref, $expect, 'tree content matched');
 		$called++;
 	});
-	$gcf2c->cat_async_wait;
+	$gcf2c->cat_async_step($gcf2c->{inflight});
 }
 is($called, 4, 'cat_async callbacks hit');
 done_testing;

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-09-19  9:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-19  9:37 [PATCH 0/7] gcf2: libgit2-based cat-file alternative Eric Wong
2020-09-19  9:37 ` [PATCH 1/7] gcf2: libgit2-based git " Eric Wong
2020-09-19  9:37 ` [PATCH 2/7] t/gcf2: test changes to alternates Eric Wong
2020-09-19  9:37 ` [PATCH 3/7] add gcf2 client and executable script Eric Wong
2020-09-19  9:37 ` [PATCH 4/7] gcf2: transparently retry on missing OID Eric Wong
2020-09-19  9:37 ` [PATCH 5/7] gcf2*: more descriptive package descriptions Eric Wong
2020-09-19  9:37 ` [PATCH 6/7] gcf2: require git dir with OID Eric Wong
2020-09-19  9:37 ` [PATCH 7/7] gcf2: wire up read-only daemons and rm -gcf2 script Eric Wong

user/dev discussion of public-inbox itself

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 meta meta/ https://public-inbox.org/meta \
		meta@public-inbox.org
	public-inbox-index meta

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for the project(s) associated with this inbox:

	https://80x24.org/public-inbox.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git