user/dev discussion of public-inbox itself
* Alternate permalink URLs - for migration from other/custom archive solutions
@ 2023-11-19 23:47 Robin H. Johnson
  2023-11-20  3:21 ` [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs) Eric Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Robin H. Johnson @ 2023-11-19 23:47 UTC (permalink / raw)
  To: meta

Hi,

This is more of a feature request / request for pointers on how to tweak
the design to support something, and it might be suited to maintaining
as a local patch.

The permalinks offered by public-inbox are great, but at Gentoo Linux,
we'd like to ALSO continue to offer our historical permalinks.

For those, the permalink slug portion was built when the mail arrived
into the archives ingest pipeline.

Example legacy link:
https://archives.gentoo.org/gentoo-dev/message/499b958da430b925dbd2f2b58e0f507e

We'd need to tweak the index somehow to expose it.

That same mail as visible in our public-inbox test site:
https://public-inbox.gentoo.org/gentoo-dev/538ce05eef3f4df3468cbc7f7abfa90eb2ea7d51.camel@gentoo.org/raw

The permalink slug is in the header:
X-Archives-Hash: 499b958da430b925dbd2f2b58e0f507e

This needs to end up in the Xapian index (which doesn't seem to index
headers right now), and then get wired up as a route:
On access, redirect to public-inbox permalink.

Pointers on where in the codebase to wire up the Xapian side greatly
appreciated, since it doesn't seem to be indexing arbitrary headers
right now.

-- 
Robin Hugh Johnson
Pronouns   : They/he
E-Mail     : robbat2@orbis-terrarum.net

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1113 bytes --]

* [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs)
  2023-11-19 23:47 Alternate permalink URLs - for migration from other/custom archive solutions Robin H. Johnson
@ 2023-11-20  3:21 ` Eric Wong
  2023-12-08 21:23   ` Eric Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Wong @ 2023-11-20  3:21 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: meta

"Robin H. Johnson" <robbat2@orbis-terrarum.net> wrote:
> Hi,
> 
> This is more of a feature request / request for pointers on how to tweak
> the design to support something, and it might be suited to maintaining
> as a local patch.

Since the indexing internals are somewhat in flux and tied to
Xapian and Perl, I'm happy to carry it to ensure it stays
working (similar to the "altid" and existing Filter stuff).

> The permalinks offered by public-inbox are great, but at Gentoo Linux,
> we'd like to ALSO continue to offer our historical permalinks.

Would you want the historical permalinks displayed in the
PublicInbox::WWW HTML UI?  That is already the slowest and most
expensive part of public-inbox, so I'm hesitant to support more
options which slow it down (though I'm halfway considering
introducing more C to speed that part up...)

With the patch below, you should be able to use:

https://public-inbox.gentoo.org/gentoo-dev/?q=xarchiveshash:499b958da430b925dbd2f2b58e0f507e

This works the same way as <https://public-inbox.org/git/?q=gmane:123>.

(Maybe that would be better with an "I'm Feeling Lucky" search...)
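
For illustration only, here is a minimal sketch of how a legacy
permalink could be rewritten into that search URL, e.g. in a small
redirector in front of public-inbox.  The helper name
legacy_to_search_url is hypothetical and not part of public-inbox; the
hostnames and URL layout are assumed from the examples in this thread.

```perl
#!/usr/bin/perl
# Sketch only: rewrite a legacy Gentoo archive permalink into a
# public-inbox search URL using the "xarchiveshash:" prefix.
use v5.12;
use warnings;

# Assumption: URL layout as seen in the examples quoted in this thread
my $LEGACY_RE = qr!\Ahttps://archives\.gentoo\.org/([\w.+-]+)/message/([a-f0-9]{32})\z!;
my $PI_BASE = 'https://public-inbox.gentoo.org';

# hypothetical helper: returns the search URL, or undef if $url is not
# a recognizable legacy permalink
sub legacy_to_search_url {
	my ($url) = @_;
	my ($list, $hash) = ($url =~ $LEGACY_RE) or return;
	return "$PI_BASE/$list/?q=xarchiveshash:$hash";
}

my $legacy = 'https://archives.gentoo.org/gentoo-dev/message/' .
	'499b958da430b925dbd2f2b58e0f507e';
say legacy_to_search_url($legacy) // 'no match';
```

The strict 32-character hex match mirrors the check in the patch, so
anything else is rejected rather than passed through to the search.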

> For those, the permalink slug portion was built when the mail arrived
> into the archives ingest pipeline.
> 
> Example legacy link:
> https://archives.gentoo.org/gentoo-dev/message/499b958da430b925dbd2f2b58e0f507e
> 
> We'd need to tweak the index somehow to expose it.
> 
> That same mail as visible in our public-inbox test site:
> https://public-inbox.gentoo.org/gentoo-dev/538ce05eef3f4df3468cbc7f7abfa90eb2ea7d51.camel@gentoo.org/raw
> 
> The permalink slug is in the header:
> X-Archives-Hash: 499b958da430b925dbd2f2b58e0f507e
> 
> This needs to end up in the Xapian index (which doesn't seem to index
> headers right now), and then get wired up as a route:
> On access, redirect to public-inbox permalink.
> 
> Pointers on where in the codebase to wire up the Xapian side greatly
> appreciated, since it doesn't seem to be indexing arbitrary headers
> right now.

The indexing+search part is something that's been requested by
others, too.  With the below patch, setting:

	altid = indexfilter:xarchiveshash:package=XArchivesHash

for a given inbox, you should be able to search on
"xarchiveshash:$hash" the same way the "gmane:$INTEGER" altid
search works for public-inbox.org/git/
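
To make the altid spec format concrete, here is a rough standalone
sketch of how such a value is split into type, prefix and parameters.
It loosely mirrors the parsing in PublicInbox::AltId::new from the
patch, but parse_altid_spec is a hypothetical helper written for this
example, and it skips the URI-unescaping that the real code performs.

```perl
#!/usr/bin/perl
# Sketch only: split an altid spec into its parts.
use v5.12;
use warnings;

# hypothetical helper, not part of public-inbox
sub parse_altid_spec {
	my ($spec) = @_;
	# a limit of 3 keeps any ':' in the query part intact (e.g. "Foo::Bar")
	my ($type, $prefix, $query) = split(/:/, $spec, 3);
	my %params = map {
		my ($k, $v) = split(/=/, $_, 2);
		($k, $v // '');
	} split(/[&;]/, $query // '');
	# package names without "::" resolve under PublicInbox::IndexFilter
	my $pkg = $params{package};
	$pkg = "PublicInbox::IndexFilter::$pkg"
		if defined($pkg) && $pkg !~ /::/;
	my %parsed = (
		type => $type,
		prefix => $prefix,
		xprefix => 'X'.uc($prefix), # Xapian term prefix
		package => $pkg,
	);
	return \%parsed;
}

my $p = parse_altid_spec('indexfilter:xarchiveshash:package=XArchivesHash');
say join(' ', @$p{qw(type prefix xprefix package)});
```

The user-visible search prefix stays lowercase ("xarchiveshash:"),
while the term stored in Xapian uses the uppercased form with a
leading "X".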

Sidenote: Unfortunately, altid needs to be configured per-inbox, but
	I suppose indexfilter (unlike serial) makes sense to
	support globally in the future...

You can also replace "xarchiveshash" with any unused
all-lowercase prefix (my brain kept leaving out the "s" while
writing tests and I was puzzled why it didn't work at first :x).

If you want to carry a private plugin to search on "foo:" using
MyPackage::Foo, you should be able to add this to the
publicinbox.$NAME section:

	altid = indexfilter:foo:package=MyPackage::Foo

But I'm a bit hesitant to declare the indexing internals a
stable API to support into eternity.  So I'd rather take a patch
to handle stuff in the PublicInbox::IndexFilter::* namespace.
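
As a rough sketch of what such a private plugin might look like,
modeled on the XArchivesHash package in the patch below: the
"X-Foo-Id" header and its token format are made up for this example,
and StubEml/StubDoc are stand-ins for PublicInbox::Eml and a Xapian
document so the sketch runs without public-inbox installed.  The
index_filter calling convention is internal and may change.

```perl
#!/usr/bin/perl
# Sketch only: a private index filter for a hypothetical header.
use v5.14;
use warnings;

package MyPackage::Foo {
	sub new { bless {}, __PACKAGE__ }

	# called by SearchIdx with the Xapian doc, the parsed message
	# and the Xapian term prefix (e.g. "XFOO" for "foo:")
	sub index_filter {
		my ($self, $sidx, $doc, $eml, $pfx) = @_;
		# 'X-Foo-Id' is hypothetical; accept only simple tokens
		my @ids = grep /\A[a-z0-9]{1,64}\z/,
			$eml->header_raw('X-Foo-Id');
		$doc->add_boolean_term($pfx.$_) for @ids;
	}
}

# minimal stand-in for PublicInbox::Eml (assumption, demo only)
package StubEml {
	sub new { my ($class, %h) = @_; bless { %h }, $class }
	sub header_raw {
		my ($self, $name) = @_;
		@{$self->{$name} // []};
	}
}

# minimal stand-in for a Xapian document (assumption, demo only)
package StubDoc {
	sub new { my ($class) = @_; bless +{ terms => [] }, $class }
	sub add_boolean_term {
		my ($self, $term) = @_;
		push @{$self->{terms}}, $term;
	}
}

package main;
my $doc = StubDoc->new;
my $eml = StubEml->new('X-Foo-Id' => [ 'abc123', 'Not Valid!' ]);
MyPackage::Foo->new->index_filter(undef, $doc, $eml, 'XFOO');
say for @{$doc->{terms}}; # prints "XFOOabc123"
```

Only the well-formed value is indexed; the malformed one is silently
skipped here, whereas XArchivesHash warns via carp when no valid hash
is found.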

------8<-------
Subject: [RFC] altid: start supporting indexfilter type

In addition to the traditional AltId serial numbers from
external sources (e.g. gmane), we can support Xapian-only
indexing filters using Perl packages in the
PublicInbox::IndexFilter::* namespace.

Unlike the old `serial' type, this requires no separate SQLite
DB since its data is expected to be contained within the raw
message.  `indexfilter' only affects Xapian indexing, and isn't
subject to the stricter `serial' type which enforced a 1:1
Message-ID <=> integer relationship used for NNTP.

Unlike the existing PublicInbox::Filter::* namespace, this
doesn't affect message delivery paths (-watch/-mda) at all
and can be used from (clone|fetch)-synchronized mirrors.

The new PublicInbox::IndexFilter::XArchivesHash may be a
starting point for Gentoo archives, but other packages can
be added for other hosts.

This depends on Perl modules being implemented for each case;
but I figure using Perl directly is preferable to having some
new syntax that gets translated (likely poorly!) to actual Perl.
In other words, we're trying not to reinvent or reimplement
procmail, sieve, or any other mail processing language.

Link: https://public-inbox.org/meta/robbat2-20231119T232932-954868624Z@orbis-terrarum.net/
---
 MANIFEST                                     |  2 +
 lib/PublicInbox/AltId.pm                     | 32 ++++---
 lib/PublicInbox/IndexFilter/XArchivesHash.pm | 30 +++++++
 lib/PublicInbox/Search.pm                    | 17 +++-
 lib/PublicInbox/SearchIdx.pm                 | 11 ++-
 t/watch_indexfilter_xarchiveshash.t          | 90 ++++++++++++++++++++
 6 files changed, 164 insertions(+), 18 deletions(-)
 create mode 100644 lib/PublicInbox/IndexFilter/XArchivesHash.pm
 create mode 100644 t/watch_indexfilter_xarchiveshash.t

diff --git a/MANIFEST b/MANIFEST
index e1c3dc97..d4173f20 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -226,6 +226,7 @@ lib/PublicInbox/In2Tie.pm
 lib/PublicInbox/Inbox.pm
 lib/PublicInbox/InboxIdle.pm
 lib/PublicInbox/InboxWritable.pm
+lib/PublicInbox/IndexFilter/XArchivesHash.pm
 lib/PublicInbox/Inotify.pm
 lib/PublicInbox/InputPipe.pm
 lib/PublicInbox/Isearch.pm
@@ -614,6 +615,7 @@ t/v2writable.t
 t/view.t
 t/watch_filter_rubylang.t
 t/watch_imap.t
+t/watch_indexfilter_xarchiveshash.t
 t/watch_maildir.t
 t/watch_maildir_v2.t
 t/watch_multiple_headers.t
diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index 80757ceb..5b917edb 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -21,27 +21,37 @@ use PublicInbox::Msgmap;
 sub new {
 	my ($class, $ibx, $spec, $writable) = @_;
 	my ($type, $prefix, $query) = split(/:/, $spec, 3);
-	$type eq 'serial' or die "non-serial not supported, yet\n";
 	$prefix =~ /\A\w+\z/ or warn "non-word prefix not searchable\n";
 	my %params = map {
 		my ($k, $v) = split(/=/, uri_unescape($_), 2);
 		$v = '' unless defined $v;
 		($k, $v);
 	} split(/[&;]/, $query);
-	my $f = $params{file} or die "file: required for $type spec $spec\n";
-	unless (index($f, '/') == 0) {
-		if ($ibx->version == 1) {
-			$f = "$ibx->{inboxdir}/public-inbox/$f";
-		} else {
-			$f = "$ibx->{inboxdir}/$f";
-		}
-	}
-	bless {
-		filename => $f,
+	my $self = bless {
 		writable => $writable,
 		prefix => $prefix,
 		xprefix => 'X'.uc($prefix),
 	}, $class;
+	if ($type eq 'serial') { # traditional message-ID <=> integer mapping
+		my $f = $params{file} or die
+			"E: file required for $type altid=$spec\n";
+		unless (index($f, '/') == 0) {
+			$f = $ibx->version == 1 ?
+				"$ibx->{inboxdir}/public-inbox/$f" :
+				"$ibx->{inboxdir}/$f";
+		}
+		$self->{filename} = $f;
+	} elsif ($type eq 'indexfilter') {
+		my $pkg = $params{package} //
+			die "E: package= unset for altid=$spec\n";
+		$pkg =~ m!::! or $pkg = "PublicInbox::IndexFilter::$pkg";
+		eval "require $pkg";
+		die "E: could not load $pkg for altid=$spec: $@" if $@;
+		$self->{indexfilter} = $pkg->new;
+	} else {
+		die "non-serial/non-indexfilter not supported, yet ($type)\n"
+	}
+	$self;
 }
 
 sub mm_alt {
diff --git a/lib/PublicInbox/IndexFilter/XArchivesHash.pm b/lib/PublicInbox/IndexFilter/XArchivesHash.pm
new file mode 100644
index 00000000..238a5925
--- /dev/null
+++ b/lib/PublicInbox/IndexFilter/XArchivesHash.pm
@@ -0,0 +1,30 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# allow searching on X-Archives-Hash
+package PublicInbox::IndexFilter::XArchivesHash;
+use v5.12;
+use Carp qw(carp);
+
+# attach sidx (SearchIdx) object to $self?
+sub new { bless {}, __PACKAGE__ }
+
+# called by SearchIdx (internal APIs are unstable)
+sub index_filter {
+	my ($self, $sidx, $doc, $eml, $pfx) = @_;
+	# $sidx may be used for index_phrase in packages
+	my @h = grep /\A(?:[a-f0-9]{32})\z/, # strict RE
+		$eml->header_raw('X-Archives-Hash');
+	if (scalar(@h) == 0) {
+		carp 'E: no hash in X-Archives-Hash <',
+			$eml->header_raw('Message-ID'), '>';
+	} elsif (scalar(@h) != 1) {
+		carp "W: multiple hashes in X-Archives-Hash: @h";
+		# fall-through to index all of them:
+	}
+	$doc->add_boolean_term($pfx.$_) for @h;
+}
+
+# TODO: unindex_filter? maybe unneeded since entire Xapian doc is deleted
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 477f77dc..bee86a6d 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -507,15 +507,26 @@ sub qparse_new {
 	# just parse the spec to avoid the extra DB handles for now.
 	if (my $altid = $self->{altid}) {
 		my $user_pfx = $self->{-user_pfx} = [];
+		# FIXME: consider moving some of this logic to AltId.pm
 		for (@$altid) {
 			# $_ = 'serial:gmane:/path/to/gmane.msgmap.sqlite3'
 			# note: Xapian supports multibyte UTF-8, /^[0-9]+$/,
 			# and '_' with prefixes matching \w+
-			/\Aserial:(\w+):/ or next;
-			my $pfx = $1;
-			push @$user_pfx, "$pfx:", <<EOF;
+			/\A(serial|indexfilter):(\w+):/ or do {
+				warn "W: unsupported altid=$_\n";
+				next;
+			};
+			my ($type, $pfx) = ($1, $2);
+			if ($type eq 'serial') {
+				push @$user_pfx, "$pfx:", <<EOF;
 alternate serial number  e.g. $pfx:12345 (boolean)
 EOF
+			} elsif ($type eq 'indexfilter') {
+				# TODO: support help in IndexFilter classes?
+				push @$user_pfx, "$pfx:", <<EOF;
+alternate prefix e.g. $pfx:xyz
+EOF
+			}
 			# gmane => XGMANE
 			$qp->add_boolean_prefix($pfx, 'X'.uc($pfx));
 		}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 9566b14d..c5ddba45 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -474,10 +474,13 @@ sub eml2doc ($$$;$) {
 	if (my $altid = $self->{-altid}) {
 		foreach my $alt (@$altid) {
 			my $pfx = $alt->{xprefix};
-			foreach my $mid (@$mids) {
-				my $id = $alt->mid2alt($mid);
-				next unless defined $id;
-				$doc->add_boolean_term($pfx . $id);
+			if (my $idxf = $alt->{indexfilter}) {
+				$idxf->index_filter($self, $doc, $eml, $pfx);
+			} else { # traditional Message-ID <=> NNTP number map
+				for my $mid (@$mids) {
+					my $id = $alt->mid2alt($mid) // next;
+					$doc->add_boolean_term($pfx . $id);
+				}
 			}
 		}
 	}
diff --git a/t/watch_indexfilter_xarchiveshash.t b/t/watch_indexfilter_xarchiveshash.t
new file mode 100644
index 00000000..c0af8fcc
--- /dev/null
+++ b/t/watch_indexfilter_xarchiveshash.t
@@ -0,0 +1,90 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use v5.12;
+use autodie;
+use PublicInbox::TestCommon;
+use PublicInbox::Eml;
+use PublicInbox::Emergency;
+use PublicInbox::IO qw(write_file);
+use PublicInbox::InboxIdle;
+use PublicInbox::Inbox;
+use PublicInbox::DS;
+use PublicInbox::Config;
+require_mods(qw(DBD::SQLite Xapian));
+my $tmpdir = tmpdir;
+my $config = "$tmpdir/pi_config";
+local $ENV{PI_CONFIG} = $config;
+delete local $ENV{PI_DIR};
+my @V = (1);
+my @creat_opt = (indexlevel => 'medium', sub {});
+my $v1 = create_inbox 'v1', tmpdir => "$tmpdir/v1", @creat_opt;
+my $fh = write_file '>', $config, <<EOM;
+[publicinbox "v1"]
+	inboxdir = $v1->{inboxdir}
+	address = v1\@example.com
+	watch = maildir:$tmpdir/v1-md
+	altid = indexfilter:xarchiveshash:package=XArchivesHash
+EOM
+
+SKIP: {
+	require_git(v2.6, 1);
+	push @V, 2;
+	my $v2 = create_inbox 'v2', tmpdir => "$tmpdir/v2", @creat_opt;
+	my $pkg = 'PublicInbox::IndexFilter::XArchivesHash';
+	print $fh <<EOM;
+[publicinbox "v2"]
+	inboxdir = $tmpdir/v2
+	address = v2\@example.com
+	watch = maildir:$tmpdir/v2-md
+	altid = indexfilter:xarchiveshash:package=$pkg
+EOM
+}
+close $fh;
+my $cfg = PublicInbox::Config->new;
+for my $v (@V) { for ('', qw(cur new tmp)) { mkdir "$tmpdir/v$v-md/$_" } }
+my $wm = start_script([qw(-watch)]);
+my $h1 = 'deadbeef' x 4;
+my @em = map {
+	my $v = $_;
+	my $em = PublicInbox::Emergency->new("$tmpdir/v$v-md");
+	$em->prepare(\(PublicInbox::Eml->new(<<EOM)->as_string));
+From: x\@example.com
+Message-ID: <i-1$v\@example.com>
+To: <v$v\@example.com>
+Date: Sat, 02 Oct 2010 00:00:00 +0000
+X-Archives-Hash: $h1
+
+EOM
+	$em;
+} @V;
+
+my $delivered = 0;
+my $cb = sub {
+	diag "message delivered to `$_[0]->{name}'";
+	++$delivered;
+};
+PublicInbox::DS->Reset;
+my $ii = PublicInbox::InboxIdle->new($cfg);
+my $obj = bless \$cb, 'PublicInbox::TestCommon::InboxWakeup';
+$cfg->each_inbox(sub { $_[0]->subscribe_unlock('ident', $obj) });
+local @PublicInbox::DS::post_loop_do = (sub { $delivered != @V });
+$_->commit for @em;
+diag 'waiting for -watch to import new message(s)';
+PublicInbox::DS::event_loop();
+$wm->join('TERM');
+$ii->close;
+
+$cfg->each_inbox(sub {
+	my ($ibx) = @_;
+	my $srch = $ibx->search;
+	my $mset = $srch->mset('xarchiveshash:miss');
+	is($mset->size, 0, 'got xarchiveshash:miss non-result');
+	$mset = $srch->mset("xarchiveshash:$h1");
+	is($mset->size, 1, 'got xarchiveshash: hit result') or return;
+	my $num = $srch->mset_to_artnums($mset);
+	my $eml = $ibx->smsg_eml($ibx->over->get_art($num->[0]));
+	is($eml->header_raw('X-Archives-Hash'), $h1,
+		'stored message with X-Archives-Hash');
+});
+
+done_testing;

* Re: [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs)
  2023-11-20  3:21 ` [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs) Eric Wong
@ 2023-12-08 21:23   ` Eric Wong
  2024-04-27  7:00     ` Eric Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Wong @ 2023-12-08 21:23 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: meta

Re: https://public-inbox.org/meta/20231120032132.M610564@dcvr/
Ping?

* Re: [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs)
  2023-12-08 21:23   ` Eric Wong
@ 2024-04-27  7:00     ` Eric Wong
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Wong @ 2024-04-27  7:00 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: meta, Robin H. Johnson

Ping.

Trying your gentoo address, since I'm wondering if your lack of
response was due to an address I hadn't seen before.

Any thoughts on this RFC for configuring additional header
indices?  I was just reminded of this by someone else...

https://public-inbox.org/meta/20231120032132.M610564@dcvr/

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
