From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.2 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,
	T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham autolearn_force=no
	version=3.4.6
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id AE8601F406;
	Mon, 20 Nov 2023 03:21:32 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=80x24.org;
	s=selector1; t=1700450492;
	bh=yZKeEAZs6ABMAFUBkXRrCS8klG/j7ez3qlGsf608vEA=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=MebRsDVM8hbvYfJh0HZTG+txlHiHwa/UlRAjcrlI/IPE52fcHKiTJlp44wSlwiGG/
	 lvoX6bQLd9Gx/I+BJ4qppgmlRhZDUXPVY8xh7dHCcT48RADLte8afwGkH5rcx6ga+O
	 W6IpgArHZDwakHEoDNF1LUlT0JErvtFb9nlsPWUU=
Date: Mon, 20 Nov 2023 03:21:32 +0000
From: Eric Wong
To: "Robin H. Johnson"
Cc: meta@public-inbox.org
Subject: [RFC] altid: start supporting indexfilter type (was: Alternate permalink URLs)
Message-ID: <20231120032132.M610564@dcvr>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

"Robin H. Johnson" wrote:
> Hi,
>
> This is more of a feature request / request for pointers on how to tweak
> the design to support something, and it might be suited to maintaining
> as a local patch.

Since the indexing internals are somewhat in flux and tied to
Xapian and Perl, I'm happy to carry it to ensure it stays
working (similar to the "altid" and existing Filter stuff).

> The permalinks offered by public-inbox are great, but at Gentoo Linux,
> we'd like to ALSO continue to offer our historical permalinks.

Would you want the historical permalinks displayed in the
PublicInbox::WWW HTML UI?
That is already the slowest and most expensive part of
public-inbox, so I'm hesitant to support more options which slow
it down (though I'm halfway considering introducing more C to
speed that part up...)

With the patch below, you should be able to use:

  https://public-inbox.gentoo.org/gentoo-dev/?q=xarchiveshash:499b958da430b925dbd2f2b58e0f507e

It works the same way as the existing "gmane:$INTEGER" altid search.
(Maybe that would be better with an "I'm Feeling Lucky" search...)

> For those, the permalink slug portion was built when the mail arrived
> into the archives ingest pipeline.
>
> Example legacy link:
> https://archives.gentoo.org/gentoo-dev/message/499b958da430b925dbd2f2b58e0f507e
>
> We'd need to tweak the index somehow to expose it.
>
> That same mail as visible in our public-inbox test site:
> https://public-inbox.gentoo.org/gentoo-dev/538ce05eef3f4df3468cbc7f7abfa90eb2ea7d51.camel@gentoo.org/raw
>
> The permalink slug is in the header:
> X-Archives-Hash: 499b958da430b925dbd2f2b58e0f507e
>
> This needs to end up in the Xapian index (which doesn't seem to index
> headers right now), and then get wired up as a route:
> On access, redirect to public-inbox permalink.
>
> Pointers on where in the codebase to wire up the Xapian side greatly
> appreciated, since it doesn't seem to be indexing arbitrary headers
> right now.

The indexing+search part is something that's been requested by
others, too.

With the below patch, setting:

	altid = indexfilter:xarchiveshash:package=XArchivesHash

for a given inbox, you should be able to search on
"xarchiveshash:$hash" the same way the "gmane:$INTEGER" altid
search works for public-inbox.org/git/

Sidenote: Unfortunately, altid needs to be configured per-inbox,
but I suppose indexfilter (unlike serial) makes sense to support
globally in the future...

You can also replace "xarchiveshash" with any unused
all-lowercase prefix (my brain kept leaving out the "s" while
writing tests and I was puzzled why it didn't work at first :x).
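
For illustration, a per-inbox config section using that setting
might look roughly like the sketch below.  The inbox name, inboxdir
and address are made-up placeholders; only the altid line is taken
from the patch:

```
[publicinbox "gentoo-dev"]
	; hypothetical inbox name, path and address
	inboxdir = /path/to/gentoo-dev
	address = gentoo-dev@example.org
	; prefix "xarchiveshash" is what gets typed in the search box
	altid = indexfilter:xarchiveshash:package=XArchivesHash
```

Since the prefix only takes effect at indexing time, messages
indexed before the setting was added would presumably need a
reindex before "xarchiveshash:$hash" matches them.
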
If you want to carry a private plugin to search on "foo:" using
MyPackage::Foo, you should be able to add this to the
publicinbox.$NAME section:

	altid = indexfilter:foo:package=MyPackage::Foo

But I'm a bit hesitant to declare the indexing internals a stable
API to support into eternity.  So I'd rather take a patch to
handle stuff in the PublicInbox::IndexFilter::* namespace.

------8<-------
Subject: [RFC] altid: start supporting indexfilter type

In addition to the traditional AltId serial numbers from
external sources (e.g. gmane), we can support Xapian-only
indexing filters using Perl packages in the
PublicInbox::IndexFilter::* namespace.

Unlike the old `serial' type, this requires no separate SQLite
DB since its data is expected to be contained within the raw
message.  `indexfilter' only affects Xapian indexing, and isn't
subject to the restrictions of the `serial' type, which enforces
a 1:1 Message-ID <=> integer relationship used for NNTP.

Unlike the existing PublicInbox::Filter::* namespace, this
doesn't affect message delivery paths (-watch/-mda) at all
and can be used from (clone|fetch)-synchronized mirrors.

The new PublicInbox::IndexFilter::XArchivesHash may be a
starting point for Gentoo archives, but other packages can
be added for other hosts.

This depends on Perl modules being implemented for each case;
but I figure using Perl directly is preferable to having some
new syntax that gets translated (likely poorly!) to actual Perl.
In other words, we're trying not to reinvent or reimplement
procmail, sieve, or any other mail processing language.

Link: https://public-inbox.org/meta/robbat2-20231119T232932-954868624Z@orbis-terrarum.net/
---
 MANIFEST                                     |  2 +
 lib/PublicInbox/AltId.pm                     | 32 ++++---
 lib/PublicInbox/IndexFilter/XArchivesHash.pm | 30 +++++++
 lib/PublicInbox/Search.pm                    | 17 +++-
 lib/PublicInbox/SearchIdx.pm                 | 11 ++-
 t/watch_indexfilter_xarchiveshash.t          | 90 ++++++++++++++++++++
 6 files changed, 164 insertions(+), 18 deletions(-)
 create mode 100644 lib/PublicInbox/IndexFilter/XArchivesHash.pm
 create mode 100644 t/watch_indexfilter_xarchiveshash.t

diff --git a/MANIFEST b/MANIFEST
index e1c3dc97..d4173f20 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -226,6 +226,7 @@ lib/PublicInbox/In2Tie.pm
 lib/PublicInbox/Inbox.pm
 lib/PublicInbox/InboxIdle.pm
 lib/PublicInbox/InboxWritable.pm
+lib/PublicInbox/IndexFilter/XArchivesHash.pm
 lib/PublicInbox/Inotify.pm
 lib/PublicInbox/InputPipe.pm
 lib/PublicInbox/Isearch.pm
@@ -614,6 +615,7 @@ t/v2writable.t
 t/view.t
 t/watch_filter_rubylang.t
 t/watch_imap.t
+t/watch_indexfilter_xarchiveshash.t
 t/watch_maildir.t
 t/watch_maildir_v2.t
 t/watch_multiple_headers.t
diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index 80757ceb..5b917edb 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -21,27 +21,37 @@ use PublicInbox::Msgmap;
 sub new {
 	my ($class, $ibx, $spec, $writable) = @_;
 	my ($type, $prefix, $query) = split(/:/, $spec, 3);
-	$type eq 'serial' or die "non-serial not supported, yet\n";
 	$prefix =~ /\A\w+\z/ or warn "non-word prefix not searchable\n";
 	my %params = map {
 		my ($k, $v) = split(/=/, uri_unescape($_), 2);
 		$v = '' unless defined $v;
 		($k, $v);
 	} split(/[&;]/, $query);
-	my $f = $params{file} or die "file: required for $type spec $spec\n";
-	unless (index($f, '/') == 0) {
-		if ($ibx->version == 1) {
-			$f = "$ibx->{inboxdir}/public-inbox/$f";
-		} else {
-			$f = "$ibx->{inboxdir}/$f";
-		}
-	}
-	bless {
-		filename => $f,
+	my $self = bless {
 		writable => $writable,
 		prefix => $prefix,
 		xprefix => 'X'.uc($prefix),
 	}, $class;
+	if ($type eq 'serial') { # traditional message-ID <=> integer mapping
+		my $f = $params{file} or die
+			"E: file required for $type altid=$spec\n";
+		unless (index($f, '/') == 0) {
+			$f = $ibx->version == 1 ?
+				"$ibx->{inboxdir}/public-inbox/$f" :
+				"$ibx->{inboxdir}/$f";
+		}
+		$self->{filename} = $f;
+	} elsif ($type eq 'indexfilter') {
+		my $pkg = $params{package} //
+			die "E: package= unset for altid=$spec\n";
+		$pkg =~ m!::! or $pkg = "PublicInbox::IndexFilter::$pkg";
+		eval "require $pkg";
+		die "E: could not load $pkg for altid=$spec: $@" if $@;
+		$self->{indexfilter} = $pkg->new;
+	} else {
+		die "non-serial/non-indexfilter not supported, yet ($type)\n"
+	}
+	$self;
 }
 
 sub mm_alt {
diff --git a/lib/PublicInbox/IndexFilter/XArchivesHash.pm b/lib/PublicInbox/IndexFilter/XArchivesHash.pm
new file mode 100644
index 00000000..238a5925
--- /dev/null
+++ b/lib/PublicInbox/IndexFilter/XArchivesHash.pm
@@ -0,0 +1,30 @@
+# Copyright (C) all contributors
+# License: AGPL-3.0+
+
+# map allow searching on X-Archives-Hash
+package PublicInbox::IndexFilter::XArchivesHash;
+use v5.12;
+use Carp qw(carp);
+
+# attach sidx (SearchIdx) object to $self?
+sub new { bless {}, __PACKAGE__ }
+
+# called by SearchIdx (internal APIs are unstable)
+sub index_filter {
+	my ($self, $sidx, $doc, $eml, $pfx) = @_;
+	# $sidx may be used for index_phrase in packages
+	my @h = grep /\A(?:[a-f0-9]{32})\z/, # strict RE
+		$eml->header_raw('X-Archives-Hash');
+	if (scalar(@h) == 0) {
+		carp 'E: no hash in X-Archives-Hash <',
+			$eml->header_raw('Message-ID'), '>';
+	} elsif (scalar(@h) != 1) {
+		carp "W: multiple hashes in X-Archives-Hash: @h";
+		# fall-through to index all of them:
+	}
+	$doc->add_boolean_term($pfx.$_) for @h;
+}
+
+# TODO: unindex_filter?  maybe unneeded since entire Xapian doc is deleted
+
+1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 477f77dc..bee86a6d 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -507,15 +507,26 @@ sub qparse_new {
 	# just parse the spec to avoid the extra DB handles for now.
 	if (my $altid = $self->{altid}) {
 		my $user_pfx = $self->{-user_pfx} = [];
+		# FIXME: consider moving some of this logic to AltId.pm
 		for (@$altid) {
 			# $_ = 'serial:gmane:/path/to/gmane.msgmap.sqlite3'
 			# note: Xapian supports multibyte UTF-8, /^[0-9]+$/,
 			# and '_' with prefixes matching \w+
-			/\Aserial:(\w+):/ or next;
-			my $pfx = $1;
-			push @$user_pfx, "$pfx:", < XGMANE
 			$qp->add_boolean_prefix($pfx, 'X'.uc($pfx));
 		}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 9566b14d..c5ddba45 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -474,10 +474,13 @@ sub eml2doc ($$$;$) {
 	if (my $altid = $self->{-altid}) {
 		foreach my $alt (@$altid) {
 			my $pfx = $alt->{xprefix};
-			foreach my $mid (@$mids) {
-				my $id = $alt->mid2alt($mid);
-				next unless defined $id;
-				$doc->add_boolean_term($pfx . $id);
+			if (my $idxf = $alt->{indexfilter}) {
+				$idxf->index_filter($self, $doc, $eml, $pfx);
+			} else { # traditional Message-ID <=> NNTP number map
+				for my $mid (@$mids) {
+					my $id = $alt->mid2alt($mid) // next;
+					$doc->add_boolean_term($pfx . $id);
+				}
 			}
 		}
 	}
diff --git a/t/watch_indexfilter_xarchiveshash.t b/t/watch_indexfilter_xarchiveshash.t
new file mode 100644
index 00000000..c0af8fcc
--- /dev/null
+++ b/t/watch_indexfilter_xarchiveshash.t
@@ -0,0 +1,90 @@
+# Copyright (C) all contributors
+# License: AGPL-3.0+
+use v5.12;
+use autodie;
+use PublicInbox::TestCommon;
+use PublicInbox::Eml;
+use PublicInbox::Emergency;
+use PublicInbox::IO qw(write_file);
+use PublicInbox::InboxIdle;
+use PublicInbox::Inbox;
+use PublicInbox::DS;
+use PublicInbox::Config;
+require_mods(qw(DBD::SQLite Xapian));
+my $tmpdir = tmpdir;
+my $config = "$tmpdir/pi_config";
+local $ENV{PI_CONFIG} = $config;
+delete local $ENV{PI_DIR};
+my @V = (1);
+my @creat_opt = (indexlevel => 'medium', sub {});
+my $v1 = create_inbox 'v1', tmpdir => "$tmpdir/v1", @creat_opt;
+my $fh = write_file '>', $config, <{inboxdir}
+	address = v1\@example.com
+	watch = maildir:$tmpdir/v1-md
+	altid = indexfilter:xarchiveshash:package=XArchivesHash
+EOM
+
+SKIP: {
+	require_git(v2.6, 1);
+	push @V, 2;
+	my $v2 = create_inbox 'v2', tmpdir => "$tmpdir/v2", @creat_opt;
+	my $pkg = 'PublicInbox::IndexFilter::XArchivesHash';
+	print $fh <new;
+for my $v (@V) { for ('', qw(cur new tmp)) { mkdir "$tmpdir/v$v-md/$_" } }
+my $wm = start_script([qw(-watch)]);
+my $h1 = 'deadbeef' x 4;
+my @em = map {
+	my $v = $_;
+	my $em = PublicInbox::Emergency->new("$tmpdir/v$v-md");
+	$em->prepare(\(PublicInbox::Eml->new(<as_string));
+From: x\@example.com
+Message-ID:
+To:
+Date: Sat, 02 Oct 2010 00:00:00 +0000
+X-Archives-Hash: $h1
+
+EOM
+	$em;
+} @V;
+
+my $delivered = 0;
+my $cb = sub {
+	diag "message delivered to `$_[0]->{name}'";
+	++$delivered;
+};
+PublicInbox::DS->Reset;
+my $ii = PublicInbox::InboxIdle->new($cfg);
+my $obj = bless \$cb, 'PublicInbox::TestCommon::InboxWakeup';
+$cfg->each_inbox(sub { $_[0]->subscribe_unlock('ident', $obj) });
+local @PublicInbox::DS::post_loop_do = (sub { $delivered != @V });
+$_->commit for @em;
+diag 'waiting for -watch to import new message(s)';
+PublicInbox::DS::event_loop();
+$wm->join('TERM');
+$ii->close;
+
+$cfg->each_inbox(sub {
+	my ($ibx) = @_;
+	my $srch = $ibx->search;
+	my $mset = $srch->mset('xarchiveshash:miss');
+	is($mset->size, 0, 'got xarchiveshash:miss non-result');
+	$mset = $srch->mset("xarchiveshash:$h1");
+	is($mset->size, 1, 'got xarchiveshash: hit result') or return;
+	my $num = $srch->mset_to_artnums($mset);
+	my $eml = $ibx->smsg_eml($ibx->over->get_art($num->[0]));
+	is($eml->header_raw('X-Archives-Hash'), $h1,
+		'stored message with X-Archives-Hash');
+});
+
+done_testing;