user/dev discussion of public-inbox itself
* [PATCH 00/22] lei query overview views
@ 2021-01-10 12:14  7% Eric Wong
  2021-01-10 12:14  6% ` [PATCH 02/22] lei q: deduplicate smsg Eric Wong
  0 siblings, 1 reply; 2+ results
From: Eric Wong @ 2021-01-10 12:14 UTC (permalink / raw)
  To: meta

Usage summary:

	lei add-external /path/to/v1-or-v2-inbox
	lei add-external /path/to/another-inbox-or-ext-index
			# URLs aren't supported, yet :<

	lei q SEARCH TERMS GO HERE... # pager should open with JSON output

For faster startup time than what Inline::C can give:

	apt-get install libsocket-msghdr-perl # Socket::MsgHdr

Having neither Inline::C nor Socket::MsgHdr means parallel
queries won't work.
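
To check which of the two is available on a given system, something
like the following could be used.  This is only a sketch and not how
lei itself decides: fd_passing_backend is a hypothetical helper name,
and being able to load Inline::C does not guarantee a working C
compiler is present.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): return the name of
# the first optional module that loads, or undef if neither does.
sub fd_passing_backend {
	for my $mod (qw(Socket::MsgHdr Inline::C)) {
		# string eval so a missing module is not a fatal error
		return $mod if eval "require $mod; 1";
	}
	undef;
}

my $backend = fd_passing_backend();
print defined($backend)
	? "found $backend, parallel queries should be possible\n"
	: "neither module found, parallel queries won't work\n";
```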

I went back-and-forth on a bunch of things but ultimately gave
up trying to support IO::FDPass since it got too fragile and
difficult to test with the work-queue distribution.

The pager now runs from the client process (if using Socket::MsgHdr
or Inline::C).  It took a fair amount of work from my slow
brain to get pager shutdown to be instantaneous, though queries
which haven't output anything aren't easily interruptible...

The wq_* IPC stuff will be reused in the normal read-only
WWW/IMAP search at some point, too.

Eric Wong (22):
  lei query + pagination sorta working
  lei q: deduplicate smsg
  ds: block signals when reaping
  ipc: add support for asynchronous callbacks
  cmd_ipc: send FDs with buffer payload
  ipc: avoid excessive evals
  ipc: work queue support via SOCK_SEQPACKET
  ipc: eliminate ipc_worker_stop method
  ipc: wq: support dynamic worker count change
  ipc: drop -ipc_parent_pid field
  ipc: DESTROY and wq_workers methods
  lei: rename $w to $wpager for warning message
  lei: fix oneshot TTY detection by passing STD*{GLOB}
  lei: query: ensure pager exit is instantaneous
  ipc: start supporting sending/receiving more than 3 FDs
  ipc: fix IO::FDPass use with a worker limit of 1
  ipc: drop unused fields, default sighandlers for wq
  lei: get rid of client {pid} field
  lei: fork + FD cleanup
  lei: run pager in client script
  lei_xsearch: transfer 4 FDs internally, drop IO::FDPass
  lei: query: restore JSON output overview

 MANIFEST                        |   4 +
 lib/PublicInbox/CmdIPC4.pm      |  36 ++++
 lib/PublicInbox/DS.pm           |  16 +-
 lib/PublicInbox/Daemon.pm       |  10 +-
 lib/PublicInbox/ExtSearchIdx.pm |   4 +-
 lib/PublicInbox/IPC.pm          | 280 ++++++++++++++++++++++++++++----
 lib/PublicInbox/LEI.pm          | 180 +++++++++++++-------
 lib/PublicInbox/LeiDedupe.pm    |  29 +++-
 lib/PublicInbox/LeiExternal.pm  |  33 ++--
 lib/PublicInbox/LeiOverview.pm  | 188 +++++++++++++++++++++
 lib/PublicInbox/LeiQuery.pm     |  92 +++++++++++
 lib/PublicInbox/LeiStore.pm     |   2 +-
 lib/PublicInbox/LeiToMail.pm    |   2 +
 lib/PublicInbox/LeiXSearch.pm   | 118 +++++++++++++-
 lib/PublicInbox/Search.pm       |  10 +-
 lib/PublicInbox/SearchView.pm   |  10 +-
 lib/PublicInbox/Sigfd.pm        |  12 +-
 lib/PublicInbox/Spawn.pm        |  85 ++++++----
 lib/PublicInbox/Watch.pm        |   8 +-
 script/lei                      |  76 +++++----
 script/public-inbox-watch       |   4 +-
 t/cmd_ipc.t                     |  82 ++++++++++
 t/ipc.t                         | 115 ++++++++++++-
 t/lei.t                         |  31 +++-
 t/lei_dedupe.t                  |  14 ++
 t/lei_xsearch.t                 |   5 +
 t/spawn.t                       |  33 +---
 27 files changed, 1233 insertions(+), 246 deletions(-)
 create mode 100644 lib/PublicInbox/CmdIPC4.pm
 create mode 100644 lib/PublicInbox/LeiOverview.pm
 create mode 100644 lib/PublicInbox/LeiQuery.pm
 create mode 100644 t/cmd_ipc.t

* [PATCH 02/22] lei q: deduplicate smsg
  2021-01-10 12:14  7% [PATCH 00/22] lei query overview views Eric Wong
@ 2021-01-10 12:14  6% ` Eric Wong
  0 siblings, 0 replies; 2+ results
From: Eric Wong @ 2021-01-10 12:14 UTC (permalink / raw)
  To: meta

We don't want duplicate messages in results overviews, either.
---
 lib/PublicInbox/LeiDedupe.pm | 29 ++++++++++++++++++++++++++++-
 lib/PublicInbox/LeiQuery.pm  |  5 +++++
 t/lei_dedupe.t               | 14 ++++++++++++++
 3 files changed, 47 insertions(+), 1 deletion(-)
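
For readers skimming the diff, the idea behind smsg_hash can be
sketched standalone roughly as below.  Assumptions: a plain in-memory
hash stands in for PublicInbox::SharedKV (whose set_maybe is taken
here to be true only on first insertion), make_dedupe is a
hypothetical name, and undefined fields are treated as empty strings.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256);

# Hash the overview fields of a message summary, following the
# field list used by smsg_hash in this patch.
sub smsg_hash {
	my ($smsg) = @_;
	my $x = join("\0", map { $_ // '' }
		@$smsg{qw(from to cc ds subject references mid)});
	utf8::encode($x); # digest functions want octets, not characters
	sha256($x);
}

# Hypothetical stand-in for the SharedKV-backed callback: the
# returned sub is true the first time a given smsg is seen.
sub make_dedupe {
	my %seen;
	sub { my ($smsg) = @_; !$seen{smsg_hash($smsg)}++ };
}

my $first_seen = make_dedupe();
my %base = (from => 'a@example.com', subject => 'hello', ds => 1);
my @results = (
	{ %base, mid => 'x@example.com' },
	{ %base, mid => 'x@example.com' }, # same message from a 2nd source
	{ %base, mid => 'y@example.com' },
);
my @uniq = grep { $first_seen->($_) } @results;
print scalar(@uniq), ' unique of ', scalar(@results), "\n"; # 2 unique of 3
```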

diff --git a/lib/PublicInbox/LeiDedupe.pm b/lib/PublicInbox/LeiDedupe.pm
index c4e5dffb..58eee533 100644
--- a/lib/PublicInbox/LeiDedupe.pm
+++ b/lib/PublicInbox/LeiDedupe.pm
@@ -33,12 +33,24 @@ sub _regen_oid ($) {
 
 sub _oidbin ($) { defined($_[0]) ? pack('H*', $_[0]) : undef }
 
+sub smsg_hash ($) {
+	my ($smsg) = @_;
+	my $dig = Digest::SHA->new(256);
+	my $x = join("\0", @$smsg{qw(from to cc ds subject references mid)});
+	utf8::encode($x);
+	$dig->add($x);
+	$dig->digest;
+}
+
 # the paranoid option
 sub dedupe_oid () {
 	my $skv = PublicInbox::SharedKV->new;
 	($skv, sub { # may be called in a child process
 		my ($eml, $oid) = @_;
 		$skv->set_maybe(_oidbin($oid) // _regen_oid($eml), '');
+	}, sub {
+		my ($smsg) = @_;
+		$skv->set_maybe(_oidbin($smsg->{blob}), '');
 	});
 }
 
@@ -51,6 +63,12 @@ sub dedupe_mid () {
 		my $mid = $eml->header_raw('Message-ID') // _oidbin($oid) //
 			content_hash($eml);
 		$skv->set_maybe($mid, '');
+	}, sub {
+		my ($smsg) = @_;
+		my $mid = $smsg->{mid};
+		$mid = undef if $mid eq '';
+		$mid //= smsg_hash($smsg) // _oidbin($smsg->{blob});
+		$skv->set_maybe($mid, '');
 	});
 }
 
@@ -60,11 +78,15 @@ sub dedupe_content () {
 	($skv, sub { # may be called in a child process
 		my ($eml) = @_; # oid = $_[1], ignored
 		$skv->set_maybe(content_hash($eml), '');
+	}, sub {
+		my ($smsg) = @_;
+		$skv->set_maybe(smsg_hash($smsg), '');
 	});
 }
 
 # no deduplication at all
-sub dedupe_none () { (undef, sub { 1 }) }
+sub true { 1 }
+sub dedupe_none () { (undef, \&true, \&true) }
 
 sub new {
 	my ($cls, $lei, $dst) = @_;
@@ -85,6 +107,11 @@ sub is_dup {
 	!$self->[1]->($eml, $oid);
 }
 
+sub is_smsg_dup {
+	my ($self, $smsg) = @_;
+	!$self->[2]->($smsg);
+}
+
 sub prepare_dedupe {
 	my ($self) = @_;
 	my $skv = $self->[0];
diff --git a/lib/PublicInbox/LeiQuery.pm b/lib/PublicInbox/LeiQuery.pm
index d14da1bc..f69dccad 100644
--- a/lib/PublicInbox/LeiQuery.pm
+++ b/lib/PublicInbox/LeiQuery.pm
@@ -69,6 +69,8 @@ sub lei_q {
 	} @argv);
 	$opt->{limit} //= 10000;
 	my $lxs;
+	require PublicInbox::LeiDedupe;
+	my $dd = PublicInbox::LeiDedupe->new($self);
 
 	# --local is enabled by default
 	my @src = $opt->{'local'} ? ($sto->search) : ();
@@ -135,6 +137,7 @@ sub lei_q {
 		delete @$smsg{qw(tid num)}; # only makes sense if single src
 		chomp($buf = $json->encode(_smsg_unbless($smsg)));
 	};
+	$dd->prepare_dedupe;
 	for my $src (@src) {
 		my $srch = $src->search;
 		my $over = $src->over;
@@ -145,6 +148,7 @@ sub lei_q {
 		if ($smsg_for) {
 			for my $it ($mset->items) {
 				my $smsg = $smsg_for->($srch, $it) or next;
+				next if $dd->is_smsg_dup($smsg);
 				$self->out($buf .= $ORS) if defined $buf;
 				$smsg->{relevance} = get_pct($it);
 				$emit_cb->($smsg);
@@ -160,6 +164,7 @@ sub lei_q {
 			while ($over && $over->expand_thread($ctx)) {
 				for my $n (@{$ctx->{xids}}) {
 					my $t = $over->get_art($n) or next;
+					next if $dd->is_smsg_dup($t);
 					if (my $p = delete $n2p{$t->{num}}) {
 						$t->{relevance} = $p;
 					}
diff --git a/t/lei_dedupe.t b/t/lei_dedupe.t
index b5e2b8f9..6e971b9b 100644
--- a/t/lei_dedupe.t
+++ b/t/lei_dedupe.t
@@ -6,12 +6,16 @@ use v5.10.1;
 use Test::More;
 use PublicInbox::TestCommon;
 use PublicInbox::Eml;
+use PublicInbox::Smsg;
 require_mods(qw(DBD::SQLite));
 use_ok 'PublicInbox::LeiDedupe';
 my $eml = eml_load('t/plack-qp.eml');
 my $mid = $eml->header_raw('Message-ID');
 my $different = eml_load('t/msg_iter-order.eml');
 $different->header_set('Message-ID', $mid);
+my $smsg = bless { ds => time }, 'PublicInbox::Smsg';
+$smsg->populate($eml);
+$smsg->{$_} //= '' for (qw(to cc references)) ;
 
 my $lei = { opt => { dedupe => 'none' } };
 my $dd = PublicInbox::LeiDedupe->new($lei);
@@ -19,6 +23,8 @@ $dd->prepare_dedupe;
 ok(!$dd->is_dup($eml), '1st is_dup w/o dedupe');
 ok(!$dd->is_dup($eml), '2nd is_dup w/o dedupe');
 ok(!$dd->is_dup($different), 'different is_dup w/o dedupe');
+ok(!$dd->is_smsg_dup($smsg), 'smsg dedupe none 1');
+ok(!$dd->is_smsg_dup($smsg), 'smsg dedupe none 2');
 
 for my $strat (undef, 'content') {
 	$lei->{opt}->{dedupe} = $strat;
@@ -28,6 +34,8 @@ for my $strat (undef, 'content') {
 	ok(!$dd->is_dup($eml), "1st is_dup with $desc dedupe");
 	ok($dd->is_dup($eml), "2nd seen with $desc dedupe");
 	ok(!$dd->is_dup($different), "different is_dup with $desc dedupe");
+	ok(!$dd->is_smsg_dup($smsg), "is_smsg_dup pass w/ $desc dedupe");
+	ok($dd->is_smsg_dup($smsg), "is_smsg_dup reject w/ $desc dedupe");
 }
 $lei->{opt}->{dedupe} = 'bogus';
 eval { PublicInbox::LeiDedupe->new($lei) };
@@ -39,6 +47,8 @@ $dd->prepare_dedupe;
 ok(!$dd->is_dup($eml), '1st is_dup with mid dedupe');
 ok($dd->is_dup($eml), '2nd seen with mid dedupe');
 ok($dd->is_dup($different), 'different seen with mid dedupe');
+ok(!$dd->is_smsg_dup($smsg), 'smsg mid dedupe pass');
+ok($dd->is_smsg_dup($smsg), 'smsg mid dedupe reject');
 
 $lei->{opt}->{dedupe} = 'oid';
 $dd = PublicInbox::LeiDedupe->new($lei);
@@ -56,4 +66,8 @@ ok($dd->is_dup($different, '01d'), 'different content ignored if oid matches');
 ok($dd->is_dup($eml, '01D'), 'case insensitive oid comparison :P');
 ok(!$dd->is_dup($eml, '01dbad'), 'case insensitive oid comparison :P');
 
+$smsg->{blob} = 'dead';
+ok(!$dd->is_smsg_dup($smsg), 'smsg dedupe pass');
+ok($dd->is_smsg_dup($smsg), 'smsg dedupe reject');
+
 done_testing;

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git