user/dev discussion of public-inbox itself
* [PATCH 0/6] shorten and simplify uniq logic
@ 2020-01-23 23:05 Eric Wong
  2020-01-23 23:05 ` [PATCH 1/6] contentid: use map to generate %seen for Message-Ids Eric Wong
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

I noticed List::Util 1.45+ includes a new "uniq()" sub, but
that's only distributed with Perl as of 5.26+.

Since we care about supporting older versions of Perl, I still
took the opportunity to simplify some of our own similar logic
for making things unique.  It turns out only Inbox->nntp_url
really benefits from List::Util::uniq at the moment, but there
are some small simplifications to be had along the way.

Eric Wong (6):
  contentid: use map to generate %seen for Message-Ids
  nntp: simplify setting X-Alt-Message-ID
  inbox: simplify filtering for duplicate NNTP URLs
  mid: shorten uniq_mids logic
  wwwstream: shorten cloneurl uniquification
  contentid: ignore duplicate References: headers

 lib/PublicInbox/ContentId.pm | 12 ++++--------
 lib/PublicInbox/Inbox.pm     | 11 +++++------
 lib/PublicInbox/MID.pm       |  4 +---
 lib/PublicInbox/NNTP.pm      |  5 +----
 lib/PublicInbox/OverIdx.pm   |  3 +--
 lib/PublicInbox/WwwStream.pm |  8 +++-----
 6 files changed, 15 insertions(+), 28 deletions(-)

^ permalink raw reply	[flat|nested] 7+ messages in thread
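
For readers following along, here is a minimal sketch of the trade-off
described in the cover letter.  It is not part of the series: the $uniq
variable and the example URLs are made up for illustration.  It prefers
List::Util::uniq when the installed List::Util provides it and falls
back to the grep + %seen idiom otherwise.

```perl
use strict;
use warnings;
use List::Util ();

# Hypothetical helper, not public-inbox code: use the core sub when
# List::Util is new enough (1.45+), else the portable idiom.
my $uniq = List::Util->can('uniq') || sub {
	my %seen;
	grep { !$seen{$_}++ } @_;
};

# Made-up input; the first occurrence of each element is kept, in order.
my @urls = $uniq->(qw(nntp://a.example/g nntp://b.example/g nntp://a.example/g));
print "$_\n" for @urls;
```

One caveat with the fallback: it stringifies its arguments for use as
hash keys, so undef would trigger an "uninitialized" warning and be
treated like the empty string, which List::Util::uniq avoids.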

* [PATCH 1/6] contentid: use map to generate %seen for Message-Ids
  2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
  2020-01-23 23:05 ` [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID Eric Wong
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

Building %seen with map {} is a common idiom, and it fits here
since we no longer consider the Message-ID as part of the digest;
the loop only needs to record which Message-IDs to skip.
---
 lib/PublicInbox/ContentId.pm | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index eb937a0e..0c4a8678 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -60,12 +60,9 @@ sub content_digest ($) {
 	# References: and In-Reply-To: get used interchangeably
 	# in some "duplicates" in LKML.  We treat them the same
 	# in SearchIdx, so treat them the same for this:
-	my %seen;
-	foreach my $mid (@{mids($hdr)}) {
-		# do NOT consider the Message-ID as part of the content_id
-		# if we got here, we've already got Message-ID reuse
-		$seen{$mid} = 1;
-	}
+	# do NOT consider the Message-ID as part of the content_id
+	# if we got here, we've already got Message-ID reuse
+	my %seen = map { $_ => 1 } @{mids($hdr)};
 	foreach my $mid (@{references($hdr)}) {
 		next if $seen{$mid};
 		$dig->add("ref\0$mid\0");

^ permalink raw reply	[flat|nested] 7+ messages in thread
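
A small standalone sketch of the idiom in patch 1, under the assumption
that @mids and @refs stand in for the results of mids($hdr) and
references($hdr).  The addresses are invented.  It shows that the
foreach loop and the map {} expression build the same lookup hash.

```perl
use strict;
use warnings;

# Stand-ins for mids($hdr) and references($hdr); values are made up.
my @mids = ('a@example.com', 'b@example.com');
my @refs = ('a@example.com', 'c@example.com');

# Old form: explicit loop that only marks each Message-ID as seen.
my %seen_loop;
foreach my $mid (@mids) {
	$seen_loop{$mid} = 1;
}

# New form: the same hash built in one statement.
my %seen = map { $_ => 1 } @mids;

# Only references which are not also Message-IDs would reach the digest.
my @digest_input = grep { !$seen{$_} } @refs;
print "ref: $_\n" for @digest_input;
```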

* [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID
  2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
  2020-01-23 23:05 ` [PATCH 1/6] contentid: use map to generate %seen for Message-Ids Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
  2020-01-23 23:05 ` [PATCH 3/6] inbox: simplify filtering for duplicate NNTP URLs Eric Wong
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

We can cut down on the number of operations required
by using "grep" instead of "foreach".
---
 lib/PublicInbox/NNTP.pm | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 35729f00..12f74c3d 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -423,10 +423,7 @@ sub set_nntp_headers ($$$$$) {
 		$hdr->header_set('Message-ID', $mid0);
 		my @alt = $hdr->header('X-Alt-Message-ID');
 		my %seen = map { $_ => 1 } (@alt, $mid0);
-		foreach my $m (@mids) {
-			next if $seen{$m}++;
-			push @alt, $m;
-		}
+		push(@alt, grep { !$seen{$_}++ } @mids);
 		$hdr->header_set('X-Alt-Message-ID', @alt);
 	}
 

^ permalink raw reply	[flat|nested] 7+ messages in thread
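
To illustrate patch 2 outside of NNTP.pm, here is a sketch with made-up
values standing in for $mid0, @alt and @mids.  Because %seen is seeded
with the existing alternates and the primary Message-ID, the grep only
lets through IDs that are new, and the post-increment also filters
repeats within @mids itself.

```perl
use strict;
use warnings;

# Made-up stand-ins for the variables in set_nntp_headers.
my $mid0 = 'primary@example.com';
my @alt = ('old@example.com');
my @mids = ('primary@example.com', 'new@example.com',
	'old@example.com', 'new@example.com');

# Seed with everything that must not be added again.
my %seen = map { $_ => 1 } (@alt, $mid0);

# !$seen{$_}++ is true only the first time an unseen value appears.
push(@alt, grep { !$seen{$_}++ } @mids);

print "X-Alt-Message-ID: $_\n" for @alt;
```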

* [PATCH 3/6] inbox: simplify filtering for duplicate NNTP URLs
  2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
  2020-01-23 23:05 ` [PATCH 1/6] contentid: use map to generate %seen for Message-Ids Eric Wong
  2020-01-23 23:05 ` [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
  2020-01-23 23:05 ` [PATCH 4/6] mid: shorten uniq_mids logic Eric Wong
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

And add a note to remind ourselves to use List::Util::uniq
when it becomes common.
---
 lib/PublicInbox/Inbox.pm | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index e834d565..07e8b5b7 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -293,12 +293,11 @@ sub nntp_url {
 				# nntp://news.example.com/alt.example
 				push @m, $u;
 			}
-			my %seen = map { $_ => 1 } @urls;
-			foreach my $u (@m) {
-				next if $seen{$u};
-				$seen{$u} = 1;
-				push @urls, $u;
-			}
+
+			# List::Util::uniq requires Perl 5.26+, maybe we
+			# can use it by 2030 or so
+			my %seen;
+			@urls = grep { !$seen{$_}++ } (@urls, @m);
 		}
 		\@urls;
 	};

^ permalink raw reply	[flat|nested] 7+ messages in thread
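
A sketch comparing the old and new forms from patch 3 on invented
input.  One difference worth noting, assuming @urls could ever contain
a duplicate on its own: the old loop only filtered @m against @urls,
while the new grep runs over the combined list and so also collapses
repeats already present in @urls.  Whether that can happen in
Inbox->nntp_url is not something this thread states.

```perl
use strict;
use warnings;

# Invented input; @urls deliberately contains a duplicate.
my @urls = qw(nntp://a.example/g nntp://a.example/g);
my @m = qw(nntp://b.example/g nntp://a.example/g nntp://b.example/g);

# Old form: only entries from @m are checked.
my @old = @urls;
my %seen_old = map { $_ => 1 } @old;
foreach my $u (@m) {
	next if $seen_old{$u};
	$seen_old{$u} = 1;
	push @old, $u;
}

# New form: every entry of the combined list is checked.
my %seen;
my @new = grep { !$seen{$_}++ } (@urls, @m);

print "old: @old\n";
print "new: @new\n";
```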

* [PATCH 4/6] mid: shorten uniq_mids logic
  2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
                   ` (2 preceding siblings ...)
  2020-01-23 23:05 ` [PATCH 3/6] inbox: simplify filtering for duplicate NNTP URLs Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
  2020-01-23 23:05 ` [PATCH 5/6] wwwstream: shorten cloneurl uniquification Eric Wong
  2020-01-23 23:05 ` [PATCH 6/6] contentid: ignore duplicate References: headers Eric Wong
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

We won't be able to use List::Util::uniq here, but we can still
shorten our logic and make it more consistent with the rest of
our code which does similar things.
---
 lib/PublicInbox/MID.pm | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index d7a42c38..33d5af74 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -120,9 +120,7 @@ sub uniq_mids ($;$) {
 			warn "Message-ID: <$mid> too long, truncating\n";
 			$mid = substr($mid, 0, MAX_MID_SIZE);
 		}
-		next if $seen->{$mid};
-		push @ret, $mid;
-		$seen->{$mid} = 1;
+		push(@ret, $mid) unless $seen->{$mid}++;
 	}
 	\@ret;
 }

^ permalink raw reply	[flat|nested] 7+ messages in thread
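
The message above does not spell out why List::Util::uniq is not
usable in uniq_mids.  My reading of the hunk, which is an assumption,
is that the loop also truncates over-long IDs and takes $seen as a hash
reference which the caller may share across calls.  The sketch below
uses a hypothetical uniq_ids sub and an invented MAX_LEN to show that
shape; it is not the real PublicInbox::MID code.

```perl
use strict;
use warnings;
use constant MAX_LEN => 8; # invented limit, for illustration only

# Hypothetical stand-in for uniq_mids: truncate, then keep first-seen IDs.
sub uniq_ids {
	my ($ids, $seen) = @_;
	$seen ||= {};
	my @ret;
	foreach my $orig (@$ids) {
		# work on a copy so the caller's array is left untouched
		my $id = (length($orig) > MAX_LEN) ?
			substr($orig, 0, MAX_LEN) : $orig;
		push(@ret, $id) unless $seen->{$id}++;
	}
	\@ret;
}

my %seen; # shared between both calls
my $first = uniq_ids(['aaaaaaaaXX', 'aaaaaaaaYY', 'b'], \%seen);
my $second = uniq_ids(['b', 'c'], \%seen);
print "first: @$first\n";
print "second: @$second\n";
```

The two long IDs collide only after truncation, and 'b' is dropped from
the second call because the shared hash remembers it, which a plain
uniq() over one list cannot express.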

* [PATCH 5/6] wwwstream: shorten cloneurl uniquification
  2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
                   ` (3 preceding siblings ...)
  2020-01-23 23:05 ` [PATCH 4/6] mid: shorten uniq_mids logic Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
  2020-01-23 23:05 ` [PATCH 6/6] contentid: ignore duplicate References: headers Eric Wong
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

Another place where List::Util::uniq doesn't make sense,
but there's a small op reduction to be had anyway.
---
 lib/PublicInbox/WwwStream.pm | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 8f5a6526..a724d069 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -89,12 +89,12 @@ sub _html_end {
 	my $ibx = $ctx->{-inbox};
 	my $desc = ascii_html($ibx->description);
 
-	my (%seen, @urls);
+	my @urls;
 	my $http = $self->{base_url};
 	my $max = $ibx->max_git_epoch;
 	my $dir = (split(m!/!, $http))[-1];
+	my %seen = ($http => 1);
 	if (defined($max)) { # v2
-		$seen{$http} = 1;
 		for my $i (0..$max) {
 			# old parts my be deleted:
 			-d "$ibx->{inboxdir}/git/$i.git" or next;
@@ -103,15 +103,13 @@ sub _html_end {
 			push @urls, "$url $dir/git/$i.git";
 		}
 	} else { # v1
-		$seen{$http} = 1;
 		push @urls, $http;
 	}
 
 	# FIXME: epoch splits can be different in other repositories,
 	# use the "cloneurl" file as-is for now:
 	foreach my $u (@{$ibx->cloneurl}) {
-		next if $seen{$u};
-		$seen{$u} = 1;
+		next if $seen{$u}++;
 		push @urls, $u =~ /\Ahttps?:/ ? qq(<a\nhref="$u">$u</a>) : $u;
 	}
 

^ permalink raw reply	[flat|nested] 7+ messages in thread
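
A sketch of the loop shape in patch 5, with invented URLs.  It shows
two reasons a plain uniq() does not fit: %seen is seeded with the base
URL before the loop, and the value pushed can differ from the key being
tested (an HTML link instead of the bare URL).  The markup is
simplified compared to WwwStream.pm.

```perl
use strict;
use warnings;

# Invented stand-ins for $self->{base_url} and $ibx->cloneurl.
my $http = 'https://example.com/inbox';
my @cloneurl = ($http, 'https://mirror.example.org/inbox',
	'git://example.net/inbox', 'https://mirror.example.org/inbox');

# The base URL is marked as seen up front (v1 case: it is also listed).
my %seen = ($http => 1);
my @urls = ($http);

foreach my $u (@cloneurl) {
	next if $seen{$u}++; # test and record in one step
	push @urls, $u =~ /\Ahttps?:/ ? qq(<a href="$u">$u</a>) : $u;
}

print "$_\n" for @urls;
```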

* [PATCH 6/6] contentid: ignore duplicate References: headers
  2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
                   ` (4 preceding siblings ...)
  2020-01-23 23:05 ` [PATCH 5/6] wwwstream: shorten cloneurl uniquification Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
  5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
  To: meta

OverIdx::parse_references already skips duplicate
References (which we use in SearchThread for rendering).
So there's no reason for our content deduplication logic
to care if a Message-Id in the References: header is mentioned
twice.
---
 lib/PublicInbox/ContentId.pm | 3 +--
 lib/PublicInbox/OverIdx.pm   | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 0c4a8678..65691593 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -64,8 +64,7 @@ sub content_digest ($) {
 	# if we got here, we've already got Message-ID reuse
 	my %seen = map { $_ => 1 } @{mids($hdr)};
 	foreach my $mid (@{references($hdr)}) {
-		next if $seen{$mid};
-		$dig->add("ref\0$mid\0");
+		$dig->add("ref\0$mid\0") unless $seen{$mid}++;
 	}
 
 	# Only use Sender: if From is not present
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 189bd21d..5f1007aa 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -230,8 +230,7 @@ sub parse_references ($$$) {
 			warn "References: <$ref> too long, ignoring\n";
 			next;
 		}
-		next if $seen{$ref}++;
-		push @keep, $ref;
+		push(@keep, $ref) unless $seen{$ref}++;
 	}
 	$smsg->{references} = '<'.join('> <', @keep).'>' if @keep;
 	\@keep;

^ permalink raw reply	[flat|nested] 7+ messages in thread
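
To make the effect of patch 6 concrete, here is a sketch with a
hypothetical refs_digest sub.  It assumes a SHA-256 digest object from
Digest::SHA as a stand-in for $dig and uses invented IDs.  With the old
"next if $seen{$mid}" form a repeated reference is fed to the digest
twice; with the post-increment form it is fed once, so a message with a
duplicated reference hashes the same as one without.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: digest only the references, old or new style.
sub refs_digest {
	my ($mids, $refs, $dedupe) = @_;
	my $dig = Digest::SHA->new(256);
	my %seen = map { $_ => 1 } @$mids;
	foreach my $mid (@$refs) {
		if ($dedupe) { # new form: record each reference as seen
			$dig->add("ref\0$mid\0") unless $seen{$mid}++;
		} else { # old form: only Message-IDs are skipped
			next if $seen{$mid};
			$dig->add("ref\0$mid\0");
		}
	}
	$dig->hexdigest;
}

my @mids = ('m@example.com');
my $once = refs_digest(\@mids, ['r@example.com'], 1);
my $twice_new = refs_digest(\@mids, ['r@example.com', 'r@example.com'], 1);
my $twice_old = refs_digest(\@mids, ['r@example.com', 'r@example.com'], 0);
print "new form ignores the repeat\n" if $once eq $twice_new;
print "old form did not\n" if $once ne $twice_old;
```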

Thread overview: 7+ messages
2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
2020-01-23 23:05 ` [PATCH 1/6] contentid: use map to generate %seen for Message-Ids Eric Wong
2020-01-23 23:05 ` [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID Eric Wong
2020-01-23 23:05 ` [PATCH 3/6] inbox: simplify filtering for duplicate NNTP URLs Eric Wong
2020-01-23 23:05 ` [PATCH 4/6] mid: shorten uniq_mids logic Eric Wong
2020-01-23 23:05 ` [PATCH 5/6] wwwstream: shorten cloneurl uniquification Eric Wong
2020-01-23 23:05 ` [PATCH 6/6] contentid: ignore duplicate References: headers Eric Wong


This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 meta meta/ https://public-inbox.org/meta \
		meta@public-inbox.org
	public-inbox-index meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for the project(s) associated with this inbox:

	https://80x24.org/public-inbox.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git