* [PATCH 0/6] shorten and simplify uniq logic
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
I noticed List::Util 1.45+ includes a new "uniq()" sub, but
that's only distributed with Perl as of 5.26+.
Since we care about supporting older versions of Perl, I still
took the opportunity to simplify some of our own similar logic
for making things unique. It turns out only Inbox->nntp_url
really benefits from List::Util::uniq at the moment, but there
are some small simplifications to be had along the way.
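For reference, the order-preserving de-duplication idiom this series
converges on, next to its List::Util equivalent, can be sketched like
this (a standalone example, not part of the patches):

```perl
use strict;
use warnings;

my @list = qw(a b a c b);

# order-preserving de-duplication with a %seen hash;
# works on any Perl version
my %seen;
my @uniq = grep { !$seen{$_}++ } @list;
print "@uniq\n"; # a b c

# List::Util 1.45+ (bundled with Perl 5.26+) gives the same
# result via uniq():
#   use List::Util qw(uniq);
#   @uniq = uniq(@list);
```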
Eric Wong (6):
contentid: use map to generate %seen for Message-Ids
nntp: simplify setting X-Alt-Message-ID
inbox: simplify filtering for duplicate NNTP URLs
mid: shorten uniq_mids logic
wwwstream: shorten cloneurl uniquification
contentid: ignore duplicate References: headers
lib/PublicInbox/ContentId.pm | 12 ++++--------
lib/PublicInbox/Inbox.pm | 11 +++++------
lib/PublicInbox/MID.pm | 4 +---
lib/PublicInbox/NNTP.pm | 5 +----
lib/PublicInbox/OverIdx.pm | 3 +--
lib/PublicInbox/WwwStream.pm | 8 +++-----
6 files changed, 15 insertions(+), 28 deletions(-)
* [PATCH 1/6] contentid: use map to generate %seen for Message-Ids
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
This use of map {} to build %seen is a common idiom; we no longer
consider the Message-ID itself as part of the digest, we only use
it to filter duplicates out of References.
---
lib/PublicInbox/ContentId.pm | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index eb937a0e..0c4a8678 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -60,12 +60,9 @@ sub content_digest ($) {
# References: and In-Reply-To: get used interchangeably
# in some "duplicates" in LKML. We treat them the same
# in SearchIdx, so treat them the same for this:
- my %seen;
- foreach my $mid (@{mids($hdr)}) {
- # do NOT consider the Message-ID as part of the content_id
- # if we got here, we've already got Message-ID reuse
- $seen{$mid} = 1;
- }
+ # do NOT consider the Message-ID as part of the content_id
+ # if we got here, we've already got Message-ID reuse
+ my %seen = map { $_ => 1 } @{mids($hdr)};
foreach my $mid (@{references($hdr)}) {
next if $seen{$mid};
$dig->add("ref\0$mid\0");
* [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
We can cut down on the number of operations required
using "grep" instead of "foreach".
---
lib/PublicInbox/NNTP.pm | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 35729f00..12f74c3d 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -423,10 +423,7 @@ sub set_nntp_headers ($$$$$) {
$hdr->header_set('Message-ID', $mid0);
my @alt = $hdr->header('X-Alt-Message-ID');
my %seen = map { $_ => 1 } (@alt, $mid0);
- foreach my $m (@mids) {
- next if $seen{$m}++;
- push @alt, $m;
- }
+ push(@alt, grep { !$seen{$_}++ } @mids);
$hdr->header_set('X-Alt-Message-ID', @alt);
}
* [PATCH 3/6] inbox: simplify filtering for duplicate NNTP URLs
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
And add a note to remind ourselves to use List::Util::uniq
when it becomes common.
---
lib/PublicInbox/Inbox.pm | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index e834d565..07e8b5b7 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -293,12 +293,11 @@ sub nntp_url {
# nntp://news.example.com/alt.example
push @m, $u;
}
- my %seen = map { $_ => 1 } @urls;
- foreach my $u (@m) {
- next if $seen{$u};
- $seen{$u} = 1;
- push @urls, $u;
- }
+
+ # List::Util::uniq requires Perl 5.26+, maybe we
+ # can use it by 2030 or so
+ my %seen;
+ @urls = grep { !$seen{$_}++ } (@urls, @m);
}
\@urls;
};
* [PATCH 4/6] mid: shorten uniq_mids logic
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
We won't be able to use List::Util::uniq here, but we can still
shorten our logic and make it more consistent with the rest of
our code which does similar things.
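To illustrate why a bare List::Util::uniq call doesn't fit here, a
hedged standalone sketch (not the actual PublicInbox::MID code; the
constant value is illustrative): each ID may be truncated before the
duplicate check, so the transform and the de-duplication have to
happen in the same pass:

```perl
use strict;
use warnings;

use constant MAX_MID_SIZE => 244; # illustrative limit

# simplified shape of the uniq_mids loop: truncate first, then
# de-duplicate in order; List::Util::uniq(@$mids) would compare the
# *untruncated* strings and could still emit duplicates afterwards
sub uniq_truncated {
	my ($mids) = @_;
	my (%seen, @ret);
	for my $mid (@$mids) {
		$mid = substr($mid, 0, MAX_MID_SIZE) if length($mid) > MAX_MID_SIZE;
		push(@ret, $mid) unless $seen{$mid}++;
	}
	\@ret;
}
```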
---
lib/PublicInbox/MID.pm | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index d7a42c38..33d5af74 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -120,9 +120,7 @@ sub uniq_mids ($;$) {
warn "Message-ID: <$mid> too long, truncating\n";
$mid = substr($mid, 0, MAX_MID_SIZE);
}
- next if $seen->{$mid};
- push @ret, $mid;
- $seen->{$mid} = 1;
+ push(@ret, $mid) unless $seen->{$mid}++;
}
\@ret;
}
* [PATCH 5/6] wwwstream: shorten cloneurl uniquification
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
Another place where List::Util::uniq doesn't make sense,
but there's a small op reduction to be had anyway.
---
lib/PublicInbox/WwwStream.pm | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 8f5a6526..a724d069 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -89,12 +89,12 @@ sub _html_end {
my $ibx = $ctx->{-inbox};
my $desc = ascii_html($ibx->description);
- my (%seen, @urls);
+ my @urls;
my $http = $self->{base_url};
my $max = $ibx->max_git_epoch;
my $dir = (split(m!/!, $http))[-1];
+ my %seen = ($http => 1);
if (defined($max)) { # v2
- $seen{$http} = 1;
for my $i (0..$max) {
# old parts my be deleted:
-d "$ibx->{inboxdir}/git/$i.git" or next;
@@ -103,15 +103,13 @@ sub _html_end {
push @urls, "$url $dir/git/$i.git";
}
} else { # v1
- $seen{$http} = 1;
push @urls, $http;
}
# FIXME: epoch splits can be different in other repositories,
# use the "cloneurl" file as-is for now:
foreach my $u (@{$ibx->cloneurl}) {
- next if $seen{$u};
- $seen{$u} = 1;
+ next if $seen{$u}++;
push @urls, $u =~ /\Ahttps?:/ ? qq(<a\nhref="$u">$u</a>) : $u;
}
* [PATCH 6/6] contentid: ignore duplicate References: headers
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
OverIdx::parse_references already skips duplicate References
(which we use in SearchThread for rendering), so there's no
reason for our content deduplication logic to care if a
Message-ID in the References: header is mentioned twice.
---
lib/PublicInbox/ContentId.pm | 3 +--
lib/PublicInbox/OverIdx.pm | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 0c4a8678..65691593 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -64,8 +64,7 @@ sub content_digest ($) {
# if we got here, we've already got Message-ID reuse
my %seen = map { $_ => 1 } @{mids($hdr)};
foreach my $mid (@{references($hdr)}) {
- next if $seen{$mid};
- $dig->add("ref\0$mid\0");
+ $dig->add("ref\0$mid\0") unless $seen{$mid}++;
}
# Only use Sender: if From is not present
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 189bd21d..5f1007aa 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -230,8 +230,7 @@ sub parse_references ($$$) {
warn "References: <$ref> too long, ignoring\n";
next;
}
- next if $seen{$ref}++;
- push @keep, $ref;
+ push(@keep, $ref) unless $seen{$ref}++;
}
$smsg->{references} = '<'.join('> <', @keep).'>' if @keep;
\@keep;