* [PATCH 1/6] contentid: use map to generate %seen for Message-Ids
2020-01-23 23:05 [PATCH 0/6] shorten and simplify uniq logic Eric Wong
@ 2020-01-23 23:05 ` Eric Wong
2020-01-23 23:05 ` [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID Eric Wong
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
Building %seen with map {} is a common idiom; since we no
longer consider the Message-ID as part of the digest, the
explicit foreach loop is unnecessary.
---
lib/PublicInbox/ContentId.pm | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index eb937a0e..0c4a8678 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -60,12 +60,9 @@ sub content_digest ($) {
# References: and In-Reply-To: get used interchangeably
# in some "duplicates" in LKML. We treat them the same
# in SearchIdx, so treat them the same for this:
- my %seen;
- foreach my $mid (@{mids($hdr)}) {
- # do NOT consider the Message-ID as part of the content_id
- # if we got here, we've already got Message-ID reuse
- $seen{$mid} = 1;
- }
+ # do NOT consider the Message-ID as part of the content_id
+ # if we got here, we've already got Message-ID reuse
+ my %seen = map { $_ => 1 } @{mids($hdr)};
foreach my $mid (@{references($hdr)}) {
next if $seen{$mid};
$dig->add("ref\0$mid\0");
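For readers less familiar with the idiom, here is a minimal standalone sketch of the map-to-%seen pattern used above (the Message-IDs are made up for illustration):

```perl
use strict;
use warnings;

# build a set from a list in one expression, as the patch does
# with @{mids($hdr)}; the sample IDs below are illustrative only
my @mids = ('a@example.com', 'b@example.com', 'a@example.com');
my %seen = map { $_ => 1 } @mids;

# membership tests replace the old foreach bookkeeping
print "have a\n" if $seen{'a@example.com'};
print "no c\n" unless $seen{'c@example.com'};
```

The duplicate 'a@example.com' merely overwrites its own entry, so the result is the same set the old loop built.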
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/6] nntp: simplify setting X-Alt-Message-ID
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
We can cut down on the number of operations required
by using "grep" instead of "foreach".
---
lib/PublicInbox/NNTP.pm | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 35729f00..12f74c3d 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -423,10 +423,7 @@ sub set_nntp_headers ($$$$$) {
$hdr->header_set('Message-ID', $mid0);
my @alt = $hdr->header('X-Alt-Message-ID');
my %seen = map { $_ => 1 } (@alt, $mid0);
- foreach my $m (@mids) {
- next if $seen{$m}++;
- push @alt, $m;
- }
+ push(@alt, grep { !$seen{$_}++ } @mids);
$hdr->header_set('X-Alt-Message-ID', @alt);
}
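The grep replacement above relies on the post-increment in %seen being false only on first sight of a key. A hedged sketch with made-up Message-IDs:

```perl
use strict;
use warnings;

my $mid0 = 'primary@example.com';
my @alt  = ('alt1@example.com');
my @mids = ('alt1@example.com', 'alt2@example.com', 'primary@example.com');

# seed %seen with IDs already emitted, then let grep's
# !$seen{$_}++ pass each remaining ID through only once
my %seen = map { $_ => 1 } (@alt, $mid0);
push(@alt, grep { !$seen{$_}++ } @mids);

print join(', ', @alt), "\n"; # alt1@example.com, alt2@example.com
```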
* [PATCH 3/6] inbox: simplify filtering for duplicate NNTP URLs
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
And add a note to remind ourselves to use List::Util::uniq
when it becomes common.
---
lib/PublicInbox/Inbox.pm | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index e834d565..07e8b5b7 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -293,12 +293,11 @@ sub nntp_url {
# nntp://news.example.com/alt.example
push @m, $u;
}
- my %seen = map { $_ => 1 } @urls;
- foreach my $u (@m) {
- next if $seen{$u};
- $seen{$u} = 1;
- push @urls, $u;
- }
+
+ # List::Util::uniq requires Perl 5.26+, maybe we
+ # can use it by 2030 or so
+ my %seen;
+ @urls = grep { !$seen{$_}++ } (@urls, @m);
}
\@urls;
};
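The one-liner above is the classic order-preserving uniq; once Perl 5.26+ (or List::Util 1.45+ from CPAN) can be assumed, it collapses to uniq(). A sketch with placeholder URLs:

```perl
use strict;
use warnings;

my @urls = ('nntp://a.example/inbox.test');
my @m    = ('nntp://b.example/inbox.test', 'nntp://a.example/inbox.test');

# order-preserving uniq over the concatenation, as in the patch
my %seen;
@urls = grep { !$seen{$_}++ } (@urls, @m);

# equivalent on Perl 5.26+ (or List::Util 1.45+):
#   use List::Util qw(uniq);
#   @urls = uniq(@urls, @m);
print "$_\n" for @urls;
```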
* [PATCH 4/6] mid: shorten uniq_mids logic
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
We won't be able to use List::Util::uniq here, but we can still
shorten our logic and make it more consistent with the rest of
our code which does similar things.
---
lib/PublicInbox/MID.pm | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lib/PublicInbox/MID.pm b/lib/PublicInbox/MID.pm
index d7a42c38..33d5af74 100644
--- a/lib/PublicInbox/MID.pm
+++ b/lib/PublicInbox/MID.pm
@@ -120,9 +120,7 @@ sub uniq_mids ($;$) {
warn "Message-ID: <$mid> too long, truncating\n";
$mid = substr($mid, 0, MAX_MID_SIZE);
}
- next if $seen->{$mid};
- push @ret, $mid;
- $seen->{$mid} = 1;
+ push(@ret, $mid) unless $seen->{$mid}++;
}
\@ret;
}
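The unless-postfix form above is the loop-body equivalent of the grep filter, kept as a loop because of the truncation step. A self-contained sketch of the whole function (MAX_MID_SIZE's actual value lives in MID.pm; 244 here is an assumption):

```perl
use strict;
use warnings;

use constant MAX_MID_SIZE => 244; # assumed value for illustration

sub uniq_mids_sketch {
    my ($mids) = @_;
    my ($seen, @ret) = ({});
    for my $mid (@$mids) {
        if (length($mid) > MAX_MID_SIZE) {
            warn "Message-ID: <$mid> too long, truncating\n";
            $mid = substr($mid, 0, MAX_MID_SIZE);
        }
        # first sight pushes; repeats fall through
        push(@ret, $mid) unless $seen->{$mid}++;
    }
    \@ret;
}

my $ret = uniq_mids_sketch(['x@example.com', 'x@example.com', 'y@example.com']);
print "@$ret\n"; # x@example.com y@example.com
```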
* [PATCH 5/6] wwwstream: shorten cloneurl uniquification
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
Another place where List::Util::uniq doesn't make sense,
but there's a small op reduction to be had anyway.
---
lib/PublicInbox/WwwStream.pm | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 8f5a6526..a724d069 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -89,12 +89,12 @@ sub _html_end {
my $ibx = $ctx->{-inbox};
my $desc = ascii_html($ibx->description);
- my (%seen, @urls);
+ my @urls;
my $http = $self->{base_url};
my $max = $ibx->max_git_epoch;
my $dir = (split(m!/!, $http))[-1];
+ my %seen = ($http => 1);
if (defined($max)) { # v2
- $seen{$http} = 1;
for my $i (0..$max) {
# old parts my be deleted:
-d "$ibx->{inboxdir}/git/$i.git" or next;
@@ -103,15 +103,13 @@ sub _html_end {
push @urls, "$url $dir/git/$i.git";
}
} else { # v1
- $seen{$http} = 1;
push @urls, $http;
}
# FIXME: epoch splits can be different in other repositories,
# use the "cloneurl" file as-is for now:
foreach my $u (@{$ibx->cloneurl}) {
- next if $seen{$u};
- $seen{$u} = 1;
+ next if $seen{$u}++;
push @urls, $u =~ /\Ahttps?:/ ? qq(<a\nhref="$u">$u</a>) : $u;
}
* [PATCH 6/6] contentid: ignore duplicate References: headers
From: Eric Wong @ 2020-01-23 23:05 UTC (permalink / raw)
To: meta
OverIdx::parse_references already skips duplicate
References (which we use in SearchThread for rendering),
so there's no reason for our content deduplication logic
to care if a Message-ID in the References header is
mentioned twice.
---
lib/PublicInbox/ContentId.pm | 3 +--
lib/PublicInbox/OverIdx.pm | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 0c4a8678..65691593 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -64,8 +64,7 @@ sub content_digest ($) {
# if we got here, we've already got Message-ID reuse
my %seen = map { $_ => 1 } @{mids($hdr)};
foreach my $mid (@{references($hdr)}) {
- next if $seen{$mid};
- $dig->add("ref\0$mid\0");
+ $dig->add("ref\0$mid\0") unless $seen{$mid}++;
}
# Only use Sender: if From is not present
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 189bd21d..5f1007aa 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -230,8 +230,7 @@ sub parse_references ($$$) {
warn "References: <$ref> too long, ignoring\n";
next;
}
- next if $seen{$ref}++;
- push @keep, $ref;
+ push(@keep, $ref) unless $seen{$ref}++;
}
$smsg->{references} = '<'.join('> <', @keep).'>' if @keep;
\@keep;