user/dev discussion of public-inbox itself
 help / color / Atom feed
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 01/24] linkify: support Internationalized Domain Names in URLs
Date: Tue,  4 Jun 2019 11:27:25 +0000
Message-ID: <20190604112748.23598-2-e@80x24.org> (raw)
In-Reply-To: <20190604112748.23598-1-e@80x24.org>

The "\w" character class in Perl matches any word characters
in the Unicode database, not just ASCII characters.  So we
must be prepared for that and generate links to IDNs.
---
 lib/PublicInbox/Linkify.pm |  5 +++--
 t/linkify.t                | 12 ++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Linkify.pm b/lib/PublicInbox/Linkify.pm
index d4778e7..84960a9 100644
--- a/lib/PublicInbox/Linkify.pm
+++ b/lib/PublicInbox/Linkify.pm
@@ -13,6 +13,7 @@ package PublicInbox::Linkify;
 use strict;
 use warnings;
 use Digest::SHA qw/sha1_hex/;
+use PublicInbox::Hval qw(ascii_html);
 
 my $SALT = rand;
 my $LINK_RE = qr{([\('!])?\b((?:ftps?|https?|nntps?|gopher)://
@@ -61,12 +62,12 @@ sub linkify_1 {
 			$end = ')';
 		}
 
+		$url = ascii_html($url); # for IDN
+
 		# salt this, as this could be exploited to show
 		# links in the HTML which don't show up in the raw mail.
 		my $key = sha1_hex($url . $SALT);
 
-		# only escape ampersands, others do not match LINK_RE
-		$url =~ s/&/&#38;/g;
 		$_[0]->{$key} = $url;
 		$beg . 'PI-LINK-'. $key . $end;
 	^ge;
diff --git a/t/linkify.t b/t/linkify.t
index fe218b9..c492358 100644
--- a/t/linkify.t
+++ b/t/linkify.t
@@ -132,4 +132,16 @@ use PublicInbox::Linkify;
 		'punctuation with unpaired ) OK')
 }
 
+if ('IDN example: <ACDB98F4-178C-43C3-99C4-A1D03DD6A8F5@sb.org>') {
+	my $hc = '&#26376;';
+	my $u = "http://www.\x{6708}.example.com/";
+	my $s = $u;
+	my $l = PublicInbox::Linkify->new;
+	$s = $l->linkify_1($s);
+	$s = $l->linkify_2($s);
+	my $expect = qq{<a
+href="http://www.$hc.example.com/">http://www.$hc.example.com/</a>};
+	is($s, $expect, 'IDN message escaped properly');
+}
+
 done_testing();
-- 
EW


  reply index

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-04 11:27 [PATCH 00/24] fix IDN linkification, add paranoia Eric Wong
2019-06-04 11:27 ` Eric Wong [this message]
2019-06-04 11:27 ` [PATCH 02/24] nntp: be explicit about ASCII digit matches Eric Wong
2019-06-04 11:27 ` [PATCH 03/24] nntp: ensure we only handle ASCII whitespace Eric Wong
2019-06-04 11:27 ` [PATCH 04/24] mid: id_compress requires ASCII-clean words Eric Wong
2019-06-04 11:27 ` [PATCH 05/24] feed: only accept ASCII digits for ref~$N Eric Wong
2019-06-04 11:27 ` [PATCH 06/24] http: require SERVER_PORT to be ASCII digit Eric Wong
2019-06-04 11:27 ` [PATCH 07/24] wwwlisting: require ASCII digit for port number Eric Wong
2019-06-04 11:27 ` [PATCH 08/24] wwwattach: only pass the charset through if ASCII Eric Wong
2019-06-04 11:27 ` [PATCH 09/24] www: only emit ASCII chars in attachment filenames Eric Wong
2019-06-04 11:27 ` [PATCH 10/24] www: require ASCII filenames in git blob downloads Eric Wong
2019-06-04 11:27 ` [PATCH 11/24] config: do not accept non-ASCII digits in cgitrc params Eric Wong
2019-06-04 11:27 ` [PATCH 12/24] newswww: only accept ASCII digits as article numbers Eric Wong
2019-06-04 11:27 ` [PATCH 13/24] view: require YYYYmmDD(HHMMSS) timestamps to be ASCII Eric Wong
2019-06-04 11:27 ` [PATCH 14/24] githttpbackend: require Range:, Status: to be ASCII digits Eric Wong
2019-06-04 11:27 ` [PATCH 15/24] searchview: do not allow non-ASCII offsets and limits Eric Wong
2019-06-04 11:27 ` [PATCH 16/24] msgtime: require ASCII digits for parsing dates Eric Wong
2019-06-04 11:27 ` [PATCH 17/24] filter/rubylang: require ASCII digit for mailcount Eric Wong
2019-06-04 11:27 ` [PATCH 18/24] inbox: require ASCII digits for feedmax var Eric Wong
2019-06-04 11:27 ` [PATCH 19/24] solver|viewdiff: restrict digit matches to ASCII Eric Wong
2019-06-04 11:27 ` [PATCH 20/24] www: require ASCII digit for git epoch Eric Wong
2019-06-04 11:27 ` [PATCH 21/24] require ASCII digits for local FS items Eric Wong
2019-06-04 11:27 ` [PATCH 22/24] githttpbackend: require ASCII in path Eric Wong
2019-06-04 11:27 ` [PATCH 23/24] www: require ASCII range for mbox downloads Eric Wong
2019-06-04 11:27 ` [PATCH 24/24] www: require ASCII word characters for CSS filenames Eric Wong
2019-06-05  2:18 ` [PATCH 25/24] tighten up digit matches to ASCII for git output Eric Wong

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190604112748.23598-2-e@80x24.org \
    --to=e@80x24.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror http://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.org/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox