From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH] overidx: subject_path: allow non-ASCII char in subject matches
Date: Wed, 6 Oct 2021 10:12:21 +0000
Message-ID: <20211006101221.16215-1-e@80x24.org>

This should bring us closer to the "Base subject" definition in
IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause
some breakage (until --reindex). But for now, a reindex will
prevent non-ASCII subjects from being normalized to the
same fuzzy "thread" in the thread view.
---
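A rough illustration of the effect, not part of the patch: the sketch
below compares the old ASCII-only character class with the new \w one
on two unrelated non-ASCII subjects. The helper names
subject_path_ascii and subject_path_unicode are made up for this
example, and the subject_normalized() step is left out, so treat it
as an approximation of subject_path rather than the real code path.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the old subject_path (minus
# subject_normalized): every run of characters outside a small
# ASCII set collapses to a single "_"
sub subject_path_ascii {
	my ($subj) = @_;
	$subj =~ s![^a-zA-Z0-9_\.~/\-]+!_!g;
	lc($subj);
}

# Hypothetical stand-in for the new subject_path: \w also matches
# non-ASCII word characters in decoded (character) strings
sub subject_path_unicode {
	my ($subj) = @_;
	$subj =~ s![^\w\.~/\-]+!_!g;
	lc($subj);
}

# Two unrelated subjects, written as escapes so the example does not
# depend on the source file encoding
my $ja = "\x{65e5}\x{672c}\x{8a9e}"; # "Japanese language" in Japanese
my $ko = "\x{d55c}\x{ad6d}\x{c5b4}"; # "Korean language" in Korean

printf "ASCII-only class collides: %s\n",
	subject_path_ascii($ja) eq subject_path_ascii($ko) ? 'yes' : 'no';
printf "\\w class collides: %s\n",
	subject_path_unicode($ja) eq subject_path_unicode($ko) ? 'yes' : 'no';
```

With the ASCII-only class both subjects reduce to "_" and would land
in the same fuzzy "thread"; with \w they stay distinct. Entries
indexed before this change keep their old values until a --reindex.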
 lib/PublicInbox/OverIdx.pm | 7 ++++---
 lib/PublicInbox/Smsg.pm    | 2 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 2e3d4534f125..0c8a4d9ee3f8 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -243,12 +243,13 @@ sub link_refs {
 	$tid;
 }
-# normalize subjects so they are suitable as pathnames for URLs
-# XXX: consider for removal
+# normalize subjects somewhat, they used to be ASCII-only but now
+# we use \w for UTF-8 support. We may still drop it entirely and
+# rely on Xapian for subject matches...
 sub subject_path ($) {
 	my ($subj) = @_;
 	$subj = subject_normalized($subj);
-	$subj =~ s![^a-zA-Z0-9_\.~/\-]+!_!g;
+	$subj =~ s![^\w\.~/\-]+!_!g;
 	lc($subj);
 }
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index da8ce590991a..fb28eff7326e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -145,6 +145,8 @@ sub internaldate { # for IMAP
 our $REPLY_RE = qr/^re:\s+/i;
+# TODO: see RFC 5256 sec 2.1 "Base Subject" and evaluate compatibility
+# w/ existing indices...
 sub subject_normalized ($) {
 	my ($subj) = @_;
 	$subj =~ s/\A\s+//s; # no leading space