user/dev discussion of public-inbox itself
* [PATCH] overidx: subject_path: allow non-ASCII char in subject matches
From: Eric Wong @ 2021-10-06 10:12 UTC
  To: meta

This should bring us closer to the "Base subject" definition in
IMAP ORDEREDSUBJECT (RFC 5256 2.1).  Larger changes may cause
some breakage (until --reindex).  But for now, a reindex will
prevent non-ASCII subjects from being normalized to the
same fuzzy "thread" in the thread view.
---
 lib/PublicInbox/OverIdx.pm | 7 ++++---
 lib/PublicInbox/Smsg.pm    | 2 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 2e3d4534f125..0c8a4d9ee3f8 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -243,12 +243,13 @@ sub link_refs {
 	$tid;
 }
 
-# normalize subjects so they are suitable as pathnames for URLs
-# XXX: consider for removal
+# normalize subjects somewhat, they used to be ASCII-only but now
+# we use \w for UTF-8 support.  We may still drop it entirely and
+# rely on Xapian for subject matches...
 sub subject_path ($) {
 	my ($subj) = @_;
 	$subj = subject_normalized($subj);
-	$subj =~ s![^a-zA-Z0-9_\.~/\-]+!_!g;
+	$subj =~ s![^\w\.~/\-]+!_!g;
 	lc($subj);
 }
 
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index da8ce590991a..fb28eff7326e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -145,6 +145,8 @@ sub internaldate { # for IMAP
 
 our $REPLY_RE = qr/^re:\s+/i;
 
+# TODO: see RFC 5256 sec 2.1 "Base Subject" and evaluate compatibility
+# w/ existing indices...
 sub subject_normalized ($) {
 	my ($subj) = @_;
 	$subj =~ s/\A\s+//s; # no leading space
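
The effect of the character class change in subject_path can be sketched
in isolation.  This is a minimal illustration, not the code from the
patch: the names subject_path_ascii and subject_path_unicode are
hypothetical, the subject_normalized() step is left out, and the inputs
are assumed to be decoded Perl character strings (not raw UTF-8 octets).

```perl
use strict;
use warnings;
use utf8; # the string literals below contain non-ASCII characters

# Hypothetical stand-in for the old behavior: every run of characters
# outside a small ASCII set collapses to a single "_".
sub subject_path_ascii {
	my ($subj) = @_;
	$subj =~ s![^a-zA-Z0-9_\.~/\-]+!_!g;
	lc($subj);
}

# Hypothetical stand-in for the new behavior: \w also matches Unicode
# word characters when applied to a decoded character string.
sub subject_path_unicode {
	my ($subj) = @_;
	$subj =~ s![^\w\.~/\-]+!_!g;
	lc($subj);
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $subj ('日本語', '中文', 'Hello World') {
	printf("%s => old: %s, new: %s\n", $subj,
		subject_path_ascii($subj), subject_path_unicode($subj));
}
```

With the ASCII-only class, both non-ASCII subjects above reduce to the
same "_" string, which is how unrelated messages could end up sharing one
fuzzy subject match.  With \w they stay distinct.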

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git