From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH] overidx: subject_path: allow non-ASCII char in subject matches
Date: Wed, 6 Oct 2021 10:12:21 +0000
Message-Id: <20211006101221.16215-1-e@80x24.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

This should bring us closer to the "Base subject" definition in
IMAP ORDEREDSUBJECT (RFC 5256 2.1).

Larger changes may cause some breakage (until --reindex).  But
for now, a reindex will prevent non-ASCII subjects from being
normalized to the same fuzzy "thread" in the thread view.
---
 lib/PublicInbox/OverIdx.pm | 7 ++++---
 lib/PublicInbox/Smsg.pm    | 2 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 2e3d4534f125..0c8a4d9ee3f8 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -243,12 +243,13 @@ sub link_refs {
 	$tid;
 }
 
-# normalize subjects so they are suitable as pathnames for URLs
-# XXX: consider for removal
+# normalize subjects somewhat, they used to be ASCII-only but now
+# we use \w for UTF-8 support.  We may still drop it entirely and
+# rely on Xapian for subject matches...
 sub subject_path ($) {
 	my ($subj) = @_;
 	$subj = subject_normalized($subj);
-	$subj =~ s![^a-zA-Z0-9_\.~/\-]+!_!g;
+	$subj =~ s![^\w\.~/\-]+!_!g;
 	lc($subj);
 }
 
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index da8ce590991a..fb28eff7326e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -145,6 +145,8 @@ sub internaldate { # for IMAP
 
 our $REPLY_RE = qr/^re:\s+/i;
 
+# TODO: see RFC 5256 sec 2.1 "Base Subject" and evaluate compatibility
+# w/ existing indices...
 sub subject_normalized ($) {
 	my ($subj) = @_;
 	$subj =~ s/\A\s+//s; # no leading space
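
To illustrate the effect described in the commit message, here is a
minimal standalone sketch comparing the old and new substitutions.
The names old_path() and new_path() are hypothetical and exist only
for this example; the subject_normalized() step is skipped, and the
subjects are assumed to already be decoded Perl character strings
(hence "use utf8" for the literals).  It is not the code path
public-inbox actually runs.

```perl
use strict;
use warnings;
use utf8; # the string literals below contain non-ASCII characters

# Hypothetical stand-in for the OLD substitution in subject_path():
# anything outside a small ASCII set collapses to "_".
sub old_path {
	my ($subj) = @_;
	$subj =~ s![^a-zA-Z0-9_\.~/\-]+!_!g;
	lc($subj);
}

# Hypothetical stand-in for the NEW substitution: \w also matches
# non-ASCII word characters when $subj is a character string.
sub new_path {
	my ($subj) = @_;
	$subj =~ s![^\w\.~/\-]+!_!g;
	lc($subj);
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $subj ('日本語 テスト', '中文 测试', 'Hello World') {
	printf("%s => old: %s, new: %s\n",
		$subj, old_path($subj), new_path($subj));
}
```

With the old character class, both non-ASCII subjects collapse to a
single "_" and so look like the same subject; with \w they stay
distinct, and only the space becomes "_".  The ASCII subject maps to
"hello_world" either way.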