user/dev discussion of public-inbox itself
* [PATCH] search: (really) match the behavior of WWW for indexing text
From: Eric Wong @ 2018-07-30  8:23 UTC
  To: meta

Not sure what was going through my mind when I made my first
attempt at this, but we really want to make sure we index all
the text we display in the web view (and presumably anything a
reasonable mail client can display).

Followup-to: 0cf6196025d4e4880cd1ed859257ce21dd3cdcf6
    ("search: match the behavior of WWW for indexing text")
---
 lib/PublicInbox/SearchIdx.pm | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 1d259a8..29868d9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -321,8 +321,7 @@ sub add_xapian ($$$$$) {
 		defined $s or return;
 
 		my (@orig, @quot);
-		my $body = $part->body;
-		my @lines = split(/\n/, $body);
+		my @lines = split(/\n/, $s);
 		while (defined(my $l = shift @lines)) {
 			if ($l =~ /^>/) {
 				$self->index_body(\@orig, $doc) if @orig;
-- 
EW
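
Judging only from the two hunks shown on this page, the earlier patch
decoded the part into $s but the line-splitting loop kept reading the
raw $part->body, and the change above makes the loop use the decoded
string instead.  The following is a minimal sketch of that kind of
quote-aware splitting on an already-decoded string.  split_hunks() is
a hypothetical helper written for illustration; it is not code from
public-inbox and it does not touch Xapian.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): split an
# already-decoded body string into alternating non-quoted ("orig")
# and quoted ("quot") hunks, using the same /^>/ test as the patch.
sub split_hunks {
	my ($s) = @_;
	my (@hunks, @cur);
	my $cur_type = '';
	for my $l (split(/\n/, $s)) {
		my $type = $l =~ /^>/ ? 'quot' : 'orig';
		# close the current hunk whenever the line type changes
		if ($type ne $cur_type && @cur) {
			push @hunks, [ $cur_type, join("\n", @cur) ];
			@cur = ();
		}
		$cur_type = $type;
		push @cur, $l;
	}
	push @hunks, [ $cur_type, join("\n", @cur) ] if @cur;
	return \@hunks;
}

my $decoded = "Hi,\n> old text\n> more old text\nnew reply\n";
for my $h (@{ split_hunks($decoded) }) {
	printf("%s: %s\n", $h->[0], join(' | ', split(/\n/, $h->[1])));
}
```

Each hunk could then be handed to an indexer separately, which is
roughly what the @orig and @quot arrays in the diff appear to be for.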



* [PATCH 09/10] search: match the behavior of WWW for indexing text
From: Eric Wong @ 2016-09-09  0:01 UTC
  To: meta

The basic rule is that if it is displayable via our WWW
interface, it should be indexable text for Xapian search.
---
 lib/PublicInbox/SearchIdx.pm | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0e2d225..fb68f4b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -148,7 +148,6 @@ sub add_message {
 
 	my ($doc_id, $old_tid);
 	my $mid = mid_clean(mid_mime($mime));
-	my $ct_msg = $mime->header('Content-Type') || 'text/plain';
 
 	eval {
 		die 'Message-ID too long' if length($mid) > MAX_MID_SIZE;
@@ -181,10 +180,22 @@ sub add_message {
 
 		msg_iter($mime, sub {
 			my ($part, $depth, @idx) = @{$_[0]};
-			my $ct = $part->content_type || $ct_msg;
-
-			# account for filter bugs...
-			$ct =~ m!\btext/plain\b!i or return;
+			my $ct = $part->content_type || 'text/plain';
+
+			return if $ct =~ m!\btext/x?html\b!i;
+
+			my $s = eval { $part->body_str };
+			if ($@) {
+				if ($ct =~ m!\btext/plain\b!i) {
+					# Try to assume UTF-8 because Alpine
+					# seems to do wacky things and set
+					# charset=X-UNKNOWN
+					$part->charset_set('UTF-8');
+					$s = eval { $part->body_str };
+					$s = $part->body if $@;
+				}
+			}
+			defined $s or return;
 
 			my (@orig, @quot);
 			my $body = $part->body;
-- 
EW
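
The fallback chain in the hunk above (declared charset, then UTF-8,
then the raw body) can be sketched without Email::MIME.  This is only
an illustration under stated assumptions: decode_body() is a
hypothetical helper, it uses the core Encode module rather than
body_str/charset_set, and unlike the patch it does not restrict the
UTF-8 retry to text/plain parts.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not public-inbox code): decode raw body bytes
# by trying the declared charset first, then UTF-8, and finally
# returning the bytes unchanged if both attempts fail.
sub decode_body {
	my ($bytes, $charset) = @_;
	for my $cs ($charset, 'UTF-8') {
		next unless defined $cs;
		# FB_CROAK may modify its input, so decode a copy
		my $copy = $bytes;
		my $s = eval {
			Encode::decode($cs, $copy, Encode::FB_CROAK());
		};
		return $s if defined $s;
	}
	return $bytes;
}

# An unknown charset label falls through to the UTF-8 attempt
my $s = decode_body("caf\xc3\xa9\n", 'X-UNKNOWN');
printf("%d characters\n", length($s));
```

Returning the undecoded bytes as a last resort mirrors the
"$s = $part->body if $@" line in the patch: some text is still
indexed even when no charset guess works.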



* [PATCH 0/10] search: more mairix prefix compatibility
From: Eric Wong @ 2016-09-09  0:01 UTC
  To: meta

This brings us closer to the behavior of mairix(1) for search
by supporting the n:, t:, c:, f:, tc:, tcf:, b:, and bs:
prefixes as documented in the mairix(1) manpage.

We also introduce the use of q: and nq: prefixes for quoted and
non-quoted text, respectively.

There is a schema version change in [PATCH 7/10] to maintain
compatibility with Debian 7.x wheezy installs.  In-place
reindexing would have been expensive anyway, so the schema bump
is perhaps a good idea: creating a fresh index should be faster
than --reindex.

Eric Wong (10):
      search: allow searching user fields (To/Cc/From)
      search: drop longer subject: prefix for search
      search: more granular message body searching
      search: fix space regressions from recent changes
      search: match quote detection behavior of view
      search: increase term positions for each quoted hunk
      search: fix compatibility with Debian wheezy
      search: avoid mindlessly calling body_set
      search: match the behavior of WWW for indexing text
      search: index attachment filenames

 lib/PublicInbox/Search.pm    |  32 +++++++++---
 lib/PublicInbox/SearchIdx.pm | 104 ++++++++++++++++++++++++-------------
 t/search.t                   | 120 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 206 insertions(+), 50 deletions(-)




Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
