From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH] search: index byte size of a message for IMAP search
Date: Mon, 1 Jun 2020 22:10:35 +0000
Message-Id: <20200601221035.31273-1-e@yhbt.net>

Searching for messages smaller than a certain size is allowed by
offlineimap(1), mbsync(1), and possibly other tools. Maybe
public-inbox-watch will support it, too.

I don't see a reason to expose searching by size via WWW search
right now (but maybe in the future, I could be convinced to).

Note: we only store the byte-size of the message in git; this is
typically LF-only, and we won't have the correct size after CRLF
conversion for NNTP or IMAP. However, since most folks using
tools like mbsync(1) and offlineimap(1) would be on *nix systems
where LF-only is expected, I don't see the point of spending LoC
or CPU cycles to count bytes for CRLF on the wire.
---
 lib/PublicInbox/Search.pm    | 12 ++++++++----
 lib/PublicInbox/SearchIdx.pm |  2 ++
 t/search.t                   |  6 ++++++
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index cb669e8733e..f2d3b92dc82 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -5,12 +5,16 @@
 # Read-only search interface for use by the web and NNTP interfaces
 package PublicInbox::Search;
 use strict;
-use warnings;
 
 # values for searching
-use constant TS => 0; # Received: header in Unix time
-use constant YYYYMMDD => 1; # Date: header for searching in the WWW UI
-use constant DT => 2; # Date: YYYYMMDDHHMMSS
+use constant {
+	TS => 0, # Received: header in Unix time (IMAP INTERNALDATE)
+	YYYYMMDD => 1, # Date: header for searching in the WWW UI
+	DT => 2, # Date: YYYYMMDDHHMMSS
+	BYTES => 3, # IMAP RFC822.SIZE
+	# TODO
+	# REPLYCNT => 4, # IMAP ANSWERED
+};
 
 use PublicInbox::Smsg;
 use PublicInbox::Over;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f10a9104e78..5c161b9accf 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -341,6 +341,7 @@ sub add_xapian ($$$$) {
 	add_val($doc, PublicInbox::Search::YYYYMMDD(), $yyyymmdd);
 	my $dt = strftime('%Y%m%d%H%M%S', @ds);
 	add_val($doc, PublicInbox::Search::DT(), $dt);
+	add_val($doc, PublicInbox::Search::BYTES(), $smsg->{bytes});
 
 	my $tg = term_generator($self);
 	$tg->set_document($doc);
@@ -388,6 +389,7 @@ sub add_message {
 
 	# v1 and tests only:
 	$smsg->populate($hdr, $self);
+	$smsg->{bytes} //= length($mime->as_string);
 
 	eval {
 		# order matters, overview stores every possible piece of
diff --git a/t/search.t b/t/search.t
index 6cf2bc2d6b4..cf3254169ca 100644
--- a/t/search.t
+++ b/t/search.t
@@ -318,6 +318,12 @@ $ibx->with_umask(sub {
 	foreach my $m ($mset->items) {
 		my $smsg = $ro->{over_ro}->get_art($m->get_docid);
 		like($smsg->{to}, qr/\blist\@example\.com\b/, 'to appears');
+		my $doc = $m->get_document;
+		my $col = PublicInbox::Search::BYTES();
+		my $bytes = PublicInbox::Smsg::get_val($doc, $col);
+		like($bytes, qr/\A[0-9]+\z/, '$bytes stored as digit');
+		ok($bytes > 0, '$bytes is > 0');
+		is($bytes, $smsg->{bytes}, 'bytes Xapian value matches Over');
 	}
 
 	$mset = $ro->query('tc:list@example.com', {mset => 1});
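
The LF-versus-CRLF caveat in the commit message can be illustrated with a small standalone sketch. It is not part of this patch or of public-inbox: crlf_size() is a hypothetical helper, and the sample message is made up. It only shows how far the indexed LF-only size can drift from the size a client would see on the wire, namely by one byte per bare LF.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of public-inbox): the size an LF-only
# message would have if every bare LF were sent as CRLF on the wire.
sub crlf_size {
	my ($raw) = @_; # $raw is assumed to be a byte string
	# count LFs that are not already preceded by a CR
	my $bare_lf = () = $raw =~ /(?<!\r)\n/g;
	return length($raw) + $bare_lf;
}

# Made-up sample message with LF-only line endings, as stored in git
my $raw = "From: a\@example.com\nSubject: hi\n\nbody\n";
my $stored = length($raw);  # what the BYTES value would hold
my $wire = crlf_size($raw); # what a CRLF-converting server would send
printf "stored=%d wire=%d diff=%d\n", $stored, $wire, $wire - $stored;
```

The difference always equals the number of bare LFs in the message, so a size-limited search against the indexed value errs slightly toward matching more messages than a strict on-the-wire comparison would.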