about summary refs log tree commit homepage
path: root/lib/PublicInbox/SearchMsg.pm
DateCommit message (Collapse)
2020-03-22rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-07searchmsg: allow lines (and bytes) to be zero
We will occasionally see legit messages with zero lines, be sure we index that count for NNTP clients. I'm not sure about bytes being zero (aside from purged messages), but we should've dealt with that earlier up the stack.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2019-12-28search: retry_reopen passes user arg to callback
This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
2019-12-24search: support SWIG-generated Xapian.pm
Xapian upstream is slowly phasing out the XS-based Search::Xapian in favor of the SWIG-generated "Xapian" package. While Debian and both FreeBSD have Search::Xapian, OpenBSD only includes the "Xapian" binding. More information about the status of the "Xapian" Perl module here: https://trac.xapian.org/ticket/523
2019-10-28search: support multiple From/To/Cc/Subject headers
We can easily support searching on messages with multiple From/To/Cc/Subject headers just like we do with multiple Message-ID headers. This matches the normal mutt pager display behavior.
2019-09-09run update-copyrights from gnulib for 2019
2019-06-13searchmsg: remove unused ->get subroutine
It's obsolete and unusable since our search schema version 15; which made the Xapian document ID correspond to the NNTP article number.
2019-05-15www: use Inbox->over where appropriate
We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-01-10Merge commit 'mem'
* commit 'mem': view: more culling for search threads over: cull unneeded fields for get_thread searchmsg: remove unused fields for PSGI in Xapian results searchview: drop unused {seen} hashref searchmsg: remove Xapian::Document field searchmsg: get rid of termlist scanning for mid httpd: remove psgix.harakiri reference
2019-01-09doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2019-01-08over: cull unneeded fields for get_thread
On a certain ugly /$INBOX/$MESSAGE_ID/T/ endpoint with 1000 messages in the thread, this cuts memory usage from 2.5M to 1.9M (which still isn't great, but it's a start).
2019-01-08searchmsg: remove unused fields for PSGI in Xapian results
These fields are only necessary in NNTP and not even stored in Xapian; so keeping them around for the PSGI web UI search results wastes nearly 80K when loading large result sets.
2019-01-08searchmsg: remove Xapian::Document field
We don't need to be carrying this around with the many SearchMsg objects we have. This saves about 20K from a large SearchView "&x=t" response.
2019-01-08searchmsg: get rid of termlist scanning for mid
It doesn't seem to be used anywhere
2018-04-20disallow "\t" and "\n" in OVER headers
For Subject/To/Cc/From headers, we squeeze them to a space (' '). For Message-IDs (including References/In-Reply-To), '\t', '\n', '\r' are deleted since some MUAs might screw them up: https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw
2018-04-07store less data in the Xapian document
Since we only query the SQLite over DB for OVER/XOVER; do not need to waste space storing fields To/Cc/:bytes/:lines or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in "xapian-compact --no-renumber" to ensure duplicates do not show up in search results. Since the PSGI interface is the only consumer of Xapian at the moment, it has no need to search based on NNTP article number.
2018-04-06search: index and allow searching by date-time
Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO. https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-05searchmsg: remove unused `tid' and `path' methods
These internal attributes are not exposed and no longer used in our APIs.
2018-04-02replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01search: reduce columns stored in Xapian
We can store :bytes and :lines in doc_data since we never sort or search by them. We don't have much use for the Date: stamp at the moment, either.
2018-03-29searchmsg: document why we store To: and Cc: for NNTP
Otherwise I would forget and be tempted to remove them.
2018-03-27view: depend on SearchMsg for Message-ID
Since we need to handle messages with multiple and duplicate Message-ID headers, our thread skeleton display must account for that. Since we have a "preferred" Message-ID in case of conflicts, use it as the UUID in an Atom feed so readers do not get confused by conflicts.
2018-03-27view: permalink (per-message) view shows multiple messages
This needs tests and further refinement, but current tests pass.
2018-03-22fix syntax warnings
I keep forgetting to run "make syntax"
2018-03-22use both Date: and Received: times
We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll used the Received: timestamp there, as well.
2018-03-19v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-03-06favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users for having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
2018-03-03searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03searchidx: use add_boolean_term for internal terms
Aside from the Message-Id ('Q'), these terms do not appear in content and thus have no business contributing to the Xapian document length. Thanks-to Olly Betts for the tip on xapian-discuss <20180228004400.GU12724@survex.com>
2018-03-03searchidx: support indexing multiple MIDs
It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-02-22v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine. V2Writable --> Import -> git-fast-import \-> SearchIdxThread -> Msgmap (synchronous) \-> SearchIdxPart[n] -> SearchIdx[*] \-> SearchIdxThread -> SearchIdx ("threader", a subprocess) [* ] each subprocess writes to threader
2018-02-14search: free up 'Q' prefix for a real unique identifier
This will allow easier-compatibility with v2 code which will introduce content_id as the unique identifier. The old "XMID" becomes "XM" as a free text searchable term. "Q" becomes "XMID" as a boolean prefix. There's no user-visible changes in this, but there needs to be a schema version bump later on... (more changes planned which can affect v1)
2018-02-13searchmsg: add mid_mime import for _extract_mid
Oops, I guess this code was never called and may not be needed. But for now, import it so it can run properly.
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-10-03search: try to fill in ghosts when generating thread skeleton
Since we attempt to fill in threads by Subject, our thread skeletons can cross actual thread IDs, leading to the possibility of false ghosts showing up in the skeleton. Try to fill in the ghosts as well as possible by performing a message lookup.
2017-06-14search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-01-07search: remove subject_summary
Apparently it never actually got used, and the world seems fine without it, so we can drop it. While we're at it, consider removing our subject_path usage from existence, too. We are not using fancy subject-line based URLs, here.
2017-01-07searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straighforward to me with fewer layers of indirection.
2017-01-07remove incorrect comment about strftime + locales
We only need strftime to be locale-independent when generating dates for email and HTTP headers. Purely numeric dates can use strftime for ease-of-readability.
2016-12-20searchmsg: remove ensure_metadata
Instead, only preload the ->mid field for threading, as we only need ->thread and ->path once in Search->get_thread (but we will need the ->mid field repeatedly). This more than doubles View->load_results performance on according to thread-all on an inbox with over 300K messages.
2016-12-17searchmsg: do not memoize {date} field
We only generate the ->date once in NNTP, so creating the hash entry is a waste.
2016-12-17searchmsg: remove locale-dependency for ->date
strftime is locale-dependent, which can cause surprising failures for some users.
2016-12-13searchmsg: remove unused EPOCH_822 constant
This hasn't been needed since our Email::Abstract removal for message threading.
2016-10-05thread: remove Email::Abstract wrapping
This roughly doubles performance due to the reduction in object creation and abstraction layers.
2016-08-04searchmsg: add git object ID to doc_data
Doing git tree lookups based on the SHA-1 of the Message-ID is expensive as trees get larger, instead, use the SHA-1 object ID directly. This drastically reduces the amount of time spent in the "git cat-file --batch" process for fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB git@vger.kernel.org mirror This retains backwards compatibility and allows existing indices to be transparently upgraded without performance degradation.
2016-06-25address: remove Address::from_name
Address::names is sufficient to handle what from_name did.
2016-06-20www: improve topic view by scanning for ghosts
This should help avoid having too many fake top-level messages in the topic view since we only have a partial window for threading results.
2016-05-30use utf8::{encode,decode} for in-place transforms
No need to duplicate the string when transforming it; learned from studying SpamAssassin 3.4.1