From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: brain dump detached/external index so far...
Date: Sun, 13 Sep 2020 06:55:50 +0000
Message-ID: <20200913065550.GA2337@dcvr>
[This should eventually be put into a section 5 manpage
similar to our existing v1+v2 format manpages]
One feature I've been working on is detached/external indices
for Xapian search.
Currently (and since the earliest days of this project
supporting Xapian), indices were per-inbox. This allowed
inboxes to be isolated, making it easy to add and remove
inboxes.
The detached/external indices will allow merging
several existing inboxes into a single (or several)
virtual inboxes.
Why?
We want a cross-inbox search in the WWW UI,
and perhaps an "All Mail" IMAP/JMAP inbox.
Initial idea:
We already use Xapian shards. Bump RLIMIT_NOFILE, put all the
existing shards together, query, and done!
Problem:
Queries are unusably slow with hundreds of shards:
https://lists.xapian.org/pipermail/xapian-discuss/2020-August/009815.html
(and we'd expect hundreds of thousands of shards)
Solution:
Another index, independent of existing per-inbox indices.
Sharded to CPU core count (similar to V2).
over.sqlite3 is pretty useful for sharding + deduplication,
so it will be used here, too. msgmap.sqlite3 may be optional,
since it's really only needed for NNTP...
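A minimal sketch of the shard + dedup idea, in Python for brevity.
Everything here is an assumption for illustration: the table layout is
not the real over.sqlite3 schema, content_hash() stands in for a hash
that would ignore per-list headers, and "docid modulo shard count" is
just one plausible way to pick a shard.

```python
import hashlib
import os
import sqlite3


def open_over(path=":memory:"):
    # Hypothetical schema; the real over.sqlite3 layout is different.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS over ("
        " docid INTEGER PRIMARY KEY AUTOINCREMENT,"
        " chash TEXT UNIQUE NOT NULL)"
    )
    return db


def content_hash(raw: bytes) -> str:
    # Stand-in: a real hash would skip headers that differ per list.
    return hashlib.sha256(raw).hexdigest()


def assign(db, raw: bytes, nshards: int):
    """Return (docid, shard, is_new) for a raw message."""
    chash = content_hash(raw)
    row = db.execute(
        "SELECT docid FROM over WHERE chash = ?", (chash,)
    ).fetchone()
    if row:
        # Already known: reuse the docid, no new Xapian document needed.
        docid, is_new = row[0], False
    else:
        cur = db.execute("INSERT INTO over (chash) VALUES (?)", (chash,))
        docid, is_new = cur.lastrowid, True
    # Assumed shard selection: spread docids evenly across shards.
    return docid, docid % nshards, is_new


# Shard count tied to CPU core count, as described above.
nshards = os.cpu_count() or 1
```

The SQLite lookup happens before any Xapian work, so a duplicate
never reaches a shard at all.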
Existing public-inboxes can be attached to the new index.
Detaching might not be supported right away...
Downsides:
More stuff to document, more stuff for users to learn.
Increased disk space and page cache utilization. Eventually,
the web UI should be able to just use the detached index for
search. Unfortunately, IMAP search relies on UIDs which are
per-inbox, at least.
Removing an inbox from the index can be tricky, will need a
"GC" command.
TBD:
Unindexing from "public-inbox-learn spam" may be tricky. We
need a way to prevent an untrusted inbox we're mirroring from
unindexing a message that isn't marked as deleted from a
"trusted" inbox.
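One possible way to approach this, sketched in Python: track which
inboxes still hold each message and only unindex once none do. This
is only an assumption about how it could work, not a settled design;
RefTracker and its methods are hypothetical names, and it does not
try to distinguish spam removal from ordinary deletion.

```python
class RefTracker:
    """Tracks which inboxes still hold each message (hypothetical)."""

    def __init__(self):
        self.holders = {}  # content hash -> set of inbox names

    def add(self, chash, inbox):
        self.holders.setdefault(chash, set()).add(inbox)

    def remove(self, chash, inbox):
        """Drop one inbox's reference.

        Returns True only when no inbox holds the message anymore,
        meaning it is safe to unindex it from the shared index.
        """
        inboxes = self.holders.get(chash)
        if inboxes is None:
            return False  # unknown message, nothing to unindex
        inboxes.discard(inbox)
        if inboxes:
            return False  # some other inbox still has it
        del self.holders[chash]
        return True
```

With this, a mirror dropping a message cannot remove it from the
shared index while a "trusted" inbox still carries it.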
Purge + edit support will be similarly tricky, I think.
We will probably require existing inboxes to have
indexlevel=basic before an inbox can belong to a
detached/external index.
Advantages:
Deduplication built-in. Cross-posted messages only get
expensive Xapian data indexed once (multiple List-Id can
get attached to each message).
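To illustrate the "index once, attach many List-Ids" point, here is a
small Python sketch. It assumes cross-posted copies hash identically,
which raw bytes would not do in practice since headers differ per
list; DedupIndex and _index_body() are made-up names, and
_index_body() is only a placeholder for the expensive Xapian work.

```python
import hashlib


class DedupIndex:
    def __init__(self):
        self.docs = {}            # content hash -> set of List-Id values
        self.full_index_runs = 0  # how often the expensive path ran

    def _index_body(self, raw: bytes):
        # Placeholder for expensive Xapian term generation.
        self.full_index_runs += 1

    def add(self, raw: bytes, list_id: str) -> str:
        key = hashlib.sha256(raw).hexdigest()
        if key not in self.docs:
            # First sighting: pay the full indexing cost once.
            self._index_body(raw)
            self.docs[key] = set()
        # Every sighting: cheaply record which list carried it.
        self.docs[key].add(list_id)
        return key
```

Adding the same message from two lists runs the expensive step once
and leaves both List-Id values attached to the one document.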
For users who can forgo per-inbox IMAP search, this may
lead to space savings for WWW search with many inboxes.
indexlevel=basic for NNTP and IMAP (w/o search) doesn't
require too much space.
Extra benefits of this approach:
The potential to provide a transition from v1 to v2 inboxes
by exposing them as a union of existing inboxes.
"Private inbox" support with local flags (delete, seen, replied)
Any number of virtual inboxes for grouping similar projects
together can be created by combinations of existing v1/v2
inboxes.
Disclaimer:
My brain hasn't been working quite right...
Heat + pandemic + power outages + insects + poor air quality
have all been taking their toll :<