user/dev discussion of public-inbox itself
* brain dump detached/external index so far...
@ 2020-09-13  6:55 Eric Wong
  2020-09-14 16:01 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2020-09-13  6:55 UTC
  To: meta

[This should eventually be put into a section 5 manpage
 similar to our existing v1+v2 format manpages]

One feature I've been working on is detached/external indices
for Xapian search.

Currently (and since the earliest days of this project
supporting Xapian), indices were per-inbox.  This allowed
inboxes to be isolated, making it easy to add and remove
inboxes.

The detached/external indices will allow merging
several existing inboxes into a single (or several)
virtual inboxes.

Why?

  We want a cross-inbox search in the WWW UI,
  and perhaps an "All Mail" IMAP/JMAP inbox.

Initial idea:

  We already use Xapian shards.  Bump RLIMIT_NOFILE, put all the
  existing shards together, query, and done!
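
  A minimal sketch of the RLIMIT_NOFILE part, in Python rather than
  the project's Perl and not taken from public-inbox.  bump_nofile()
  is a hypothetical helper; it raises the soft limit toward a wanted
  value without exceeding the hard limit:

```python
import resource


def bump_nofile(wanted):
    """Raise the soft RLIMIT_NOFILE toward `wanted`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited hard limit imposes no cap; otherwise never ask for more.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    # Only ever raise the soft limit, never lower it.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

  Every open shard needs file descriptors of its own, which is why
  the limit matters at all.  Raising it does nothing for query speed.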

Problem:

  Queries are unusably slow with hundreds of shards:
  https://lists.xapian.org/pipermail/xapian-discuss/2020-August/009815.html
  (and we'd expect hundreds of thousands of shards)

Solution:

  Another index, independent of existing per-inbox indices.
  Sharded to CPU core count (similar to V2).

  over.sqlite3 is pretty useful for sharding + deduplication,
  so it will be used here, too.  msgmap.sqlite3 may be optional,
  since it's really only needed for NNTP...
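
  Roughly what "sharded to CPU core count" could mean, as a Python
  sketch.  The modulo rule and the names shard_count(), shard_for()
  and partition() are assumptions for illustration, not the actual
  on-disk layout:

```python
import os


def shard_count():
    """One shard per CPU core, falling back to 1 if the count is unknown."""
    return os.cpu_count() or 1


def shard_for(docid, nshards):
    """Assumed rule: document number modulo the number of shards."""
    return docid % nshards


def partition(docids, nshards):
    """Group document numbers by the shard that would hold them."""
    shards = [[] for _ in range(nshards)]
    for docid in docids:
        shards[shard_for(docid, nshards)].append(docid)
    return shards
```

  With a fixed shard count, every indexer process owns one shard and
  the number of shards to query stays small no matter how many
  inboxes are attached.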

  Existing public-inboxes can be attached to the new index.
  Detaching might not be supported right away...

Downsides:

  More stuff to document, more stuff for users to learn.

  Increased disk space and page cache utilization.  Eventually,
  the web UI should be able to just use the detached index for
  search.  Unfortunately, IMAP search relies on UIDs which are
  per-inbox, at least.

  Removing an inbox from the index can be tricky and will need a
  "GC" command.

TBD:

  Unindexing from "public-inbox-learn spam" may be tricky.  We
  need a way to prevent an untrusted inbox we're mirroring from
  unindexing a message that isn't marked as deleted from a
  "trusted" inbox.

  Purge + edit support will be similarly tricky, I think.
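
  One possible policy for the unindexing case, as a Python sketch.
  Nothing here is decided; should_unindex() and its arguments are
  hypothetical.  The rule is that a removal coming from an untrusted
  inbox is ignored while any trusted inbox still carries the message:

```python
def should_unindex(remaining_holders, trusted, requester):
    """Decide whether a removal reported by `requester` may unindex a message.

    remaining_holders: set of inboxes, other than the requester, that
        still carry the message.
    trusted: set of inboxes whose copies are considered authoritative.
    """
    if requester in trusted:
        # Assumption: a trusted inbox may always remove the message.
        return True
    # An untrusted inbox must not override a copy held by a trusted one.
    return not (remaining_holders & trusted)
```

  Whether a trusted inbox should be able to remove a message for
  everyone is itself an open question; the sketch simply assumes yes.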

  We will probably require existing inboxes to have
  indexlevel=basic before an inbox can belong to a
  detached/external index.

Advantages:

  Deduplication built-in.  Cross-posted messages only get
  expensive Xapian data indexed once (multiple List-Id can
  get attached to each message).
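
  The deduplication idea as a Python sketch.  DedupIndex is a
  hypothetical stand-in: the counter marks where the expensive Xapian
  indexing would happen, and hashing the raw bytes is a
  simplification, since real cross-posted copies differ in some
  headers and would need a normalized form to be hashed:

```python
import hashlib


class DedupIndex:
    """Index each distinct message once; attach every List-Id seen for it."""

    def __init__(self):
        self.list_ids = {}  # content hash -> set of List-Id values
        self.indexed = 0    # how many times the expensive step ran

    def add(self, raw, list_id):
        """Add one copy of a message (bytes) as seen via `list_id`."""
        key = hashlib.sha256(raw).hexdigest()
        if key not in self.list_ids:
            self.list_ids[key] = set()
            self.indexed += 1  # stand-in for the expensive indexing work
        self.list_ids[key].add(list_id)
        return key
```

  Adding the same message via two lists runs the expensive step once
  and leaves both List-Ids attached to the one entry.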

  For users who can forgo per-inbox IMAP search, this may
  lead to space savings for WWW search with many inboxes.
  indexlevel=basic for NNTP and IMAP (w/o search) doesn't
  require too much space.

Extra benefits of this approach:

  The potential to provide a transition from v1 to v2 inboxes
  by exposing them as a union of existing inboxes.

  "Private inbox" support with local flags (delete, seen, replied)

  Any number of virtual inboxes for grouping similar projects
  together can be created by combinations of existing v1/v2
  inboxes.

Disclaimer:

  My brain hasn't been working quite right...
  Heat + pandemic + power outages + insects + poor air quality
  have all been taking their toll :<


* Re: brain dump detached/external index so far...
  2020-09-13  6:55 brain dump detached/external index so far Eric Wong
@ 2020-09-14 16:01 ` Konstantin Ryabitsev
  2020-09-14 20:55   ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Konstantin Ryabitsev @ 2020-09-14 16:01 UTC
  To: Eric Wong; +Cc: meta

On Sun, Sep 13, 2020 at 06:55:50AM +0000, Eric Wong wrote:
> Currently (and since the earliest days of this project
> supporting Xapian), indices were per-inbox.  This allowed
> inboxes to be isolated, making it easy to add and remove
> inboxes.
> 
> The detached/external indices will allow merging
> several existing inboxes into a single (or several)
> virtual inboxes.
> 
> Why?
> 
>   We want a cross-inbox search in the WWW UI,
>   and perhaps an "All Mail" IMAP/JMAP inbox.

FYI, this is the most often requested feature for lore.kernel.org, 
along with "give me a mbox.gz with all followups, regardless of 
which list they were sent to."

I think several virtual inboxes makes more sense than always one global 
search, as people may want to search something like "all Linux kernel 
discussions" or "all gcc/compiler discussions". There could be different 
frontends to indicate which search is running -- e.g.  
"kernel.lore.kernel.org" vs. "gcc.lore.kernel.org".

> Advantages:
> 
>   Deduplication built-in.  Cross-posted messages only get
>   expensive Xapian data indexed once (multiple List-Id can
>   get attached to each message).

As a side grumble, we found out that AWS SES (their email 
processing service) will force-rewrite all message IDs without any 
option to prevent this. So, a single message with multiple recipients 
will arrive with a unique Message-ID to each one of them.

This is so broken, it blows my mind, but AWS doesn't care to fix it.

> Disclaimer:
> 
>   My brain hasn't been working quite right...
>   Heat + pandemic + power outages + insects + poor air quality
>   have all been taking their toll :<

I hope things improve on the West Coast soon. The imagery alone was 
frightening to watch.

Best regards,
-K


* Re: brain dump detached/external index so far...
  2020-09-14 16:01 ` Konstantin Ryabitsev
@ 2020-09-14 20:55   ` Eric Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Wong @ 2020-09-14 20:55 UTC
  To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> I think several virtual inboxes makes more sense than always one global 
> search, as people may want to search something like "all Linux kernel 
> discussions" or "all gcc/compiler discussions". There could be different 
> frontends to indicate which search is running -- e.g.  
> "kernel.lore.kernel.org" vs. "gcc.lore.kernel.org".

It might be better to have one big index, still.  Groups can be
Xapian terms which are defined/redefined as-needed for filtering.
It would be helpful when compiler bugs are found in the course of
kernel development.
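
A toy model of that in Python, not the Xapian API.  It assumes each
document carries its inbox as a term and that a group is expanded
into a set of inboxes at query time, so redefining a group needs no
reindexing.  search() and the data layout are hypothetical:

```python
def search(docs, groups, group, word):
    """Return sorted document ids in `group` whose text contains `word`.

    docs: docid -> (inbox name, set of words in the message)
    groups: group name -> set of inbox names, redefinable at any time
    """
    members = groups[group]
    return sorted(docid for docid, (inbox, words) in docs.items()
                  if inbox in members and word in words)
```

With one index, a "toolchain" group spanning both kernel and gcc
inboxes is only a change to the groups mapping.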

> As a side grumble, we found out that AWS SES (their email 
> processing service) will force-rewrite all message IDs without any 
> option to prevent this. So, a single message with multiple recipients 
> will arrive with a unique Message-ID to each one of them.
> 
> This is so broken, it blows my mind, but AWS doesn't care to fix it.

Eeeep.  I wonder if there's some benefit for Amazon in doing this...
I can understand why centralized communications providers
would Embrace, Extend, Extinguish email; but I didn't think
they'd have anything to gain by making email worse.
(I'm still anti-monopolist, anyways)


Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
