user/dev discussion of public-inbox itself
* Nonlinear history?
@ 2017-08-16 21:01 Stefan Beller
  2017-08-23  1:42 ` Eric Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Stefan Beller @ 2017-08-16 21:01 UTC (permalink / raw)
  To: Eric Wong, meta

So I happened to search an old post of mine today,
specifically I knew only a couple of bits of it:
* I authored a patch series that had a given
  string in the name ("protocolv2")
* I was looking for an answer by Peff

To find the post in question I used both the local git mailing list repository
(as cloned from https://public-inbox.org/git) to find the starting point[1]
as well as the online list to see the relations between posts, such
that I finally arrived at [2].

However in the process of searching locally I wondered if the
repository data could be organized better, instead of linearly.

So what if the git history reflected the parent relationships
of the emails? Essentially each email is comparable to
a topic branch in the git workflow (potentially with other
series/email on top of it). Each topic would be merged
to master immediately, such that the first parent master
branch history consists of merges only; the second parent
is the new ingested email, which is either a root-commit
(when a new topic is started), or a commit building on top
of another commit (which contains the email it is responding
to; that other commit is merged to master already).
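
Roughly, with everyday git commands (branch names, paths, and
$PARENT below are made up, just to sketch the topology):

  # $PARENT = commit holding the message being replied to;
  # a brand-new topic would start from "git checkout --orphan" instead.
  git checkout -b reply "$PARENT"
  cp /path/to/reply.eml m/reply.eml && git add m/reply.eml
  git commit -m 'Re: Nonlinear history?'
  git checkout master
  git merge --no-ff reply     # first-parent history stays merges-only
  git branch -d reply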

Has this idea come up before or even been discussed before?

Thanks,
Stefan

[1] I really like the search by author feature!
[2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/


* Re: Nonlinear history?
  2017-08-16 21:01 Nonlinear history? Stefan Beller
@ 2017-08-23  1:42 ` Eric Wong
  2017-08-23 18:29   ` Stefan Beller
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2017-08-23  1:42 UTC (permalink / raw)
  To: Stefan Beller; +Cc: meta

Stefan Beller <sbeller@google.com> wrote:
> So I happened to search an old post of mine today,
> specifically I knew only a couple of bits of it:
> * I authored a patch series that had a given
>   string in the name ("protocolv2")
> * I was looking for an answer by Peff
> 
> To find the post in question I used both the local git mailing list repository
> (as cloned from https://public-inbox.org/git) to find the starting point[1]
> as well as the online list to see the relations between posts, such
> that I finally arrived at [2].
> 
> However in the process of searching locally I wondered if the
> repository data could be organized better, instead of linearly.

The reason it is organized linearly is so it can be
up-to-the-minute and fetched incrementally as soon as mail
arrives (or it is marked as spam).

However, I've been considering after-the-fact organization, too
(similar to how packing works in git).  It's not
to optimize search, but to improve storage efficiency:

1) purge spam messages from history
2) squash to reduce tree and commit objects
3) perhaps choose smarter filenames which can improve
   packing heuristics

So I'm also strongly favoring moving away from the 2/38
Message-ID naming scheme we currently use, too.
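
(For reference, the current blob path is roughly the SHA-1 of the
Message-ID split into 2/38 hex chars; exact normalization of the
Message-ID aside, something like:)

  mid='20150604130902.GA12404@peff.net'
  h=$(printf '%s' "$mid" | sha1sum | cut -c1-40)
  echo "${h:0:2}/${h:2}"    # -> xx/yyyy... path inside the repo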

> So what if the git history reflected the parent relationships
> of the emails? Essentially each email is comparable to
> a topic branch in the git workflow (potentially with other
> series/email on top of it). Each topic would be merged
> to master immediately, such that the first parent master
> branch history consists of merges only; the second parent
> is the new ingested email, which is either a root-commit
> (when a new topic is started), or a commit building on top
> of another commit (which contains the email it is responding
> to; that other commit is merged to master already).

That may not work well because emails arrive out-of-order,
especially when many are sent in rapid succession with
"git send-email".  I've had to make bugfixes to some of the
Perl+Xapian logic to deal with OOO message delivery, too :)

And having to map Message-ID to a particular commit would
require extra overhead to keep track of parents, no?

So, I think what you're describing already happens in the Perl
search code as every message gets assigned a thread_id when it
is indexed in Xapian.  I suppose you still cannot look at AGPL-3
code (being a Googler), but I stole the logic from notmuch
(C++, GPL-3+, no 'A') circa 2015/2016(*).  I believe mairix(**)
uses similar logic for mapping messages to thread IDs, too.

So the thread skeleton you see at the bottom of every message
page is done using a boolean thread_id search OR-ed with a
Subject search.

In short, I would like to depend more on the search engine for
logic and keep that flexible; but continue to keep the (git)
storage layer "dumb".  The smarts would be in Xapian, which can
be tuned and refined after-the-fact with minimal refetching.

And Xapian could also be swapped out for alternative search
engines, too (Groonga, maybe).  I consider it similar
in philosophy to git itself w.r.t. storage optimization,
merge strategies, and rename detection.

> Has this idea come up before or even been discussed before?

Not exactly what you're asking, but I guess what I described is
similar to what we already do via Xapian.

> Thanks,
> Stefan
> 
> [1] I really like the search by author feature!
> [2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/

Thanks.   Fwiw, I did not know Xapian at all when I started this
project, but it's become one of my favorite things about it :)
And I stole the f:, tc:, s: and several other prefixes from
mairix.
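
For example, a query like this (untested, just to show the prefixes)
is how I would go hunting for the post in [2] above:

  # in the search box:   s:protocolv2 f:peff
  # or over HTTP:
  curl -s 'https://public-inbox.org/git/?q=s:protocolv2+f:peff'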



(*) I've never used notmuch as I don't use Maildir for long-term
    archival and I don't think notmuch ever supported anything else.
    I also don't know C++, so maybe I interpreted wrong :x

(**) I still use mairix, but I'm considering starting a
     separate project to replace it for my personal mail(***)
     It may also be useful for prototyping future public-inbox
     changes, too.

(***) For private emails, I want IMAP support + offline
      memoization instead of my current mairix + offlineimap +
      archive-old-mail-to-mboxrd script.  I'd still want
      to rely on git for message caching/memoization.


* Re: Nonlinear history?
  2017-08-23  1:42 ` Eric Wong
@ 2017-08-23 18:29   ` Stefan Beller
  2017-08-23 19:40     ` Eric Wong
  2017-08-23 20:06     ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Stefan Beller @ 2017-08-23 18:29 UTC (permalink / raw)
  To: Eric Wong, Jeff King; +Cc: meta

On Tue, Aug 22, 2017 at 6:42 PM, Eric Wong <e@80x24.org> wrote:
> Stefan Beller <sbeller@google.com> wrote:
>> So I happened to search an old post of mine today,
>> specifically I knew only a couple of bits of it:
>> * I authored a patch series that had a given
>>   string in the name ("protocolv2")
>> * I was looking for an answer by Peff
>>
>> To find the post in question I used both the local git mailing list repository
>> (as cloned from https://public-inbox.org/git) to find the starting point[1]
>> as well as the online list to see the relations between posts, such
>> that I finally arrived at [2].
>>
>> However in the process of searching locally I wondered if the
>> repository data could be organized better, instead of linearly.
>
> The reason it is organized linearly is so it can be
> up-to-the-minute and fetched incrementally as soon as mail
> arrives (or it is marked as spam).

So the design decision is to be as fast as possible at
relaying the message, reducing the time from receiving
to publishing?

> However, I've been considering after-the-fact organization, too
> (similar to how packing works in git).  It's not
> to optimize search, but to improve storage efficiency:
>
> 1) purge spam messages from history

I thought this would already happen. Every once in a while
I get a spam mail via the mailing list and the last time I checked
it was not to be found in the public-inbox archive, so I assumed
spam filtering is already part of the decision for each new
message.

> 2) squash to reduce tree and commit objects

This would only work for patch series, so that the
author is kept (and the time is only skewed by a few seconds).

> 3) perhaps choose smarter filenames which can improve
>    packing heuristics
>
> So I'm also strongly favoring moving away from the 2/38
> Message-ID naming scheme we currently use, too.

I wondered if the email message ID (e.g.
20170823014239.GA4113@starla/) is a good base for a
naming scheme? (sharding into directories would need to
be added. Maybe even 'in reverse'? That would help
to separate mails by host/sender).
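
Something like this (purely illustrative) is what I mean by
'in reverse', grouping by host first:

  mid='20170823014239.GA4113@starla'
  echo "${mid##*@}/${mid%%@*}"    # -> starla/20170823014239.GA4113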

>> So what if the git history reflected the parent relationships
>> of the emails? Essentially each email is comparable to
>> a topic branch in the git workflow (potentially with other
>> series/email on top of it). Each topic would be merged
>> to master immediately, such that the first parent master
>> branch history consists of merges only; the second parent
>> is the new ingested email, which is either a root-commit
>> (when a new topic is started), or a commit building on top
>> of another commit (which contains the email it is responding
>> to; that other commit is merged to master already).
>
> That may not work well because emails arrive out-of-order,
> especially when many are sent in rapid succession with
> "git send-email".  I've had to make bugfixes to some of the
> Perl+Xapian logic to deal with OOO message delivery, too :)
>
> And having to map Message-ID to a particular commit would
> require extra overhead to keep track of parents, no?

Yes, but that is already a problem when using the data as a viewer.
Every time I visit https://public-inbox.org/meta/20170823014239.GA4113@starla/
the server needs to compute the "thread overview". If the git history
were grouped by message relationships, the querying could be
done via git, which (now that I think about it) may not actually be
cheaper than searching in the "unstructured" data as of now.
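
If it did work out, the upside would be that walking a thread becomes
plain git, something like (commit ID made up):

  git log --oneline --first-parent master   # one merge per ingested message
  git log --oneline deadbeef                # a reply's ancestry = what it follows up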

The receiving out of order seems to be a problem in this design.

Note that Peff seems to have built tooling around public-inbox
(https://public-inbox.org/git/20170823154747.vxtyy2v2ofkxwrkx@sigill.intra.peff.net/)
that would produce this precise lookup already.

> So, I think what you're describing already happens in the Perl
> search code as every message gets assigned a thread_id when it
> is indexed in Xapian.  I suppose you still cannot look at AGPL-3
> code (being a Googler), but I stole the logic from notmuch
> (C++, GPL-3+, no 'A') circa 2015/2016(*).  I believe mairix(**)
> uses similar logic for mapping messages to thread IDs, too.
>
> So the thread skeleton you see at the bottom of every message
> page is done using a boolean thread_id search OR-ed with a
> Subject search.

That sounds efficient.

> In short, I would like to depend more on the search engine for
> logic and keep that flexible; but continue to keep the (git)
> storage layer "dumb".  The smarts would be in Xapian, which can
> be tuned and refined after-the-fact with minimal refetching.

Eh, I see your point (and your motivation as the maintainer of
public-inbox).

As a user I would have hoped for a "smart" git layer, as I like
searching the data using git tools, which would be enhanced if
the git layer is not "dumb".

> And Xapian could also be swapped out for alternative search
> engines, too (Groonga, maybe).  I consider it similar
> in philosophy to git itself w.r.t. storage optimization,
> merge strategies, and rename detection.

So dumb data, with a smart (and potentially even smarter
in the future) program on top.

>
>> Has this idea come up before or even been discussed before?
>
> Not exactly what you're asking, but I guess what I described is
> similar to what we already do via Xapian.
>
>> Thanks,
>> Stefan
>>
>> [1] I really like the search by author feature!
>> [2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/
>
> Thanks.   Fwiw, I did not know Xapian at all when I started this
> project, but it's become one of my favorite things about it :)
> And I stole the f:, tc:, s: and several other prefixes from
> mairix.
>
>
>
> (*) I've never used notmuch as I don't use Maildir for long-term
>     archival and I don't think notmuch ever supported anything else.
>     I also don't know C++, so maybe I interpreted wrong :x
>
> (**) I still use mairix, but I'm considering starting a
>      separate project to replace it for my personal mail(***)
>      It may also be useful for prototyping future public-inbox
>      changes, too.
>
> (***) For private emails, I want IMAP support + offline
>       memoization instead of my current mairix + offlineimap +
>       archive-old-mail-to-mboxrd script.  I'd still want
>       to rely on git for message caching/memoization.

Thanks for your considerations.


* Re: Nonlinear history?
  2017-08-23 18:29   ` Stefan Beller
@ 2017-08-23 19:40     ` Eric Wong
  2017-08-23 20:06     ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Wong @ 2017-08-23 19:40 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Jeff King, meta

Stefan Beller <sbeller@google.com> wrote:
> On Tue, Aug 22, 2017 at 6:42 PM, Eric Wong <e@80x24.org> wrote:
> > Stefan Beller <sbeller@google.com> wrote:
> >> So I happened to search an old post of mine today,
> >> specifically I knew only a couple of bits of it:
> >> * I authored a patch series that had a given
> >>   string in the name ("protocolv2")
> >> * I was looking for an answer by Peff
> >>
> >> To find the post in question I used both the local git mailing list repository
> >> (as cloned from https://public-inbox.org/git) to find the starting point[1]
> >> as well as the online list to see the relations between posts, such
> >> that I finally arrived at [2].
> >>
> >> However in the process of searching locally I wondered if the
> >> repository data could be organized better, instead of linearly.
> >
> > The reason it is organized linearly is so it can be
> > up-to-the-minute and fetched incrementally as soon as mail
> > arrives (or it is marked as spam).
> 
> So the design decision is to be as fast as possible at
> relaying the message, reducing the time from receiving
> to publishing?

Yes.  I encourage folks to run "git fetch" as often as possible
since a server can die at any time.

> > However, I've been considering after-the-fact organization, too
> > (similar to how packing works in git).  It's not
> > to optimize search, but to improve storage efficiency:
> >
> > 1) purge spam messages from history
> 
> I thought this would already happen. Every once in a while
> I get a spam mail via the mailing list and the last time I checked
> it was not to be found in the public-inbox archive, so I assumed
> spam filtering is already part of the decision for each new
> message.

There's already filtering via SpamAssassin on my server, as well
as whatever vger uses, but some spam will always slip through.

You can audit my manual spam removals via: git log -p --diff-filter=D

See https://public-inbox.org/dc-dlvr-spam-flow.txt for implementation.
And I will always encourage independent audits and independently-run
instances to ensure I'm not censoring anything (or letting a cat near
an unlocked keyboard :x).
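
(Or, to list just the deleted paths in a clone:)

  # each removal shows up as a "delete mode 100644 ..." line:
  git log --diff-filter=D --summary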

> > 2) squash to reduce tree and commit objects
> 
> This would only work for patch series, so that the
> author is kept (and the time is only skewed by a few seconds).

Actually, we could periodically squash all history.
But that would make auditing message removals more difficult
for a third party, so maybe not a good idea...

> > 3) perhaps choose smarter filenames which can improve
> >    packing heuristics
> >
> > So I'm also strongly favoring moving away from the 2/38
> > Message-ID naming scheme we currently use, too.
> 
> I wondered if the email message ID (e.g.
> 20170823014239.GA4113@starla/) is a good base for a
> naming scheme? (sharding into directories would need to
> be added. Maybe even 'in reverse'? That would help
> to separate mails by host/sender).

I was thinking about using the normalized[1] subject somehow;
perhaps updating/replacing the older message since (AFAIK)
basename is a packing heuristic, and newer messages usually
quote part of the original, which should help with deltas.

That would require different handling of spam removals, though,
since some spam may have the same subject as a non-spam message, and
a deletion recorded in history might not reliably indicate that we
really want to stop showing a message.

[1] - "Re: " prefixes removed, and possibly whitespace normalized

> >> So what if the git history reflected the parent relationships
> >> of the emails? Essentially each email is comparable to
> >> a topic branch in the git workflow (potentially with other
> >> series/email on top of it). Each topic would be merged
> >> to master immediately, such that the first parent master
> >> branch history consists of merges only; the second parent
> >> is the new ingested email, which is either a root-commit
> >> (when a new topic is started), or a commit building on top
> >> of another commit (which contains the email it is responding
> >> to; that other commit is merged to master already).
> >
> > That may not work well because emails arrive out-of-order,
> > especially when many are sent in rapid succession with
> > "git send-email".  I've had to make bugfixes to some of the
> > Perl+Xapian logic to deal with OOO message delivery, too :)
> >
> > And having to map Message-ID to a particular commit would
> > require extra overhead to keep track of parents, no?
> 
> Yes, but that is already a problem when using the data as a viewer.
> Every time I visit https://public-inbox.org/meta/20170823014239.GA4113@starla/
> the server needs to compute the "thread overview". If the git history
> were grouped by message relationships, the querying could be
> done via git, which (now that I think about it) may not actually be
> cheaper than searching in the "unstructured" data as of now.

Right.  git itself isn't optimized for querying, and even doing
the 2/38 tree lookups was rather expensive, too; so we store
git object IDs directly in Xapian, now:
https://public-inbox.org/meta/20160805010300.7053-1-e@80x24.org/

Querying is what Xapian was designed for, anyways, and it seems
good at it.
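
The nice part is that retrieval stays trivial on the git side once
Xapian hands back the blob ID (ID below is made up):

  git cat-file blob 1234abcd >msg.eml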

> > In short, I would like to depend more on the search engine for
> > logic and keep that flexible; but continue to keep the (git)
> > storage layer "dumb".  The smarts would be in Xapian, which can
> > be tuned and refined after-the-fact with minimal refetching.
> 
> Eh, I see your point (and your motivation as the maintainer of
> public-inbox).
> 
> As a user I would have hoped for a "smart" git layer, as I like
> searching the data using git tools, which would be enhanced if
> the git layer is not "dumb".

I can understand your wishes; but I consider it a goal to make
the data accessible, first.  And at least some of the common
"git log" stuff still works for now (but won't if we start
squashing history).
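
For instance, this sort of thing still works against the flat history
(using Stefan's example from upthread):

  git log --oneline -S'protocolv2'   # commits that add/remove the string
  git show --stat <commit>           # then see which file (message) was added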

That way, different developers can have different priorities
w.r.t. search.  public-inbox has patch-specific searching
functionality optimized for vger workflows right now; but
perhaps somebody else could fork it and tune it for searching
other things in email.

> > And Xapian could also be swapped out for alternative search
> > engines, too (Groonga, maybe).  I consider it similar
> > in philosophy to git itself w.r.t. storage optimization,
> > merge strategies, and rename detection.
> 
> So dumb data, with a smart (and potentially even smarter
> in the future) program on top.

Exactly.


* Re: Nonlinear history?
  2017-08-23 18:29   ` Stefan Beller
  2017-08-23 19:40     ` Eric Wong
@ 2017-08-23 20:06     ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Jeff King @ 2017-08-23 20:06 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Eric Wong, meta

On Wed, Aug 23, 2017 at 11:29:24AM -0700, Stefan Beller wrote:

> Note that Peff seems to have built tooling around public-inbox
> (https://public-inbox.org/git/20170823154747.vxtyy2v2ofkxwrkx@sigill.intra.peff.net/)
> that would produce this precise lookup already.

It's not really built around public-inbox. I just like public-inbox URLs
because they use global identifiers, which means I can index them into
other systems.

My setup is basically maildir (backfilled from gmane long ago and kept
up to date with my subscription), indexed by mairix, and a script that
does m{https?://public-inbox.org/git/(\S+)} on mail contents and runs
"mairix m:$1" on the result.

It also looks for gmane.org URLs and runs

  gunzip -c ~/.gmane-to-mid.gz | grep "^${id}\$"

Not exactly high-tech, but it was easy to write and linear search is
good enough for personal use.

I used to do the same thing when gmane was up by resolving the article
numbers at gmane. But doing it without hitting the network is nicer
anyway, and of course the online method doesn't work anymore. :)

-Peff


end of thread

Thread overview: 5+ messages
2017-08-16 21:01 Nonlinear history? Stefan Beller
2017-08-23  1:42 ` Eric Wong
2017-08-23 18:29   ` Stefan Beller
2017-08-23 19:40     ` Eric Wong
2017-08-23 20:06     ` Jeff King
