user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: Stefan Beller <sbeller@google.com>
Cc: meta@public-inbox.org
Subject: Re: Nonlinear history?
Date: Wed, 23 Aug 2017 01:42:39 +0000
Message-ID: <20170823014239.GA4113@starla>
In-Reply-To: <CAGZ79kZW6O_wCZRMrWDc1yXvQzTDbFOLrcjt=81XGJj=VUjBzw@mail.gmail.com>

Stefan Beller <sbeller@google.com> wrote:
> So I happened to search an old post of mine today,
> specifically I knew only a couple of bits of it:
> * I authored a patch series, that had a given
>   string in the name ("protocolv2")
> * I was looking for an answer by Peff
> 
> To find the post in question I used both the local git mailing list repository
> (as cloned from https://public-inbox.org/git) to find the starting point[1]
> as well as the online list to see the relations between posts, such
> that I finally arrived at [2].
> 
> However in the process of searching locally I wondered if the
> repository data could be organized better, instead of linearly.

The reason it is organized linearly is so it can be
up-to-the-minute and fetched incrementally as soon as mail
arrives (or it is marked as spam).

That said, I've been considering after-the-fact organization, too
(similar to how packing works in git).  The goal there is not
to optimize search, but to improve storage efficiency:

1) purge spam messages from history
2) squash to reduce tree and commit objects
3) perhaps choose smarter filenames which can improve
   packing heuristics

So I'm also strongly in favor of moving away from the 2/38
Message-ID naming scheme we currently use.
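
For readers unfamiliar with the 2/38 scheme, here is a minimal sketch
of how such a path can be derived.  It assumes the name is the SHA-1
hex digest of the Message-ID with the angle brackets stripped, split
into a 2-character directory and a 38-character filename; the
function name path_2_38 is hypothetical and this is not the project's
actual code.

```python
import hashlib


def path_2_38(message_id: str) -> str:
    """Return a 2/38 path for a Message-ID.

    Assumption: the path is the SHA-1 hex digest of the Message-ID
    (angle brackets removed), split after the first two hex digits.
    """
    mid = message_id.strip()
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()  # 40 hex chars
    return digest[:2] + "/" + digest[2:]


# Prints a 2-char directory, a slash, and a 38-char filename.
print(path_2_38("<20170823014239.GA4113@starla>"))
```

Because the names are hash output, messages from the same thread or
author land at unrelated paths, which is one reason a different
naming scheme might pack better.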

> So what if the git history would reflect the parent relationships
> of the emails? Essentially each email is comparable to
> a topic branch in the git workflow (potentially with other
> series/email on top of it). Each topic would be merged
> to master immediately, such that the first parent master
> branch history consists of merges only; the second parent
> is the new ingested email, which is either a root-commit
> (when a new topic is started), or a commit building on top
> of another commit (which contains the email it is responding
> to; that other commit is merged to master already).

That may not work well because emails arrive out-of-order,
especially when many are sent in rapid succession with
"git send-email".  I've had to make bugfixes to some of the
Perl+Xapian logic to deal with OOO message delivery, too :)

And having to map Message-ID to a particular commit would
require extra overhead to keep track of parents, no?
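
The out-of-order problem can be illustrated with a toy model of the
proposed history.  This is only a sketch under stated assumptions:
commits are plain integers, the Ingest class and its fields are
hypothetical, and no real git objects are written.

```python
from typing import Dict, List, Optional


class Ingest:
    """Toy model: one commit per email, parented on the replied-to email."""

    def __init__(self) -> None:
        # The extra Message-ID -> commit map the proposal would require.
        self.commit_of: Dict[str, int] = {}
        self.parent_of: Dict[int, Optional[int]] = {}
        # Replies that arrived before the message they answer.
        self.orphans: List[str] = []

    def add(self, mid: str, in_reply_to: Optional[str] = None) -> int:
        commit = len(self.commit_of) + 1
        parent = self.commit_of.get(in_reply_to) if in_reply_to else None
        if in_reply_to and parent is None:
            # Parent not ingested yet: this commit becomes a root and
            # cannot be re-parented later without rewriting history.
            self.orphans.append(mid)
        self.commit_of[mid] = commit
        self.parent_of[commit] = parent
        return commit


ing = Ingest()
ing.add("cover@example")
ing.add("patch2@example", "patch1@example")  # delivered before patch 1
ing.add("patch1@example", "cover@example")
print(ing.orphans)  # ['patch2@example']
```

Patch 2 ends up as a root commit even though its parent shows up one
message later, so the git history would record the wrong relationship
permanently.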

So, I think what you're describing already happens in the Perl
search code as every message gets assigned a thread_id when it
is indexed in Xapian.  I suppose you still cannot look at AGPL-3
code (being a Googler), but I stole the logic from notmuch
(C++, GPL-3+, no 'A') circa 2015/2016(*).  I believe mairix(**)
uses similar logic for mapping messages to thread IDs, too.
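
A rough sketch of how thread IDs can be assigned at index time while
tolerating out-of-order delivery.  This is an in-memory illustration
of the general idea only, not the actual public-inbox, notmuch, or
mairix code; the ThreadIndex class and its method names are
hypothetical.

```python
from typing import Dict, List


class ThreadIndex:
    """Assign a thread_id to each message as it is indexed."""

    def __init__(self) -> None:
        # Message-ID -> thread_id.  Also holds IDs that were only
        # referenced so far and have not been indexed themselves.
        self.thread_of: Dict[str, int] = {}
        self.next_id = 1

    def add(self, mid: str, refs: List[str]) -> int:
        ids = [mid] + refs
        known = sorted({self.thread_of[m] for m in ids if m in self.thread_of})
        if known:
            tid = known[0]  # join the oldest existing thread
        else:
            tid = self.next_id
            self.next_id += 1
        # Merge any other threads this message connects into tid.
        for m, t in list(self.thread_of.items()):
            if t in known:
                self.thread_of[m] = tid
        for m in ids:
            self.thread_of[m] = tid
        return tid


idx = ThreadIndex()
first = idx.add("reply@example", ["root@example"])  # reply arrives first
second = idx.add("root@example", [])                # root arrives later
print(first == second)  # True
```

Since the mapping lives in the index rather than in git history, it
can be corrected or rebuilt when later messages reveal a connection.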

So the thread skeleton you see at the bottom of every message
page is done using a boolean thread_id search OR-ed with a
Subject search.
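
The shape of that query can be sketched in plain Python as a filter
over already-indexed messages.  This is a hedged illustration: the
Msg type, the normalize() helper, and its "Re:"-stripping rule are
assumptions, and the real lookup is a Xapian query rather than a
list scan.

```python
from typing import List, NamedTuple


class Msg(NamedTuple):
    mid: str
    thread_id: int
    subject: str


def normalize(subject: str) -> str:
    """Strip leading "Re:" prefixes and lowercase (assumed rule)."""
    s = subject.strip()
    while s.lower().startswith("re:"):
        s = s[3:].strip()
    return s.lower()


def skeleton(msgs: List[Msg], target: Msg) -> List[str]:
    """Messages matching the target's thread_id OR its subject."""
    want = normalize(target.subject)
    return [
        m.mid
        for m in msgs
        if m.thread_id == target.thread_id or normalize(m.subject) == want
    ]


msgs = [
    Msg("a", 1, "Nonlinear history?"),
    Msg("b", 1, "Re: Nonlinear history?"),
    Msg("c", 2, "Re: Nonlinear history?"),  # threading headers were lost
    Msg("d", 3, "unrelated"),
]
print(skeleton(msgs, msgs[1]))  # ['a', 'b', 'c']
```

The Subject half of the OR picks up replies whose threading headers
are missing or broken, such as message "c" above.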

In short, I would like to depend more on the search engine for
logic and keep that flexible; but continue to keep the (git)
storage layer "dumb".  The smarts would be in Xapian, which can
be tuned and refined after-the-fact with minimal refetching.

And Xapian could be swapped out for alternative search
engines, too (Groonga, maybe).  I consider this approach similar
in philosophy to git itself w.r.t. storage optimization,
merge strategies, and rename detection.

> Has this idea come up before or even been discussed before?

Not exactly what you're asking, but I guess what I described is
similar to what we already do via Xapian.

> Thanks,
> Stefan
> 
> [1] I really like the search by author feature!
> [2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/

Thanks.  FWIW, I did not know Xapian at all when I started this
project, but it's become one of my favorite things about it :)
And I stole the f:, tc:, s: and several other prefixes from
mairix.



(*) I've never used notmuch as I don't use Maildir for long-term
    archival and I don't think notmuch ever supported anything else.
    I also don't know C++, so maybe I interpreted wrong :x

(**) I still use mairix, but I'm considering starting a
     separate project to replace it for my personal mail(***).
     It may also be useful for prototyping future public-inbox
     changes.

(***) For private emails, I want IMAP support + offline
      memoization instead of my current mairix + offlineimap +
      archive-old-mail-to-mboxrd script.  I'd still want
      to rely on git for message caching/memoization.


Thread overview: 5+ messages
2017-08-16 21:01 Nonlinear history? Stefan Beller
2017-08-23  1:42 ` Eric Wong [this message]
2017-08-23 18:29   ` Stefan Beller
2017-08-23 19:40     ` Eric Wong
2017-08-23 20:06     ` Jeff King
