From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <e@80x24.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN:  
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 36D2820899;
	Wed, 23 Aug 2017 19:40:25 +0000 (UTC)
Date: Wed, 23 Aug 2017 19:40:25 +0000
From: Eric Wong <e@80x24.org>
To: Stefan Beller <sbeller@google.com>
Cc: Jeff King <peff@peff.net>, meta@public-inbox.org
Subject: Re: Nonlinear history?
Message-ID: <20170823194025.GA22495@starla>
References: <CAGZ79kZW6O_wCZRMrWDc1yXvQzTDbFOLrcjt=81XGJj=VUjBzw@mail.gmail.com>
 <20170823014239.GA4113@starla>
 <CAGZ79kYLcy_Pe0POUjUC+SaZYzFnLYhBYbY+ZNEBwc+32j8b1A@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <CAGZ79kYLcy_Pe0POUjUC+SaZYzFnLYhBYbY+ZNEBwc+32j8b1A@mail.gmail.com>
List-Id: <meta.public-inbox.org>

Stefan Beller <sbeller@google.com> wrote:
> On Tue, Aug 22, 2017 at 6:42 PM, Eric Wong <e@80x24.org> wrote:
> > Stefan Beller <sbeller@google.com> wrote:
> >> So I happened to search for an old post of mine today,
> >> specifically I knew only a couple of bits of it:
> >> * I authored a patch series, that had a given
> >>   string in the name ("protocolv2")
> >> * I was looking for an answer by Peff
> >>
> >> To find the post in question I used both the local git mailing list repository
> >> (as cloned from https://public-inbox.org/git) to find the starting point[1]
> >> as well as the online list to see the relations between posts, such
> >> that I finally arrived at [2].
> >>
> >> However in the process of searching locally I wondered if the
> >> repository data could be organized better, instead of linearly.
> >
> > The reason it is organized linearly is so it can be
> > up-to-the-minute and fetched incrementally as soon as mail
> > arrives (or it is marked as spam).
> 
> So the design decision is to be as fast as possible on
> relaying the message, reducing the time from receiving
> to publishing?

Yes.  I encourage folks to run "git fetch" as often as possible
since a server can die at any time.

> > However, I've been considering after-the-fact organization, too
> > (similar to how packing works in git).  It's not meant
> > to optimize search, though, but to improve storage efficiency:
> >
> > 1) purge spam messages from history
> 
> I thought this would already happen. Every once in a while
> I get a spam mail via the mailing list and the last time I checked
> it was not to be found in the public-inbox archive, so I assumed
> spam filtering is already part of the decision for each new
> message.

There's already filtering via SpamAssassin on my server, as well
as whatever vger uses, but some spam will always slip through.

You can audit my manual spam removals via: git log -p --diff-filter=D

See https://public-inbox.org/dc-dlvr-spam-flow.txt for implementation.
And I will always encourage independent audits and independently-run
instances to ensure I'm not censoring anything (or let a cat near
an unlocked keyboard :x)
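
For anyone who wants to script that audit, here is a minimal
sketch in Python.  It assumes a local clone of the archive and
uses --name-only instead of -p to keep the output short; the
function names and the "commit %H" output format are my own
choices for illustration, not anything public-inbox ships.

```python
import subprocess


def parse_deletions(log_text):
    """Map each commit ID to the list of paths that commit deleted.

    Expects the output of:
      git log --diff-filter=D --name-only --format='commit %H'
    """
    deletions = {}
    current = None
    for line in log_text.splitlines():
        if not line.strip():
            continue  # blank separator lines carry no information
        if line.startswith("commit "):
            current = line.split(" ", 1)[1]
            deletions[current] = []
        elif current is not None:
            deletions[current].append(line)
    return deletions


def list_deletions(repo_dir):
    """Run git in repo_dir (assumed to be a clone of the archive)."""
    out = subprocess.run(
        ["git", "log", "--diff-filter=D", "--name-only",
         "--format=commit %H"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return parse_deletions(out)
```

Every path listed is a message that was removed; running
"git show <commit>" on the commit ID shows what was in it.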

> > 2) squash to reduce tree and commit objects
> 
> This would only work for patch series, such that the
> author is kept (and time is only skewed by a few seconds)

Actually, we could periodically squash all history.
But that would make auditing message removals more difficult
for third parties, so maybe not a good idea...

> > 3) perhaps choose smarter filenames which can improve
> >    packing heuristics
> >
> > So I'm also strongly favoring moving away from the 2/38
> > Message-ID naming scheme we currently use, too.
> 
> I wondered if the email message ID (i.e.
> 20170823014239.GA4113@starla/) is a good base for a
> naming scheme? (sharding into directories would need to
> be added. Maybe even 'in reverse'? That would help
> to separate mails by host/sender).

I was thinking about using normalized[1] subject somehow;
perhaps updating/replacing the older message since (AFAIK)
basename is a packing heuristic, and newer messages usually
quote part of the original, which should help with deltas.

That would require different handling of spam removals, though,
since some spam may have the same subject as a non-spam message,
and a deletion recorded in history might not reliably indicate
that we really want to stop showing a message.

[1] - "Re: " prefixes removed, and possibly whitespace normalized
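
A rough sketch of the normalization in [1], in Python.  The
function name is hypothetical, and treating "Re:" as
case-insensitive and repeatable is my assumption; nothing above
settles those details.

```python
import re

# One or more leading "Re:" prefixes, in any letter case.
_RE_PREFIX = re.compile(r"^(?:re:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Collapse whitespace runs, then strip leading "Re:" prefixes."""
    collapsed = " ".join(subject.split())
    return _RE_PREFIX.sub("", collapsed)
```

With this, "Re: Nonlinear history?" and "Nonlinear history?"
normalize to the same string, so a reply and the message it
answers could share a basename.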

> >> So what if the git history would reflect the parent relationships
> >> of the emails? Essentially each email is comparable to
> >> a topic branch in the git workflow (potentially with other
> >> series/email on top of it). Each topic would be merged
> >> to master immediately, such that the first parent master
> >> branch history consists of merges only; the second parent
> >> is the new ingested email, which is either a root-commit
> >> (when a new topic is started), or a commit building on top
> >> of another commit (which contains the email it is responding
> >> to; that other commit is merged to master already).
> >
> > That may not work well because emails arrive out-of-order,
> > especially when many are sent in rapid succession with
> > "git send-email".  I've had to make bugfixes to some of the
> > Perl+Xapian logic to deal with OOO message delivery, too :)
> >
> > And having to map Message-ID to a particular commit would
> > require extra overhead to keep track of parents, no?
> 
> Yes, but that is already a problem while using the data as a viewer.
> Every time I visit https://public-inbox.org/meta/20170823014239.GA4113@starla/
> the server needs to compute the "thread overview". If the git history
> would be grouped by message relationships, the querying could be
> done via git, which -now that I think about it-  may not actually be
> cheaper than searching in the "unstructured" data as of now.

Right.  git itself isn't optimized for querying, and even doing
the 2/38 tree lookups was rather expensive, too; so we store
git object IDs directly in Xapian, now:
https://public-inbox.org/meta/20160805010300.7053-1-e@80x24.org/

Querying is what Xapian was designed for, anyways, and it seems
good at it.
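
To illustrate the two lookup strategies, here is a small Python
sketch.  It assumes "2/38" means the SHA-1 hex digest of the
Message-ID split after the first two characters, as git does for
loose objects; the thread does not spell that out.  ObjectIndex
is a plain dict standing in for the search index, not the Xapian
API.

```python
import hashlib


def mid_to_path(message_id):
    """Assumed 2/38 path: SHA-1 hex of the Message-ID, split 2/38."""
    mid = message_id.strip().lstrip("<").rstrip(">")
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    return digest[:2] + "/" + digest[2:]


class ObjectIndex:
    """Hypothetical stand-in for the index: Message-ID -> object ID."""

    def __init__(self):
        self._oid_by_mid = {}

    def add(self, message_id, object_id):
        self._oid_by_mid[message_id] = object_id

    def lookup(self, message_id):
        # Returns None when the Message-ID has not been indexed.
        return self._oid_by_mid.get(message_id)
```

With only the path, reading a message means something like
"git cat-file blob HEAD:<path>", which has to walk two tree
levels; with the object ID at hand, "git cat-file blob <oid>"
can read the blob directly.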

> > In short, I would like to depend more on the search engine for
> > logic and keep that flexible; but continue to keep the (git)
> > storage layer "dumb".  The smarts would be in Xapian, which can
> > be tuned and refined after-the-fact with minimal refetching.
> 
> eh. I see your point (and motivation as the maintainer of public
> inbox).
> 
> As a user I would have hoped for a "smart" git layer, as I like
> searching the data using git tools, which would be enhanced if
> the git layer is not "dumb".

I can understand your wishes; but I consider it a goal to make
the data accessible, first.  And at least some of the common
"git log" stuff still works for now (but won't if we start
squashing history).

That way, different developers can have different priorities
w.r.t. search.  public-inbox has patch-specific searching
functionality optimized for vger workflows right now; but
perhaps somebody else could fork it and tune it for searching
other things in email.

> > And Xapian could also be swapped out for alternative search
> > engines, too (Groonga, maybe).  I consider that similar in
> > philosophy to git itself w.r.t. storage optimization,
> > merge strategies, and rename detection.
> 
> So dumb data, with a smart (and potentially even smarter
> in the future) program on top.

Exactly.