From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00
    shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
    by dcvr.yhbt.net (Postfix) with ESMTP id 34B8B208DB;
    Wed, 23 Aug 2017 01:42:39 +0000 (UTC)
Date: Wed, 23 Aug 2017 01:42:39 +0000
From: Eric Wong
To: Stefan Beller
Cc: meta@public-inbox.org
Subject: Re: Nonlinear history?
Message-ID: <20170823014239.GA4113@starla>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

Stefan Beller wrote:
> So I happened to search an old post of mine today,
> specifically I knew only a couple of bits of it:
> * I authored a patch series, that had a given
>   string in the name ("protocolv2")
> * I was looking for an answer by Peff
>
> To find the post in question I used both the local git mailing list repository
> (as cloned from https://public-inbox.org/git) to find the starting point[1]
> as well as the online list to see the relations between posts, such
> that I finally arrived at [2].
>
> However in the process of searching locally I wondered if the
> repository data could be organized better, instead of linearly.

The reason it is organized linearly is so it can be up-to-the-minute
and fetched incrementally as soon as mail arrives (or it is marked
as spam).

However, I've been considering after-the-fact organization, too
(similar to how packing works in git).  However, it's not to
optimize search, but to improve storage efficiency:

1) purge spam messages from history
2) squash to reduce tree and commit objects
3) perhaps choose smarter filenames which can improve
   packing heuristics

So I'm also strongly favoring moving away from the 2/38
Message-ID naming scheme we currently use, too.

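For readers unfamiliar with the 2/38 naming scheme mentioned above, here
is a minimal sketch of what such a scheme could look like.  It assumes
the path is the SHA-1 hex digest of the Message-ID (angle brackets
stripped), split into a 2-character directory and a 38-character
filename; the function name mid_to_path is hypothetical and this is not
the project's actual code.

```python
import hashlib


def mid_to_path(message_id: str) -> str:
    """Map a Message-ID to a 2/38 path.

    Assumption: the name is the SHA-1 hex digest of the Message-ID
    without its surrounding angle brackets.
    """
    mid = message_id.strip()
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()  # 40 hex chars
    # First 2 hex chars name the directory, the remaining 38 the file.
    return f"{digest[:2]}/{digest[2:]}"


if __name__ == "__main__":
    print(mid_to_path("<20170823014239.GA4113@starla>"))
```

Under this assumption, names derived from a hash share no prefix
between related messages, which may be part of why smarter filenames
could help packing heuristics.
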
> So what if the git history would reflect the parent relationships
> of the emails? Essentially each email is comparable to
> a topic branch in the git workflow (potentially with other
> series/email on top of it). Each topic would be merged
> to master immediately, such that the first parent master
> branch history consists of merges only; the second parent
> is the new ingested email, which is either a root-commit
> (when a new topic is started), or a commit building on top
> of another commit (which contains the email it is responding
> to; that other commit is merged to master already).

That may not work well because emails arrive out-of-order,
especially when many are sent in rapid succession with
"git send-email".  I've had to make bugfixes to some of the
Perl+Xapian logic to deal with OOO message delivery, too :)

And having to map Message-ID to a particular commit would
require extra overhead to keep track of parents, no?

So, I think what you're describing already happens in the Perl
search code as every message gets assigned a thread_id when it
is indexed in Xapian.

I suppose you still cannot look at AGPL-3 code (being a
Googler), but I stole the logic from notmuch (C++, GPL-3+, no
'A') circa 2015/2016(*).  I believe mairix(**) uses similar
logic for mapping messages to thread IDs, too.

So the thread skeleton you see at the bottom of every message
page is done using a boolean thread_id search OR-ed with a
Subject search.

In short, I would like to depend more on the search engine for
logic and keep that flexible; but continue to keep the (git)
storage layer "dumb".  The smarts would be in Xapian, which can
be tuned and refined after-the-fact with minimal refetching.
And Xapian could also be swapped out for alternative search
engines, too (Groonga, maybe).

I consider it similar in philosophy to git itself w.r.t.
storage optimization, merge strategies, and rename detection.

> Has this idea been come up before or even discussed before?

Not exactly what you're asking, but I guess what I described
is similar to what we already do via Xapian.

> Thanks,
> Stefan
>
> [1] I really like the search by author feature!
> [2] https://public-inbox.org/git/20150604130902.GA12404@peff.net/

Thanks.  Fwiw, I did not know Xapian at all when I started this
project, but it's become one of my favorite things about it :)
And I stole the f:, tc:, s: and several other prefixes from mairix.

(*) I've never used notmuch as I don't use Maildir for long-term
    archival and I don't think notmuch ever supported anything
    else.  I also don't know C++, so maybe I interpreted wrong :x

(**) I still use mairix, but I'm considering starting a separate
     project to replace it for my personal mail(***).  It may also
     be useful for prototyping future public-inbox changes, too.

(***) For private emails, I want IMAP support + offline memoization
      instead of my current mairix + offlineimap +
      archive-old-mail-to-mboxrd script.  I'd still want to rely on
      git for message caching/memoization.
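
To illustrate the thread_id assignment discussed in this message, here
is a rough sketch of how thread IDs might be assigned so that
out-of-order delivery still ends up in one thread.  It is not the
Perl+Xapian code and not notmuch's implementation; the ThreadIndex
class, its methods, and the in-memory dict are all hypothetical
stand-ins for what a search index would store.

```python
from itertools import count


class ThreadIndex:
    """Toy in-memory index mapping Message-IDs to thread IDs."""

    def __init__(self):
        self._next_id = count(1)
        # Message-ID -> thread_id.  Also holds referenced IDs that have
        # not arrived yet, so a late parent joins the right thread.
        self.thread_of = {}

    def add(self, mid, references=()):
        """Index one message; return the thread_id it was assigned."""
        ids = [mid, *references]
        known = {self.thread_of[m] for m in ids if m in self.thread_of}
        tid = min(known) if known else next(self._next_id)
        if len(known) > 1:
            # The new message links several existing threads: merge
            # them all into the lowest thread_id.
            for m, t in self.thread_of.items():
                if t in known:
                    self.thread_of[m] = tid
        for m in ids:
            self.thread_of[m] = tid
        return tid

    def thread(self, mid):
        """Return every Message-ID sharing mid's thread, sorted."""
        tid = self.thread_of[mid]
        return sorted(m for m, t in self.thread_of.items() if t == tid)


if __name__ == "__main__":
    idx = ThreadIndex()
    idx.add("reply@example", references=["root@example"])  # arrives first
    idx.add("root@example")                                # arrives late
    print(idx.thread("root@example"))
```

Because the mapping lives in the index rather than in git history, it
can be rebuilt or refined later without refetching any mail.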