From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <e@80x24.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN:  
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 974A61F404;
	Thu, 15 Mar 2018 21:21:44 +0000 (UTC)
Date: Thu, 15 Mar 2018 21:21:44 +0000
From: Eric Wong <e@80x24.org>
To: Stefan Monnier <monnier@IRO.UMontreal.CA>
Cc: meta@public-inbox.org
Subject: Re: internal format
Message-ID: <20180315212144.GA3032@whir>
References: <m2371f9wau.fsf@gmail.com>
 <20180305020754.GA11496@dcvr>
 <jwv1sgl5php.fsf-monnier+inbox.comp.mail.public-inbox.meta@gnu.org>
 <20180315164012.GA20246@whir>
 <jwvzi39xkj6.fsf-monnier+Inbox@gnu.org>
 <20180315201420.GA30804@whir>
 <jwvin9xqct2.fsf-monnier+Inbox@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <jwvin9xqct2.fsf-monnier+Inbox@gnu.org>
List-Id: <meta.public-inbox.org>

Stefan Monnier <monnier@IRO.UMontreal.CA> wrote:
> >> For timing, I'm curious why you only consider
> >> "git rev-list --objects --all".  Which operation does this correspond
> >> to in public-inbox and is that really the only one that is
> >> performance-sensitive?
> > That traverses the object graph (same walk used for repacking
> > where bitmaps don't help).
> 
> Yes, I understand what it does in Git, but I wonder why a full traversal
> of the graph is the only/main operation you care about.
> 
> Hmm... I guess your other operations are:
> - lookup by message-id (which is made efficient because you index files
>   by the message-id).
> - everything else is done by keeping another index (from NNTP article
>   number to message-id (or to blob?)), as in the case of Xapian.
> 
> Actually, if you directly index the blobs, you don't really need to
> index your file by message-id (you could keep the index from message-id
> to blobs external, just as is done for Xapian, right?).

Right, storing blob OIDs in Xapian means tree lookups are irrelevant
to read performance.  Since we can rely on Xapian for v2, we can
fix the graph traversal problem by simplifying the trees, which
also speeds up writes because the trees are smaller.
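
A rough sketch of that read path, in Python.  This is only an
illustration, not public-inbox code: a plain dict stands in for
Xapian, another dict stands in for the git object store, and the
names MessageStore, add and lookup are made up.  The one part
taken from git itself is how a blob OID is derived (SHA-1 over a
"blob <size>\0" header plus the content).

```python
import hashlib


def git_blob_oid(data: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class MessageStore:
    """Hypothetical store: Message-ID -> blob OID -> raw message."""

    def __init__(self):
        self.objects = {}  # OID -> raw message (stand-in for git objects)
        self.by_mid = {}   # Message-ID -> OID (stand-in for the Xapian index)

    def add(self, message_id: str, raw: bytes) -> str:
        oid = git_blob_oid(raw)
        self.objects[oid] = raw
        self.by_mid[message_id] = oid
        return oid

    def lookup(self, message_id: str):
        # The read goes index -> OID -> blob; no tree is consulted,
        # so tree layout does not affect lookup cost.
        oid = self.by_mid.get(message_id)
        return None if oid is None else self.objects[oid]
```

With the OID in the external index, a read is a single blob fetch by
OID, so the tree layout can be chosen purely for write and traversal
cost.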

The only remaining performance pain point is the overall size of
repos (which we work around by partitioning).