From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
  shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
  by dcvr.yhbt.net (Postfix) with ESMTP id 974A61F404;
  Thu, 15 Mar 2018 21:21:44 +0000 (UTC)
Date: Thu, 15 Mar 2018 21:21:44 +0000
From: Eric Wong
To: Stefan Monnier
Cc: meta@public-inbox.org
Subject: Re: internal format
Message-ID: <20180315212144.GA3032@whir>
References: <20180305020754.GA11496@dcvr> <20180315164012.GA20246@whir> <20180315201420.GA30804@whir>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

Stefan Monnier wrote:
> >> For timing, I'm curious why you only consider
> >> "git rev-list --objects --all". Which operation does this corresponds
> >> to in public-inbox and is that really the only one that is
> >> performance-sensitive?
> > That traverses the object graph (same walk used for repacking
> > where bitmaps don't help).
>
> Yes, I understand what it does in Git, but I wonder why a full traversal
> of the graph is the only/main operation you care about.
>
> Hmm... I guess your other operations are:
> - lookup by message-id (which is made efficient because you index files
>   by the message-id).
> - everything else is done by keeping another index (from NNTP article
>   number to message-id (or to blob?)), as in the case of Xapian.
>
> Actually, if you directly index the blobs, you don't really need to
> index your file by message-id (you could keep the index from message-id
> to blobs external, just as is done for Xapian, right?).

Right, storing blob OIDs in Xapian means tree lookups are
irrelevant to read performance.

Since we can rely on Xapian for v2, we can fix the graph
traversal problem by simplifying the trees and speed up writes
by having smaller trees.
The only remaining performance pain point is the overall size of repos (which we work around by partitioning).
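
To illustrate the idea of keeping the Message-ID to blob mapping outside
of git, here is a rough Python sketch. It is not how public-inbox does it:
the ExternalIndex class and its methods are hypothetical names, and a plain
dict stands in for Xapian. The only git-specific fact it relies on is that
a blob's OID is the SHA-1 of "blob <size>\0" followed by the content.

```python
import hashlib
from email import message_from_bytes


def git_blob_oid(data: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class ExternalIndex:
    """Hypothetical stand-in for the external index: Message-ID -> blob OID.

    A dict is used here purely for illustration; the thread above
    assumes Xapian plays this role.
    """

    def __init__(self):
        self.by_mid = {}

    def add(self, raw: bytes) -> str:
        """Index one raw message and return the blob OID it would get in git."""
        msg = message_from_bytes(raw)
        mid = (msg["Message-ID"] or "").strip().strip("<>")
        if not mid:
            raise ValueError("message has no Message-ID header")
        oid = git_blob_oid(raw)
        self.by_mid[mid] = oid
        return oid

    def lookup(self, mid: str):
        """Return the blob OID for a Message-ID, or None if unknown."""
        return self.by_mid.get(mid.strip().strip("<>"))


if __name__ == "__main__":
    idx = ExternalIndex()
    raw = b"Message-ID: <20180315212144.GA3032@whir>\n\nhello\n"
    idx.add(raw)
    print(idx.lookup("<20180315212144.GA3032@whir>"))
```

With an OID in hand, a reader could fetch the message with
"git cat-file blob <oid>", which never has to walk a tree, so the tree
layout can be chosen purely for write and traversal cost.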