From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS376 132.204.0.0/16
X-Spam-Status: No, score=-2.0 required=3.0 tests=AWL,BAYES_00,
 RCVD_IN_DNSWL_MED,SPF_SOFTFAIL,T_RP_MATCHES_RCVD shortcircuit=no
 autolearn=no autolearn_force=no version=3.4.0
Received: from pruche.dit.umontreal.ca (pruche.dit.umontreal.ca [132.204.246.22])
 by dcvr.yhbt.net (Postfix) with ESMTP id A00991F42D
 for ; Thu, 15 Mar 2018 21:05:07 +0000 (UTC)
Received: from ceviche.home (lechon.iro.umontreal.ca [132.204.27.242])
 by pruche.dit.umontreal.ca (8.14.7/8.14.1) with ESMTP id w2FL533I030382;
 Thu, 15 Mar 2018 17:05:04 -0400
Received: by ceviche.home (Postfix, from userid 20848)
 id A5CE06649D; Thu, 15 Mar 2018 17:05:03 -0400 (EDT)
From: Stefan Monnier
To: Eric Wong
Cc: meta@public-inbox.org
Subject: Re: internal format
Message-ID:
References: <20180305020754.GA11496@dcvr> <20180315164012.GA20246@whir> <20180315201420.GA30804@whir>
Date: Thu, 15 Mar 2018 17:05:03 -0400
In-Reply-To: <20180315201420.GA30804@whir> (Eric Wong's message of "Thu, 15 Mar 2018 20:14:20 +0000")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-NAI-Spam-Flag: NO
X-NAI-Spam-Threshold: 5
X-NAI-Spam-Score: 0
X-NAI-Spam-Rules: 2 Rules triggered EDT_SA_DN_PASS=0, RV6243=0
X-NAI-Spam-Version: 2.3.0.9418 : core <6243> : inlines <6492> : streams <1781494> : uri <2608852>
List-Id:

>> For timing, I'm curious why you only consider
>> "git rev-list --objects --all". Which operation does this corresponds
>> to in public-inbox and is that really the only one that is
>> performance-sensitive?
> That traverses the object graph (same walk used for repacking
> where bitmaps don't help).

Yes, I understand what it does in Git, but I wonder why a full traversal
of the graph is the only/main operation you care about.

Hmm...
I guess your other operations are:
- lookup by message-id (which is made efficient because you index files
  by the message-id).
- everything else is done by keeping another index (from NNTP article
  number to message-id (or to blob?)), as in the case of Xapian.

Actually, if you directly index the blobs, you don't really need to
index your file by message-id (you could keep the index from message-id
to blobs external, just as is done for Xapian, right?).

> We currently store blob SHA-1s in Xapian to avoid tree lookups
> in git. Having a history rewrite can break an entire chain of
> unrelated messages if we store commit SHA-1 in Xapian instead of
> blobs.

Ah, indeed, keeping them as files means that the file's own SHA won't
change when you rewrite history so it makes it much easier to rewrite
history if you rely on this (also probably a lot more efficient within
Git).

>> Now I'm left wondering what it would mean for something like
>> public-inbox to support merging.
> I consider it a waste of effort to maintain an authoritive
> commit history when archiving mail.

Indeed, as long as we're left wondering what good it would do to be able
to merge, we're left with its downsides.


        Stefan
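
To illustrate the point above about a file's own SHA surviving a history
rewrite, here is a minimal sketch, assuming a repository that uses Git's
default SHA-1 object format. A blob ID is a hash of the file content plus a
short header, and nothing about commits or trees enters into it. The names
git_blob_id and build_index are hypothetical helpers made up for this
example; they are not part of public-inbox or Git, and the dict merely
stands in for an external index such as the Xapian one.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 blob ID Git assigns to `content`.

    Git hashes the header "blob <size>\\0" followed by the raw bytes.
    No commit, tree, or path information is part of the input, so
    rewriting history cannot change this value.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def build_index(messages: dict) -> dict:
    """Hypothetical external index: Message-ID -> blob ID.

    `messages` maps a Message-ID string to the raw message bytes.
    """
    return {mid: git_blob_id(raw) for mid, raw in messages.items()}


if __name__ == "__main__":
    # Illustrative data only; the Message-ID is made up.
    index = build_index({"<example@localhost>": b"hello\n"})
    print(index["<example@localhost>"])
```

An index keyed this way stays valid however the commits that reference the
blobs are rewritten, whereas an index of commit SHA-1s would have to be
rebuilt, since each commit ID depends on its parents.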