From: Stefan Monnier
To: Eric Wong
Cc: meta@public-inbox.org
Subject: Re: internal format
Date: Thu, 15 Mar 2018 14:49:35 -0400
References: <20180305020754.GA11496@dcvr> <20180315164012.GA20246@whir>
In-Reply-To: <20180315164012.GA20246@whir> (Eric Wong's message of "Thu, 15 Mar 2018 16:40:12 +0000")

> v1 or v2?  Some of the reasoning for v2 was here:
> https://public-inbox.org/meta/20180209205140.GA11047@dcvr/

IIUC, the issues you consider important are:
- Size
- Time to perform "git rev-list --objects --all"
- Flexibility, e.g. to be able to remove messages.

For size your benchmarks seem to indicate that as long as it's kept
inside Git, the choice of format doesn't actually affect it
significantly (and this matches my expectations).
It's probably possible to improve on it with enough effort (e.g.
storing attachments separately, or splitting large messages into
chunks, as `bup` does), but I doubt it's worthwhile (especially if you
assume that the mailing-list imposes a limit on message size).

For timing, I'm curious why you only consider "git rev-list --objects
--all".  Which operation does this correspond to in public-inbox, and
is that really the only one that is performance-sensitive?

> As for git itself: reliability, ease-of-replication, storage
> efficiency.

Yes, that part I totally understand (same reason I used Git in BuGit
https://gitlab.com/monnier/bugit).

Part of my question was related to the fact that in BuGit I store the
messages in the commit objects rather than in files (which trivially
gives me conflict-free merges as well as "discussion threads"), so
I was wondering if it would make sense in the case of public-inbox to
keep the email messages in the commit objects rather than in files.
But since I don't really know which operations are frequent/important,
I really have no idea.

One thing that strikes me is that you don't seem to use Git's
"decentralization": IIUC public-inbox always assumes one of the
repositories is the "master" and the others are mirrors (or mirrors of
mirrors), so you get efficient "fast-forward" updates, but you don't
do "merges".  This probably means that keeping the email messages in
commit objects wouldn't bring any benefits.

This also means that public-inbox could freely rewrite history (which
you'll need in order to really expunge messages) and just use "forced
updates" in mirrors.

Now I'm left wondering what it would mean for something like
public-inbox to support merging.


        Stefan
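
For illustration only, here is a minimal sketch of what "messages in
commit objects rather than in files" could look like with plain Git
plumbing.  It is not taken from BuGit or public-inbox; the branch name
`refs/heads/inbox`, the demo identity and the message texts are all
made up.  Every commit points at the same empty tree, so joining two
lines of replies needs no content merge at all.

```shell
set -e

# Throwaway repository for the demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical identity so commit-tree works without any user config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# All commits share one empty tree: the data lives in the commit messages.
empty_tree=$(git mktree </dev/null)

# A root message, then two independent replies to it (a "thread").
first=$(printf 'Subject: hello\n\nfirst message body\n' |
        git commit-tree "$empty_tree")
reply=$(printf 'Subject: Re: hello\n\nreply body\n' |
        git commit-tree "$empty_tree" -p "$first")
alt=$(printf 'Subject: Re: hello\n\nreply written elsewhere\n' |
      git commit-tree "$empty_tree" -p "$first")

# Join both replies: the trees are identical, so nothing can conflict.
merge=$(printf 'merge two reply branches\n' |
        git commit-tree "$empty_tree" -p "$reply" -p "$alt")
git update-ref refs/heads/inbox "$merge"

# Read everything back from the commit objects.
git log --format='%h %s' refs/heads/inbox
```

Whether this layout suits public-inbox depends on which operations
matter most, which is the open question in the message above.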