From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong
Date: Wed, 11 Jul 2018 15:01:53 -0500
Message-ID: <87k1q1bky6.fsf@xmission.com>
Subject: Q: V2 format

I have been digging through the code so that I can understand the v2
format. I have some ideas on how things might be improved, and some
questions to check my understanding.

V1 supported the concept of messages being added and deleted from the
git repository, all while keeping a full history of everything that
went on. The V2 code appears to use the name 'm' for added and 'd' for
deleted, but the public-inbox-index code appears to expect deletes to
happen by way of an altered history that totally purges the commits,
and it does not process the 'd' entries.

What is the thinking about deleted entries? For v2, what is the
preferred way to delete mail from a public-inbox git repository, and
why?

Size. Reading the history of the public-inbox meta mailing list and
playing around, I discovered that I can shave about 100M off the V2
size of the git repository for the git public inbox by pushing all of
the messages into a single commit. Not great for day-to-day operation,
but if rebases are part of the plan, and old archives are part of the
challenge, I see quite a lot of potential for old archives to be
reduced to a git repository with a single commit.

Names. Is there a good reason not to use message numbers as the names
in the git repositories (other than the cost of changing the code)?
That would remove the need to treat the sqlite msgmap database as
precious, and it would make it easier to recover if an nntp server goes
away. In V2 format the git mailing list git repository is only about 2M
larger if each message has its msg number as its name. Plus the git log
is easier to read, as messages are all + or -.

Xapian. Can the Xapian database be made optional in V2? I absolutely
think a quick search for terms and other things is very valuable, so I
would never suggest giving up Xapian. On the other hand, on my personal
laptop the Xapian database for lkml takes ages and ages to build, and
it pushes the system into swap, which is all around unpleasant. That
seems to eat into the distributed nature of the goal of public-inbox.

I have tried to see what could be done that might shrink the size of
the Xapian database.
The only thing I could think of is perhaps sharding the Xapian database
by time/msgnum ranges. That would allow the old Xapian databases to be
compacted and forgotten about, and I think it would allow less wastage
in the current Xapian database as it would be smaller, so wasting 50%
of the space (or whatever the btrees waste) would be less of an issue.
And as smaller databases are faster, I think that would in general be a
help.

Time permitting, I am willing to do some of this work so that
public-inbox works well for me. I want to see what your vision is for
the code before I start anything.

Eric
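
A rough sketch of the msgnum-range sharding idea, assuming fixed-size
ranges. The shard size and the function names are hypothetical and are
not part of public-inbox. The sketch only shows how a message number
would map to a shard, and which shards would never change again and
could therefore be compacted once.

```python
from typing import List, Tuple

# Hypothetical shard size (messages per shard); chosen only for illustration.
SHARD_SIZE = 100_000


def shard_for(msgnum: int, shard_size: int = SHARD_SIZE) -> int:
    """Return the 0-based shard index that would hold message number msgnum."""
    if msgnum < 1:
        raise ValueError("message numbers start at 1")
    return (msgnum - 1) // shard_size


def shard_range(shard: int, shard_size: int = SHARD_SIZE) -> Tuple[int, int]:
    """Return the inclusive (first, last) message numbers covered by a shard."""
    first = shard * shard_size + 1
    return first, first + shard_size - 1


def closed_shards(max_msgnum: int, shard_size: int = SHARD_SIZE) -> List[int]:
    """List shards that sit entirely below the newest message.

    The shard holding the newest message is treated as still open,
    even if it happens to be exactly full.
    """
    return list(range(shard_for(max_msgnum, shard_size)))
```

With this mapping only the newest shard receives writes, so any btree
slack is bounded by one shard rather than by the whole archive.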