From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
  shortcircuit=no autolearn=ham autolearn_force=no version=3.4.1
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
  by dcvr.yhbt.net (Postfix) with ESMTP id EB17C1F85D;
  Thu, 12 Jul 2018 01:47:15 +0000 (UTC)
Date: Thu, 12 Jul 2018 01:47:15 +0000
From: Eric Wong
To: "Eric W. Biederman"
Cc: meta@public-inbox.org
Subject: Re: Q: V2 format
Message-ID: <20180712014715.dn5aouayoa3uejp4@dcvr>
References: <87k1q1bky6.fsf@xmission.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <87k1q1bky6.fsf@xmission.com>
List-Id:

"Eric W. Biederman" wrote:
> I have been digging through the code looking so I can understand the v2
> format and I have some ideas on how things might be improved, and some
> questions so that I understand.

Great to know you're interested!  Fwiw, I've still been meaning
to turn my v2 docs into a POD manpage:

  https://public-inbox.org/meta/20180419015813.GA20051@dcvr/

> V1 supported the concept of messages being added and deleted from
> the git repository all while keeping a full history of everything that
> went on. The V2 code appears to have the name 'm' for added and 'd' for
> deleted, but the public-inbox-index code appears to expect deletes to
> happen by way of an altered history that totally purge the commits,
> and does not process the 'd' entries.

"Purge" is a new concept for v2 and not even exposed (yet) via
tools.  Normal operations to remove files using 'd' (via -watch
or -rm) don't rewrite old history, so they won't disrupt
non-force fetches.

> What is the thinking about deleted entries, and for v2 what is the
> preferred way to delete mail from a public inbox git repository and why?

Definitely prefer the normal way with 'd' files to not break
people using non-force fetches.
"Purge" is too disruptive and reserved for extraordinary cases (e.g. legal reasons). > Size. Reading the history of the public inbox meta mailling list and > playing around I discovered that I can shave off about 100M of the V2 > size of the git public inbox git repository but pushing all of the > messages into a single commit. Not great for day to day operation, > but if rebasses are part of the plan, and old archives part of the > challenge I see quite a lot of potential for old archives to be reduced > to a git repository with a single commit. Rebases/rewriting history is definitely not part of the plan and a last resort. > Names. Is there a good reason not to use message numbers as the names > in the git repositories? (Other than the cost to change the code?) That > would remove the need for treat the sqlite msgmap database as precious, > and it would make it easier to recover if an nntp server goes away. In > V2 format the git mailing list git repository is only about 2M larger if > each message has it's msg number as it's name. Plus the git log > is easier to read as messages are all + or -. Big trees in git were a scalability problem in v1 because of the long 2/38 names. With shorter names you propose (base-10 serial number?, the scalability problem gets pushed off a bit, I suppose. But not indefinitely; and later v2 partitions will suffer more from longer names. I also want to limit the use and exposure of serial numbers as much as possible. It's unavoidable with the NNTP interface; but reliance on serial numbers in public interfaces leads to centralization. The current v2 is also better for inode-starved users in case somebody forgets to type "--mirror" or "--bare" with clone. For the most part (unless purge is used), the SQLite database is actually recoverable. So no, I don't think having serial numbers stored in filenames is the right thing. > xapian. Can the Xapian database be made optional in V2? 

Definitely in the TODO :)

> I absolutely
> think a quick search for terms and other things very valuable, so I
> would never suggest giving up Xapian. On the other hand on my personal
> laptop the xapian database for lkml takes ages and ages to build, and it
> pushes the system into swap. Which is all around unpleasant. That
> seems to eat into the distributed nature of the goal of public inbox.

> I have tried to see what could be done that might shrink the size of
> the xapian database. The only thing I could think of is perhaps
> sharding the xapian database by time/msgnum ranges. That would allow
> the old xapian databases to be compacted and forgotten about, and I
> think it would allow less wastage in the current xapian database as it
> would be smaller, so wasting 50% space (or whatever the btrees waste)
> would be less of an issue. And as smaller databases are faster I think
> that would in general be a help.

One big killer for Xapian is the position information required
for "quoted phrase searches".  I seem to remember deleting the
position.* files was safe as it would only break phrase searches
(but I haven't tried it).

So there should be an option to toggle between the "index_text"
and "index_text_without_positions" routines in Xapian.

Given the way the indexing only works on the most recent data,
I think one could also write a script to delete old data/results
from Xapian without affecting current/future indexing.  That
data would pop back up if/when there are schema upgrades
requiring a rebuild, though...

I believe there should be 3 levels of v2 operation:

  1) SQLite-only (NNTP and all the threading stuff works)
  2) SQLite + Xapian w/o positions (good enough for most things)
  3) SQLite + Xapian w/ positions (current, default)

2) seems like a reasonable trade-off for most sites; I'm not
sure how often phrase searching gets used.

> Time permitting I am willing to do some of this work so that
> public-inbox works well for me.
> I want to see what your vision is for
> the code before I start anything.

Thanks for running this by, first.  I'm not convinced git layout
changes are warranted at this point for v2.

Making Xapian optional and configurable to use
index_text_without_positions is something I definitely want to
see happen, though.
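
To illustrate the trade-off behind indexing without positions, here is a
minimal toy sketch in Python.  It is not Xapian and not public-inbox
code: build_toy_index, term_search, phrase_search and stored_positions
are hypothetical names made up for this example, and the whitespace
tokenizer is a simplifying assumption.  It only models the idea that
dropping position lists keeps plain term matching working while phrase
matching stops returning results.

```python
def build_toy_index(docs, with_positions=True):
    """Build a toy inverted index: term -> {docid: [positions]}.

    With with_positions=False the position lists stay empty, which is
    a rough stand-in for indexing text without positional data.
    """
    postings = {}
    for docid, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            plist = postings.setdefault(term, {}).setdefault(docid, [])
            if with_positions:
                plist.append(pos)
    return postings


def term_search(postings, query):
    """Return docids containing every term of the query, in any order."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(set(postings.get(t, {})) for t in terms))


def phrase_search(postings, phrase):
    """Return docids where the terms occur consecutively.

    This needs position lists; on an index built without them it
    finds nothing, since there are no start positions to check.
    """
    terms = phrase.lower().split()
    hits = set()
    for docid in term_search(postings, phrase):
        starts = postings[terms[0]][docid]
        if any(all(start + i in postings[t][docid]
                   for i, t in enumerate(terms))
               for start in starts):
            hits.add(docid)
    return hits


def stored_positions(postings):
    """Count stored position entries as a crude proxy for index size."""
    return sum(len(plist)
               for per_doc in postings.values()
               for plist in per_doc.values())
```

In this toy model the position-free index stores no position entries at
all, which mirrors why level 2) above could be noticeably smaller than
level 3), at the cost of quoted phrase searches.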