From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
 shortcircuit=no autolearn=ham autolearn_force=no version=3.4.1
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
 by dcvr.yhbt.net (Postfix) with ESMTP id 042601F85E;
 Fri, 13 Jul 2018 22:22:01 +0000 (UTC)
Date: Fri, 13 Jul 2018 22:22:00 +0000
From: Eric Wong
To: "Eric W. Biederman"
Cc: meta@public-inbox.org
Subject: msgmap serial number regeneration [was: Q: V2 format]
Message-ID: <20180713222200.GB27845@dcvr>
References: <87k1q1bky6.fsf@xmission.com>
 <20180712014715.dn5aouayoa3uejp4@dcvr>
 <87k1q07dyc.fsf@xmission.com>
 <20180712230946.mqv3yjw4aabf7xrf@dcvr.yhbt.net>
 <878t6f1ch7.fsf@xmission.com>
 <87h8l2ykb4.fsf@xmission.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87h8l2ykb4.fsf@xmission.com>
List-Id:

"Eric W. Biederman" wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> > Eric Wong writes:
> >> "Eric W. Biederman" wrote:
> >>>
> >>> Because of the parallelism in V2 I have noticed messages numbered
> >>> in an order that does not correspond to their commit order.  So the
> >>> SQLite database isn't as recoverable as it might be.  Especially as the
> >>> parallelism introduces an element of non-determinacy.
> >>
> >> *puzzled* were you able to reproduce that?  The serial number
> >> generation + threading happens in the main process and the
> >> parallelism is limited to Xapian text indexing.  -index
> >> generates serial numbers by walking backwards with v2, and
> >> complains on unexpected results.
>
> Digging into this I have found consistently non-reproducible numbering,
> because of deleted files.  Apparently in both V1 and V2 a worst-case
> estimate is made of the total numbers that are going to be needed and
> numbers are assigned backwards from there.
>
> A fresh indexing of the git mailing list archive on v1 gives me numbers
> starting with 360 and on v2 numbers starting with 355.  Which
> corresponds with the number of deleted messages.
>
> I am still looking to see if there are any other weird things here.

Ah, yes, you're correct: deletes don't get accounted for when
regenerating.  Oh well.  I guess it was correct to document
msgmap as something important to back up and not break for
instances of particular servers.  (emphasis on "particular servers")

So I think you'd need to walk revision history twice to account
for deleted messages...

Across different machines, preserving serials should not matter.

> I definitely do not like not being able to reconstruct message numbers
> from a backup.

For v2, I see serial numbers as an internal optimization which
happens to map to NNTP.  If the git repo is cloned and the cloner
sets up a different server, it'll have a different address and
clients won't know to deduplicate them anyways.

I suppose it makes the load-balanced case a little more complex
to sync(*).

And this can't even account for independently started mirrors
with no common git ancestry, as SMTP has zero guarantees on
ordering.

(*) But optimizing for load-balanced instances isn't ideal;
    I'd rather see more independently-run servers than giant
    load-balanced instances which everybody relies on.
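
To make the delete problem above concrete, here is a toy model in
Python.  It is only a sketch and not public-inbox code.  The
history format (a list of add/delete commits, oldest first), every
function name, and the idea that the worst-case estimate is simply
the total commit count are all assumptions made for illustration.
It shows how a single backwards walk could shift every serial by
the number of deletes, and how a counting pass before the
assigning pass would avoid that.

```python
# Toy model only (hypothetical names, not the real msgmap logic).
# A history is a list of ("add", mid) / ("delete", mid) commits,
# oldest first.  Re-adding a deleted Message-ID is not modelled.

def live_numbering(history):
    """Serials as a running server might assign them on delivery:
    one per add, counting up, never reused after a delete."""
    serials, next_num = {}, 1
    for op, mid in history:
        if op == "add":
            serials[mid] = next_num
            next_num += 1
        else:
            serials.pop(mid, None)
    return serials

def regenerate(history, start):
    """Walk newest-first, handing out serials downward from start."""
    serials, deleted, num = {}, set(), start
    for op, mid in reversed(history):
        if op == "delete":
            deleted.add(mid)          # seen before the matching add
        else:
            if mid in deleted:
                deleted.discard(mid)  # skip it, but it used a serial
            else:
                serials[mid] = num
            num -= 1
    return serials

def one_pass(history):
    # Assumed worst case: every commit might add a message.
    return regenerate(history, start=len(history))

def two_pass(history):
    # First pass counts only the adds; second pass assigns.
    adds = sum(1 for op, _ in history if op == "add")
    return regenerate(history, start=adds)

example = [("add", "a"), ("add", "b"), ("add", "c"), ("delete", "b"),
           ("add", "d"), ("add", "e"), ("delete", "d")]
print(live_numbering(example))  # serials a=1, c=3, e=5
print(one_pass(example))        # serials a=3, c=5, e=7 (shifted by 2 deletes)
print(two_pass(example))        # serials a=1, c=3, e=5
```

In this model the one-pass result starts at (number of deletes + 1),
while the two-pass result matches what the live server handed out.
Whether the real indexer behaves exactly this way is not something
the sketch claims.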