From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 142121F42D;
	Mon, 19 Mar 2018 07:43:51 +0000 (UTC)
Date: Mon, 19 Mar 2018 07:43:50 +0000
From: Eric Wong
To: =?utf-8?Q?Nicol=C3=A1s_Ojeda_B=C3=A4r?=
Cc: meta@public-inbox.org
Subject: watch performance [was: Relationship between public-inbox and ssoma?]
Message-ID: <20180319074350.ga4ndkyubdrif5xu@dcvr.yhbt.net>
References: <20180305020754.GA11496@dcvr> <20180305175007.GA19007@whir>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To:
List-Id:

Nicolás Ojeda Bär wrote:
> On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong wrote:
> > Nicolás Ojeda Bär wrote:
> >> Hello Eric,
> >>
> >> Thanks for the prompt reply. I am trying to migrate a long-lived
> >> mailing list (65k messages over 26 years), below are some
> >> troubles/questions I am having;
> >> any suggestions would be greatly appreciated.
> >>
> >> - public-inbox-watch seems to struggle with very big maildirs; for now
> >> I am moving the data into the maildir a little at a time and that
> >> seems to work. Is there a particular obstacle
> >> to making the importing process more incremental?

Heh, I've been adjusting some of that code to support v2 and
-watch has actually been incremental for a while. It tries to
balance work between inboxes fairly and might be writing more
data out to disk than you want it to for initial imports. It was
a trade-off between letting readers see up-to-date data and
throughput.

Also, I forgot to ask, are you on Linux with inotify support? I
haven't tried Filesys::Notify::Simple (used by -watch) without
it so maybe other OSes struggle.

> > I usually prefer one-off scripts like
> > scripts/import_vger_from_mbox for initial imports and store
> > large archives in compressed mboxes instead of Maildir. Lack of
> > mbox support is one reason I never used notmuch despite studying
> > it.

Ah, another thing I do almost subconsciously when running imports
and tests is use "eatmydata" to disable fsync:

  https://www.flamingspork.com/projects/libeatmydata/

Running -watch with eatmydata on my desktop with an SSD, I
didn't notice any problems with ~28K mails from LKML from the
past month or so.

It might be a pain to support our own knobs for disabling fsync:
There's one knob for Xapian (only 1.4.x, I think), one knob for
SQLite, and git doesn't allow disabling fsync on packs yet, only
loose objects at the moment; so "eatmydata" is probably the
easiest.

> > And I'm thinking about favoring Received: over Date: if both
> > exist, since Date: headers are more often wrong...

Ugh, but there's patchbombs and git adjusts Date: to get sorting
right for MUAs, so using Received: makes those out-of-order :<

So overall inbox sorting might use Received:, but sorting within
individual threads will need to use the Date: header.
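
A rough sketch of that split, in Python rather than the Perl that
public-inbox is written in. The function names are made up for
illustration, and taking the topmost Received: header (with a
fallback to Date: when there is none) is an assumption on my part,
not something public-inbox is known to do:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def date_time(msg):
    """Timestamp from the Date: header, as set by the sender."""
    return parsedate_to_datetime(msg["Date"])


def received_time(msg):
    """Timestamp from the topmost Received: header.

    Falls back to Date: when the message has no Received: header
    (an assumption made for this sketch).
    """
    received = msg.get_all("Received")
    if not received:
        return date_time(msg)
    # In a Received: header the timestamp follows the last ';'.
    stamp = received[0].rsplit(";", 1)[1].strip()
    return parsedate_to_datetime(stamp)


def inbox_order(msgs):
    """Overall inbox view: order by when the mail was received."""
    return sorted(msgs, key=received_time)


def thread_order(msgs):
    """Within one thread: order by Date:, which keeps patch series in order."""
    return sorted(msgs, key=date_time)


def make_msg(subject, date, received):
    """Build a tiny message for the demo below (hypothetical helper)."""
    raw = (
        f"Received: from a by b; {received}\n"
        f"Date: {date}\n"
        f"Subject: {subject}\n"
        "\n"
        "body\n"
    )
    return message_from_string(raw)


# A two-patch series whose second patch happened to be delivered first.
demo_patch1 = make_msg("[PATCH 1/2]", "Mon, 19 Mar 2018 07:00:00 +0000",
                       "Mon, 19 Mar 2018 07:00:09 +0000")
demo_patch2 = make_msg("[PATCH 2/2]", "Mon, 19 Mar 2018 07:00:01 +0000",
                       "Mon, 19 Mar 2018 07:00:05 +0000")
demo_thread = [m["Subject"] for m in thread_order([demo_patch2, demo_patch1])]
demo_inbox = [m["Subject"] for m in inbox_order([demo_patch1, demo_patch2])]
```

Here demo_thread keeps the series in patch order because it only
looks at Date:, while demo_inbox follows arrival time, which is
the out-of-order behavior described above for patchbombs.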