From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 3F7C11F466;
	Tue, 4 Feb 2020 22:14:25 +0000 (UTC)
Date: Tue, 4 Feb 2020 22:14:25 +0000
From: Eric Wong
To: Luke Kenneth Casson Leighton
Cc: meta@public-inbox.org
Subject: Re: setting up mailman-to-atom-converter then atom-to-public-inbox
Message-ID: <20200204221425.GA7973@dcvr>
References: <20200204205541.GB27797@dcvr>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

Luke Kenneth Casson Leighton wrote:
> On Tue, Feb 4, 2020 at 9:05 PM Eric Wong wrote:
> > Luke Kenneth Casson Leighton wrote:
> >
> > > second: i have no idea how to go about setting it up :)
> >
> > Once installed, "public-inbox-init" should get you started.
> > From there, you can decide how you want to inject mail into
> > it...
>
> ahh exxcellent.... err... err.... man public-inbox-config only lists
> Maildir not mbox?

Ah, right now mbox is only supported for one-off initial imports
via scripts such as scripts/import_vger_from_mbox.

mbox is pretty bad for incremental updates, especially if
there are big rewrites going on for Status: flags setting off
inotify/EVFILT_VNODE.

I suppose it could be added, but Maildir is way easier and
faster for incremental updates, since deduplication can slow
things down a bit.
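To make the mbox-versus-Maildir point concrete, here is a minimal sketch of copying an mbox into a Maildir while skipping messages whose Message-ID was already seen. It uses Python's standard `mailbox` module, not anything from public-inbox itself; the function name `mbox_to_maildir` and the dedup-by-Message-ID rule are assumptions for illustration only, and public-inbox's own deduplication may well work differently.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy an mbox into a Maildir, skipping Message-IDs already present.

    Hypothetical helper: returns the number of messages newly added.
    """
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    try:
        # Remember what the Maildir already holds so reruns add nothing twice.
        seen = set()
        for msg in dst:
            mid = msg.get("Message-ID")
            if mid is not None:
                seen.add(str(mid).strip())

        added = 0
        for msg in src:
            mid = msg.get("Message-ID")
            if mid is not None:
                mid = str(mid).strip()
                if mid in seen:
                    continue  # duplicate, skip it
                seen.add(mid)
            # Each message becomes its own file, so nothing is rewritten.
            dst.add(msg)
            added += 1
        return added
    finally:
        src.close()
        dst.close()
```

Since every Maildir message is a separate file, a rerun only creates new files and never rewrites old ones, which is roughly why a watcher has an easier time with it than with one large mbox.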
> > > * cron job goes through the monthly mailman archives *by month*
> > >   performing a re-creation *only* of the latest month's atom feed
> > > * same cron job adds to a "global" atom file containing "links to the
> > >   monthly atom files"
> > > * public-inbox sees that list-of-monthly-atom-files
> > > * public-inbox walks the "tree" of monthly atom files, grabbing each one in turn
> > > * public-inbox loads all messages from all monthly atom files.
> >
> > s/atom/mbox/ and that's close to a planned feature.
>
> oh superb.
>
> > I'm not sure why the global index file is necessary, though,
> > since the tree structure is predictable (YYYY/MM or similar)
>
> i was imagining that there would be a way to reduce network traffic
> however i realise now that you're running the cron job actually on the
> machine, directly on the .mbox file.

Yeah. I was planning on supporting an HTTP(S)-based scraper
for pipermail and Google Groups, anyways, but time's been taken
up by other things.

> > public-inbox itself uses the Email::MIME module, which
> > unfortunately requires reading an entire RFC-2822 message into
> > memory (and we only work on one full message at a time).
>
> *shudder* :)

I get scary attachments, sometimes :<

> okaay, so i'm looking at man public-inbox-config, it says "only
> supports Maildir". grep the source, there's something about
> PublicInbox::Import.pm?

The supported/stable PublicInbox::V2Writable API mostly matches
the documented PublicInbox::Import one, and v2 is much better
for long-term use or big archives.
scripts/import_vger_from_mbox is probably a good example to
start with.

> ngggh how am i going to get mbox files in / watched?

I'm not sure it's necessary just yet. mbox for the initial
import and Maildir for incremental updates is probably the
easiest way to go in your case.

Eventually HTTPS downloads can be supported (maybe in a few
months or by the end-of-year), and that'll be mbox, anyways.

> thanks eric.

No prob :>
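
On the point above that a predictable tree (YYYY/MM or similar) makes a global index file unnecessary: a rough sketch of enumerating the monthly archive paths directly from a date range. The `YYYY/MM.mbox` layout, the `.mbox` suffix, and the function name `monthly_paths` are all assumptions for illustration; real pipermail or public-inbox layouts may differ.

```python
from datetime import date


def monthly_paths(start, end, suffix=".mbox"):
    """Yield 'YYYY/MM<suffix>' for every month from start to end, inclusive.

    Hypothetical layout: one archive file per month under a year directory.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{year:04d}/{month:02d}{suffix}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


if __name__ == "__main__":
    # A cron job would only need to re-fetch the last entry each run.
    for path in monthly_paths(date(2019, 11, 1), date(2020, 2, 4)):
        print(path)
```

Since the paths can be computed from the dates alone, a client only needs the first month of the archive and the current date to know what to fetch.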