From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id A224C1F466;
	Tue, 4 Feb 2020 20:55:41 +0000 (UTC)
Date: Tue, 4 Feb 2020 20:55:41 +0000
From: Eric Wong
To: Luke Kenneth Casson Leighton
Cc: meta@public-inbox.org
Subject: Re: setting up mailman-to-atom-converter then atom-to-public-inbox
Message-ID: <20200204205541.GB27797@dcvr>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

Luke Kenneth Casson Leighton wrote:
> hi, just as the subject says, i'm currently modifying mailman_rss to
> support atom and would like to set it up on libre-soc.org shortly.
>
> firstly: very grateful that public-inbox even exists, it is kinda
> important to have really, really simple offline archives of project
> mailing lists.

You're welcome :>

> second: i have no idea how to go about setting it up :)

Once installed, "public-inbox-init" should get you started. From
there, you can decide how you want to inject mail into it...

We should be able to clarify anything else here, just ask,
and we can try to make the docs better :>

Fwiw, I also started working on a mail flow diagram yesterday,
which may help: https://public-inbox.org/flow.txt

> third: sigh, i have two unknowns (three), because i am actually
> modifying mailman_rss to support atom, *and* i would prefer not to
> overload my server by splitting up the creation of atom feeds into
> multiple separate processing sections (by month) *and* i have no idea
> if public-inbox can support feeds-of-feeds.

This is your Mailman server?
If so, mbox or Maildir archives would be MUCH easier to convert
and it would preserve Message-Id, References, and In-Reply-To
headers for proper message threading.

public-inbox doesn't have any ability to parse Atom or RSS right
now, it only generates Atom. Parsing Atom (or RSS) would not
preserve headers necessary for proper threading, since Atom
threading headers (RFC4685) don't reliably map back to the
aforementioned mail headers.

> to explain / unpack that: here's how i would envisage the workflow so
> as to minimise the server load:
>
> * cron job goes through the monthly mailman archives *by month*
>   performing a re-creation *only* of the latest month's atom feed
> * same cron job adds to a "global" atom file containing "links to the
>   monthly atom files"
> * public-inbox sees that list-of-monthly-atom-files
> * public-inbox walks the "tree" of monthly atom files, grabbing each one in turn
> * public-inbox loads all messages from all monthly atom files.

s/atom/mbox/ and that's close to a planned feature. I'm not
sure why the global index file is necessary, though, since the
tree structure is predictable (YYYY/MM or similar)

Also, Konstantin wrote list-archive-maker.py which parses
pipermail archives:
https://public-inbox.org/meta/CAMwyc-T+QrzNhfgg1kQWTrKa26CeHvEd6BFahGiLC3PKOZJurw@mail.gmail.com/

> is this possible or does public-inbox expect one whopping monster
> resource-hogging beast-of-an-atom-file potentially hundreds of
> megabytes long? (the reason i ask all this is because the server i am
> running this on only has 1GB of RAM and i'm not going to be upgrading
> it as it costs money).

Totally understood, I'm constantly looking for ways to cut
memory use and refuse to upgrade my RAM or CPU.

Right now, public-inbox itself doesn't parse XML at all, only
some test cases do. If you're using a SAX parser for XML (e.g.
XML::SAX, XML::LibXML::SAX, ...), it should be able to stream
everything and not hold more than the contents of a single email
in memory at once. The existing mbox import APIs (e.g.
scripts/import_vger_from_mbox) work like that.

Internally, public-inbox tries to stream as much as possible to
save memory. More RAM still helps if you have slow storage
and/or big archives, though, especially with Xapian.

public-inbox itself uses the Email::MIME module, which
unfortunately requires reading an entire RFC-2822 message into
memory (and we only work on one full message at a time).

Beyond that, the message threading in the HTML output
(non-recursive JWZ-variant) works on a batch of 1000 message
skeletons (subset of headers), and few threads are that big.
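
To make the "one message in memory at a time" idea concrete, here is
a rough Python sketch of streaming monthly mbox files. It is not how
public-inbox does it (public-inbox is Perl and has its own import
code); the function names, the root/YYYY/MM.mbox layout, and the
sample data are all assumptions made up for illustration. It walks a
predictable monthly tree without any global index, splits each mbox
on its "From " separator lines, undoes mboxrd ">From " quoting, and
pulls out the three headers needed for threading.

```python
import email
import io
import re
from pathlib import Path
from typing import BinaryIO, Dict, Iterator, Optional

# mboxrd quotes body lines like "From ", ">From ", ... by adding one ">".
_RD_QUOTED = re.compile(rb">+From ")


def stream_mbox(fh: BinaryIO) -> Iterator[bytes]:
    """Yield raw messages one by one; only one message is buffered."""
    buf = None  # lines of the message currently being collected
    for line in fh:
        if line.startswith(b"From "):  # separator line starts a new message
            if buf is not None:
                yield b"".join(buf)
            buf = []
            continue
        if buf is None:
            continue  # ignore anything before the first separator
        if _RD_QUOTED.match(line):
            line = line[1:]  # undo one level of mboxrd quoting
        buf.append(line)
    if buf is not None:
        yield b"".join(buf)


def threading_headers(raw: bytes) -> Dict[str, Optional[str]]:
    """Return the headers that mail threading depends on."""
    msg = email.message_from_bytes(raw)
    names = ("Message-ID", "References", "In-Reply-To")
    return {name: msg.get(name) for name in names}


def monthly_mboxes(root: Path) -> Iterator[Path]:
    """Walk a hypothetical root/YYYY/MM.mbox layout in date order."""
    yield from sorted(root.glob("[0-9][0-9][0-9][0-9]/[0-9][0-9].mbox"))


# Made-up two-message sample in mboxrd form.
SAMPLE = (
    b"From mboxrd@z Thu Jan  1 00:00:00 1970\n"
    b"Message-ID: <a@example.com>\n"
    b"Subject: first\n"
    b"\n"
    b">From there, you can decide.\n"
    b"\n"
    b"From mboxrd@z Thu Jan  1 00:00:00 1970\n"
    b"Message-ID: <b@example.com>\n"
    b"In-Reply-To: <a@example.com>\n"
    b"References: <a@example.com>\n"
    b"Subject: Re: first\n"
    b"\n"
    b"reply body\n"
    b"\n"
)

for raw in stream_mbox(io.BytesIO(SAMPLE)):
    print(threading_headers(raw))
```

For real archives, each path from monthly_mboxes() would be opened
with open(path, "rb") and passed to stream_mbox(), so a cron job can
re-read only the latest month. Peak memory then follows the largest
single message rather than the size of the whole archive.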