user/dev discussion of public-inbox itself
From: Eric Wong <e@yhbt.net>
To: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Cc: meta@public-inbox.org
Subject: Re: setting up mailman-to-atom-converter then atom-to-public-inbox
Date: Tue, 4 Feb 2020 20:55:41 +0000
Message-ID: <20200204205541.GB27797@dcvr>
In-Reply-To: <CAPweEDz9wN+OO9twe3JJFYgi2_3yD+9X4s9Z0g2Hqjyf2F8=JA@mail.gmail.com>

Luke Kenneth Casson Leighton <lkcl@lkcl.net> wrote:
> hi, just as the subject says, i'm currently modifying mailman_rss to
> support atom and would like to set it up on libre-soc.org shortly.
> 
> firstly: very grateful that public-inbox even exists, it is kinda
> important to have really, really simple offline archives of project
> mailing lists.

You're welcome :>

> second: i have no idea how to go about setting it up :)

Once installed, "public-inbox-init" should get you started.
From there, you can decide how you want to inject mail into
it...

We should be able to clarify anything else here; just ask,
and we can try to make the docs better :>

FWIW, I also started working on a mail flow diagram yesterday,
which may help:

	https://public-inbox.org/flow.txt

> third: sigh, i have two unknowns (three), because i am actually
> modifying mailman_rss to support atom, *and* i would prefer not to
> overload my server by splitting up the creation of atom feeds into
> multiple separate processing sections (by month) *and* i have no idea
> if public-inbox can support feeds-of-feeds.

Is this your Mailman server?  If so, its mbox or Maildir
archives would be MUCH easier to convert, and converting from
them would preserve the Message-Id, References, and In-Reply-To
headers needed for proper message threading.
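
To make the point about headers concrete, here is a minimal
Python sketch (not public-inbox code) that pulls the three
threading headers out of one raw message using only the standard
library.  The sample message and the helper name
threading_headers() are made up for illustration.

```python
from email import message_from_string

THREAD_HEADERS = ("Message-ID", "In-Reply-To", "References")

def threading_headers(raw):
    """Return the threading headers of one raw RFC 2822 message."""
    msg = message_from_string(raw)
    # Header lookup on email.message.Message is case-insensitive,
    # so "Message-Id" and "Message-ID" are both found.
    return {name: msg[name] for name in THREAD_HEADERS}

# A made-up reply, as it might appear inside an mbox or Maildir archive.
sample = (
    "From: alice@example.org\n"
    "Message-Id: <reply-1@example.org>\n"
    "In-Reply-To: <root-1@example.org>\n"
    "References: <root-1@example.org>\n"
    "Subject: Re: hello\n"
    "\n"
    "body text\n"
)

headers = threading_headers(sample)
print(headers)
```

A header that is absent comes back as None, which is what a
converted Atom entry would typically leave you with.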

public-inbox doesn't have any ability to parse Atom or RSS right
now; it only generates Atom.

Parsing Atom (or RSS) would not preserve headers necessary for
proper threading, since Atom threading headers (RFC4685) don't
reliably map back to the aforementioned mail headers.

> to explain / unpack that: here's how i would envisage the workflow so
> as to minimise the server load:
> 
> * cron job goes through the monthly mailman archives *by month*
> performing a re-creation *only* of the latest month's atom feed
> * same cron job adds to a "global" atom file containing "links to the
> monthly atom files"
> * public-inbox sees that list-of-monthly-atom-files
> * public-inbox walks the "tree" of monthly atom files, grabbing each one in turn
> * public-inbox loads all messages from all monthly atom files.

s/atom/mbox/ and that's close to a planned feature.

I'm not sure why the global index file is necessary, though,
since the tree structure is predictable (YYYY/MM or similar).
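
As a rough illustration of "predictable", the list of monthly
paths can be computed from a start and end date alone.  This is
only a sketch: the "YYYY/MM" layout is the hypothetical one
mentioned above, and a real Mailman archive may name its monthly
files differently.

```python
from datetime import date

def monthly_paths(first, last):
    """Yield 'YYYY/MM' for every month from first to last, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield f"{year:04d}/{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1

paths = list(monthly_paths(date(2019, 11, 1), date(2020, 2, 4)))
print(paths)  # ['2019/11', '2019/12', '2020/01', '2020/02']
```

A cron job would only need to refetch the last entry of that
list, since older months no longer change.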

Also, Konstantin wrote list-archive-maker.py which parses
pipermail archives:
https://public-inbox.org/meta/CAMwyc-T+QrzNhfgg1kQWTrKa26CeHvEd6BFahGiLC3PKOZJurw@mail.gmail.com/

> is this possible or does public-inbox expect one whopping monster
> resource-hogging beast-of-an-atom-file potentially hundreds of
> megabytes long?  (the reason i ask all this is because the server i am
> running this on only has 1GB of RAM and i'm not going to be upgrading
> it as it costs money).

Totally understood, I'm constantly looking for ways to cut
memory use and refuse to upgrade my RAM or CPU.

Right now, public-inbox itself doesn't parse XML at all,
only some test cases do.

If you're using a SAX parser for XML (e.g. XML::SAX,
XML::LibXML::SAX, ...), it should be able to stream everything
and not hold more than the contents of a single email in memory
at once.
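
The same idea, sketched with Python's xml.sax instead of the Perl
modules named above (an assumption made purely for illustration;
none of this is public-inbox code).  The handler keeps only the
entry currently being parsed and passes it to a callback when the
closing tag arrives.  The class name EntryHandler and the sample
feed are made up.

```python
import xml.sax

class EntryHandler(xml.sax.ContentHandler):
    """Hand each Atom <entry> to a callback, then forget it."""

    def __init__(self, on_entry):
        super().__init__()
        self.on_entry = on_entry
        self.entry = None   # fields of the entry being parsed, if any
        self.field = None   # name of the child element being read
        self.text = []

    def startElement(self, name, attrs):
        if name == "entry":
            self.entry = {}
        elif self.entry is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        # May be called several times per element, so collect pieces.
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "entry":
            self.on_entry(self.entry)
            self.entry = None   # nothing from this entry is retained
        elif self.entry is not None and name == self.field:
            self.entry[name] = "".join(self.text)
            self.field = None

feed = b"""<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example list</title>
  <entry><id>urn:1</id><title>first</title></entry>
  <entry><id>urn:2</id><title>second</title></entry>
</feed>"""

# The demo collects entries in a list so they can be printed; a real
# callback would process each entry and let it go.
seen = []
xml.sax.parseString(feed, EntryHandler(seen.append))
print([e["title"] for e in seen])
```

For a large file, xml.sax.parse() accepts a filename or file
object and reads it incrementally rather than loading it whole.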

The existing mbox import APIs (e.g.
scripts/import_vger_from_mbox) work like that.
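
For illustration only (this is not taken from
scripts/import_vger_from_mbox), a one-message-at-a-time mbox
reader can be sketched in Python as a generator.  It assumes
body lines beginning with "From " have been escaped with ">", as
the mboxo and mboxrd variants do, and the name iter_mbox() is
made up.

```python
import io

def iter_mbox(fh):
    """Yield one raw message (bytes) at a time from a binary mbox stream."""
    buf = None  # lines of the current message; None before the first separator
    for line in fh:
        if line.startswith(b"From "):
            if buf is not None:
                yield b"".join(buf)
            buf = []            # the "From " separator line itself is dropped
        elif buf is not None:
            buf.append(line)
    if buf is not None:
        yield b"".join(buf)

sample = io.BytesIO(
    b"From a@example.org Tue Feb  4 18:42:00 2020\n"
    b"Message-ID: <1@example.org>\n"
    b"\n"
    b"first\n"
    b"\n"
    b"From b@example.org Tue Feb  4 20:55:41 2020\n"
    b"Message-ID: <2@example.org>\n"
    b"\n"
    b"second\n"
)

count = 0
for raw in iter_mbox(sample):
    count += 1  # a real importer would hand `raw` to its import step here
print(count)
```

Only the lines of the current message are buffered, so peak
memory tracks the largest single message, not the archive size.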

Internally, public-inbox tries to stream as much as possible to
save memory.  More RAM still helps if you have slow storage
and/or big archives, though, especially with Xapian.

public-inbox itself uses the Email::MIME module, which
unfortunately requires reading an entire RFC 2822 message into
memory (and we only work on one full message at a time).

Beyond that, the message threading in the HTML output
(non-recursive JWZ-variant) works on a batch of 1000 message
skeletons (subset of headers), and few threads are that big.
