user/dev discussion of public-inbox itself
From: Eric Wong <e@yhbt.net>
To: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Cc: meta@public-inbox.org
Subject: Re: setting up mailman-to-atom-converter then atom-to-public-inbox
Date: Tue, 4 Feb 2020 22:14:25 +0000
Message-ID: <20200204221425.GA7973@dcvr>
In-Reply-To: <CAPweEDw0yRq2tKRcumuxH523PLkvRb3saXVzM86LXYQReYKPaQ@mail.gmail.com>

Luke Kenneth Casson Leighton <lkcl@lkcl.net> wrote:
> On Tue, Feb 4, 2020 at 9:05 PM Eric Wong <e@yhbt.net> wrote:
> > Luke Kenneth Casson Leighton <lkcl@lkcl.net> wrote:
> >
> > > second: i have no idea how to go about setting it up :)
> >
> > Once installed, "public-inbox-init" should get you started.
> > From there, you can decide how you want to inject mail into
> > it...
> 
> ahh excellent....  err... err.... man public-inbox-config only lists
> Maildir not mbox?

Ah, right now mbox is only supported by one-off initial import
scripts such as scripts/import_vger_from_mbox.

mbox is pretty bad for incremental updates, especially if
there's big rewrites going on for Status: flags setting off
inotify/EVFILT_VNODE.

I suppose it could be added, but Maildir is way easier and
faster for incremental updates, since the deduplication needed
when re-reading an mbox can slow things down a bit.
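
To illustrate why deduplication comes up with mbox, here is a rough
Python sketch (not public-inbox code, which is Perl). It assumes that
a Message-ID match is enough to call a message a duplicate, which is a
simplification, and the helper name new_messages is made up for this
example. Every pass has to re-read the whole mbox and skip what was
already seen:

```python
import mailbox


def new_messages(mbox_path, seen_ids):
    """Yield messages from an mbox whose Message-ID is not in seen_ids.

    Hypothetical helper: the entire mbox is re-read on every call, and
    seen_ids (a set) is updated in place. Messages with no Message-ID
    are skipped in this sketch.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            mid = (msg.get("Message-ID") or "").strip()
            if not mid or mid in seen_ids:
                continue
            seen_ids.add(mid)
            yield msg
    finally:
        box.close()
```

The cost of each pass grows with the size of the mbox, not with the
number of new messages, which is the slowdown Maildir avoids.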

> > > * cron job goes through the monthly mailman archives *by month*
> > > performing a re-creation *only* of the latest month's atom feed
> > > * same cron job adds to a "global" atom file containing "links to the
> > > monthly atom files"
> > > * public-inbox sees that list-of-monthly-atom-files
> > > * public-inbox walks the "tree" of monthly atom files, grabbing each one in turn
> > > * public-inbox loads all messages from all monthly atom files.
> >
> > s/atom/mbox/ and that's close to a planned feature.
> 
> oh superb.
> 
> > I'm not sure why the global index file is necessary, though,
> > since the tree structure is predictable (YYYY/MM or similar)
> 
> i was imagining that there would be a way to reduce network traffic
> however i realise now that you're running the cron job actually on the
> machine, directly on the .mbox file.

Yeah.  I was planning on supporting an HTTP(S)-based scraper
for pipermail and Google Groups, anyways, but time's been taken
up by other things.
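
Since the monthly layout is predictable, a scraper could compute the
archive locations itself instead of fetching a global index. A small
Python sketch of that idea follows. It assumes Mailman 2 pipermail
naming of the form YYYY-Month.txt.gz; the base URL and the function
name are hypothetical, and other archivers lay things out differently:

```python
from datetime import date

# Fixed English month names, so the result does not depend on locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def monthly_archive_urls(base_url, first, last):
    """Return one archive URL per month from first to last, inclusive.

    Assumes pipermail-style names such as 2020-February.txt.gz.
    """
    urls = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        name = f"{year}-{MONTHS[month - 1]}.txt.gz"
        urls.append(f"{base_url.rstrip('/')}/{name}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


if __name__ == "__main__":
    # Hypothetical list URL, for illustration only.
    base = "https://lists.example.org/pipermail/demo"
    for url in monthly_archive_urls(base, date(2019, 12, 1), date(2020, 2, 4)):
        print(url)
```

Only the newest month would need re-fetching on each run; the older
months do not change.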

> > public-inbox itself uses the Email::MIME module, which
> > unfortunately requires reading an entire RFC-2822 message into
> > memory (and we only work on one full message at a time).
> 
> *shudder* :)

I get scary attachments, sometimes :<

> okaay, so i'm looking at man public-inbox-config, it says "only
> supports Maildir".  grep the source, there's something about
> PublicInbox::Import.pm?

The supported/stable PublicInbox::V2Writable API mostly matches
the documented PublicInbox::Import one, and v2 is much better
for long-term use or big archives.

scripts/import_vger_from_mbox is probably a good example to
start with.

> ngggh how am i going to get mbox files in / watched?

I'm not sure it's necessary, just yet.

mbox for the initial import, and Maildir for incremental updates
is probably the easiest way to go in your case.  Eventually
HTTPS downloads can be supported (maybe in a few months or by
the end-of-year), and that'll be mbox, anyways.
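
For the incremental half, anything that can drop a message into a
Maildir will do. As a minimal sketch, assuming Python's standard
mailbox module is acceptable for delivery (the path and the function
name deliver are made up here), each message becomes its own file
under new/, so nothing already delivered gets rewritten:

```python
import mailbox


def deliver(maildir_path, raw_message):
    """Add one raw RFC 2822 message (bytes) to a Maildir.

    Hypothetical helper. The Maildir is created if maildir_path does
    not exist yet. Returns the key of the newly added message.
    """
    md = mailbox.Maildir(maildir_path, create=True)
    try:
        # The message is written under tmp/ first and then moved into
        # new/, so a watcher never sees a partially written file.
        return md.add(raw_message)
    finally:
        md.close()
```

A cron job or a list subscription could call something like this for
each new message, leaving the mbox import as a one-time step.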

> thanks eric.

No prob :>

