user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
From: Leah Neukirchen <leah@vuxu.org>
To: meta@public-inbox.org
Subject: Two small issues when importing old archives
Date: Mon, 24 Feb 2020 21:45:00 +0100	[thread overview]
Message-ID: <87h7zfemur.fsf@vuxu.org> (raw)

Hi,

I've recently imported some sizable archives (~100k messages) of old
mailing lists and noticed some slight inconveniences:

1) RFC5322/822 invalid Date: headers should be parsed more gracefully

Some old mails had Date: headers without time zones, e.g.
Date: Sat, 27 Sep 1997 10:02:32

This results in public-inbox asserting this is the current date.
But this assumption makes no sense (literally every other guess
would be more likely), and also results in these messages showing up
on the first page of the archive.  Furthermore, sorting is then not
stable, pressing F5 make the threads jump around.  I'd recommend
falling back to +0000 instead.

2) Weird From: lines crash the whole import

From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de

This funny line broke import_maildir:

fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100
fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402
EOF from fast-import:  at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681.

I fixed it manually.  (But I think it's actually a valid mail address,
even in this botched state.)  I'm not sure what added the ">", it's
not in the original mail.

(I use public-inbox-1.3.0/git-2.25.0 on Void Linux.)

thx,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/

             reply	other threads:[~2020-02-24 20:45 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-24 20:45 Leah Neukirchen [this message]
2020-02-25  9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
2020-03-01 23:31   ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong
2020-02-25  9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
2020-02-26 10:21   ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h7zfemur.fsf@vuxu.org \
    --to=leah@vuxu.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).