From: Leah Neukirchen <leah@vuxu.org>
To: meta@public-inbox.org
Subject: Two small issues when importing old archives
Date: Mon, 24 Feb 2020 21:45:00 +0100
Message-ID: <87h7zfemur.fsf@vuxu.org>
Hi,
I've recently imported some sizable archives (~100k messages) of old
mailing lists and noticed some slight inconveniences:
1) RFC5322/822 invalid Date: headers should be parsed more gracefully
Some old mails had Date: headers without time zones, e.g.
Date: Sat, 27 Sep 1997 10:02:32
This results in public-inbox asserting this is the current date.
But this assumption makes no sense (literally every other guess
would be more likely), and also results in these messages showing up
on the first page of the archive. Furthermore, sorting is then not
stable: pressing F5 makes the threads jump around. I'd recommend
falling back to +0000 instead.
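To illustrate the suggested fallback, here is a minimal sketch in Python
(not public-inbox's Perl code). The function name parse_date_header is
made up for this example; it relies only on the standard library's
email.utils.parsedate_to_datetime, which returns a naive datetime when
the header carries no zone.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_date_header(value):
    """Parse a Date: header value; assume +0000 when the zone is missing.

    Hypothetical helper for illustration only. Returns None when the
    value cannot be parsed at all, so the caller can choose another
    fallback (for example a Received: header) instead of "now".
    """
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # No zone in the header: treat the wall-clock time as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


if __name__ == "__main__":
    print(parse_date_header("Sat, 27 Sep 1997 10:02:32"))
```

With this approach a zone-less date keeps its original year and sorts
deterministically, instead of landing on the first page of the archive.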
2) Weird From: lines crash the whole import
From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de
This funny line broke import_maildir:
fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100
fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402
EOF from fast-import: at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681.
I fixed it manually. (But I think it's actually a valid mail address,
even in this botched state.) I'm not sure what added the ">"; it's
not in the original mail.
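One way to avoid the crash is to strip the delimiter characters before
building the ident line, since git fast-import does not allow "<", ">"
or newlines inside the name or email fields. The sketch below is in
Python and is only an illustration; fast_import_ident and its
parameters are hypothetical names, not part of public-inbox.

```python
# Characters that git fast-import forbids inside the name and email fields.
_FORBIDDEN = str.maketrans("", "", "<>\n")


def fast_import_ident(name, email, epoch, tz):
    """Build a 'name <email> epoch tz' ident line for git fast-import.

    Hypothetical helper for illustration only: it drops '<', '>' and
    newlines from both fields so the only angle brackets left are the
    two that delimit the address.
    """
    clean_name = name.translate(_FORBIDDEN).strip()
    clean_email = email.translate(_FORBIDDEN).strip()
    return f"{clean_name} <{clean_email}> {epoch} {tz}"


if __name__ == "__main__":
    print(fast_import_ident(
        "Jochen",
        '"x <usenet"@jochen-kuepper.de',
        1101853296,
        "+0100",
    ))
```

This loses the stray "<" from the address, but the import continues and
the original message is still stored unmodified.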
(I use public-inbox-1.3.0/git-2.25.0 on Void Linux.)
thx,
--
Leah Neukirchen <leah@vuxu.org> https://leahneukirchen.org/
Thread overview: 5+ messages
2020-02-24 20:45 Leah Neukirchen [this message]
2020-02-25 9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
2020-03-01 23:31 ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong
2020-02-25 9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
2020-02-26 10:21 ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong