user/dev discussion of public-inbox itself
* Two small issues when importing old archives
From: Leah Neukirchen @ 2020-02-24 20:45 UTC
  To: meta

Hi,

I've recently imported some sizable archives (~100k messages) of old
mailing lists and noticed some slight inconveniences:

1) Date: headers that are invalid per RFC 5322/822 should be parsed
   more gracefully

Some old mails had Date: headers without time zones, e.g.
Date: Sat, 27 Sep 1997 10:02:32

This results in public-inbox assuming the message was sent at the
current date.  That assumption makes no sense (literally any other
guess would be more likely), and it also makes these messages show up
on the first page of the archive.  Furthermore, sorting is then not
stable: pressing F5 makes the threads jump around.  I'd recommend
falling back to +0000 instead.
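
To illustrate the suggested fallback, here is a minimal sketch in
Python (not public-inbox's own Perl code).  The helper name
parse_date_header is hypothetical; it relies only on the standard
library's email.utils.parsedate_to_datetime, which returns a naive
datetime when the header carries no usable offset.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_date_header(value: str) -> datetime:
    """Parse a Date: header value, assuming +0000 when no zone is given.

    Hypothetical helper for illustration only; not part of public-inbox.
    """
    dt = parsedate_to_datetime(value)
    if dt.tzinfo is None:
        # No usable offset in the header: fall back to UTC instead of
        # substituting the current time.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


print(parse_date_header("Sat, 27 Sep 1997 10:02:32").isoformat())
```

With this approach the timestamp depends only on the header itself, so
the sort order would stay the same across page reloads.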

2) Weird From: lines crash the whole import

From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de

This funny line broke import_maildir:

fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100
fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402
EOF from fast-import:  at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681.

I fixed it manually.  (But I think it's actually a valid mail address,
even in this botched state.)  I'm not sure what added the ">"; it's
not in the original mail.
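
One possible workaround, sketched in Python rather than in
public-inbox's Perl: strip stray "<" and ">" from both the name and
the address before building the ident string, so that git fast-import
sees exactly one pair of angle brackets.  The function name
sanitize_ident is hypothetical, and this is only an assumption about
how such a fix could look, not the actual implementation.

```python
# Characters that delimit the address in a fast-import ident string.
_ANGLE_BRACKETS = {ord("<"): None, ord(">"): None}


def sanitize_ident(name: str, email: str) -> str:
    """Return 'Name <email>' with no stray angle brackets inside.

    Hypothetical helper for illustration only; not part of public-inbox.
    """
    clean_name = name.translate(_ANGLE_BRACKETS).strip()
    clean_email = email.translate(_ANGLE_BRACKETS).strip()
    return f"{clean_name} <{clean_email}>"


ident = sanitize_ident(
    "=?iso-8859-1?Q?Jochen_K=FCpper?= usenet",
    '"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de',
)
print(ident)
```

This changes the stored address slightly (the embedded "<" is lost),
which trades fidelity for an import that does not abort.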

(I use public-inbox-1.3.0/git-2.25.0 on Void Linux.)

thx,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/

Thread overview: 5+ messages
2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen
2020-02-25  9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
2020-03-01 23:31   ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong
2020-02-25  9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
2020-02-26 10:21   ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
