user/dev discussion of public-inbox itself
 help / color / Atom feed
* Two small issues when importing old archives
@ 2020-02-24 20:45 Leah Neukirchen
  2020-02-25  9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
  2020-02-25  9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
  0 siblings, 2 replies; 5+ messages in thread
From: Leah Neukirchen @ 2020-02-24 20:45 UTC (permalink / raw)
  To: meta

Hi,

I've recently imported some sizable archives (~100k messages) of old
mailing lists and noticed some slight inconveniences:

1) RFC5322/822 invalid Date: headers should be parsed more gracefully

Some old mails had Date: headers without time zones, e.g.
Date: Sat, 27 Sep 1997 10:02:32

This results in public-inbox asserting this is the current date.
But this assumption makes no sense (literally every other guess
would be more likely), and also results in these messages showing up
on the first page of the archive.  Furthermore, sorting is then not
stable, pressing F5 make the threads jump around.  I'd recommend
falling back to +0000 instead.

2) Weird From: lines crash the whole import

From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de

This funny line broke import_maildir:

fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100
fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402
EOF from fast-import:  at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681.

I fixed it manually.  (But I think it's actually a valid mail address,
even in this botched state.)  I'm not sure what added the ">", it's
not in the original mail.

(I use public-inbox-1.3.0/git-2.25.0 on Void Linux.)

thx,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC] msgtime: do not require tz offset with Date::Parse fallback
  2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen
@ 2020-02-25  9:23 ` Eric Wong
  2020-03-01 23:31   ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong
  2020-02-25  9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
  1 sibling, 1 reply; 5+ messages in thread
From: Eric Wong @ 2020-02-25  9:23 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta, Eric W. Biederman

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi,
> 
> I've recently imported some sizable archives (~100k messages) of old
> mailing lists and noticed some slight inconveniences:

Thanks for the reports, will answer 2. separately.

> 1) RFC5322/822 invalid Date: headers should be parsed more gracefully
> 
> Some old mails had Date: headers without time zones, e.g.
> Date: Sat, 27 Sep 1997 10:02:32
> 
> This results in public-inbox asserting this is the current date.
> But this assumption makes no sense (literally every other guess
> would be more likely), and also results in these messages showing up
> on the first page of the archive.  Furthermore, sorting is then not
> stable, pressing F5 make the threads jump around.  I'd recommend
> falling back to +0000 instead.

I think a fallback to +0000 makes sense, too.
It's not a new bug in 1.3.0 (which makes Date::Parse optional).

Looks like that regression was introduced a while ago in
commit ae80a3fdb53d70142624f2691ed8ed84eddda66b
("MsgTime.pm: Use strptime to compute the time zone")

Cc-ing Eric W. Biederman in case he has any input on this.

Now, I'm not sure if this fallback is worth adding for users
without Date::Parse.  The non-Date::Parse path is a bit faster
and optimized for common (correct) dates...

------------8<------------
Subject: [RFC] msgtime: assume +0000 if TZ missing when using Date::Parse

Reported-by: Leah Neukirchen <leah@vuxu.org>
Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone")
---
 lib/PublicInbox/MsgTime.pm | 3 ++-
 t/msgtime.t                | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm
index 8eee9a75..8703d7bc 100644
--- a/lib/PublicInbox/MsgTime.pm
+++ b/lib/PublicInbox/MsgTime.pm
@@ -104,7 +104,8 @@ sub str2date_zone ($) {
 		# off is the time zone offset in seconds from GMT
 		my ($ss,$mm,$hh,$day,$month,$year,$off) =
 					Date::Parse::strptime($date);
-		return undef unless(defined $off);
+		return unless defined($year);
+		$off //= 0;
 
 		# Compute the time zone from offset
 		my $sign = ($off < 0) ? '-' : '+';
diff --git a/t/msgtime.t b/t/msgtime.t
index 5c4636a2..7c95e547 100644
--- a/t/msgtime.t
+++ b/t/msgtime.t
@@ -5,6 +5,8 @@ use warnings;
 use Test::More;
 use PublicInbox::MIME;
 use PublicInbox::MsgTime;
+use PublicInbox::TestCommon;
+
 our $received_date = 'Mon, 22 Jan 2007 13:16:24 -0500';
 sub datestamp ($) {
 	my ($date) = @_;
@@ -102,6 +104,11 @@ is_datestamp('Thu, 14 Dec 2006 00:20:24 +0480', [1166036424, '+0520']);
 is_datestamp('Thu, 14 Dec 2006 00:20:24 -0480', [1166074824, '-0520']);
 is_datestamp('Mon, 14 Apr 2014 07:59:01 -0007', [1397462761, '-0007']);
 
+SKIP: {
+	require_mods('Date::Parse', 1);
+	is_datestamp('Sat, 27 Sep 1997 10:02:32', [875354552, '+0000']);
+}
+
 # obsolete formats described in RFC2822
 for (qw(UT GMT Z)) {
 	is_datestamp('Fri, 02 Oct 1993 00:00:00 '.$_, [ 749520000, '+0000']);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* weird From: lines [was: Two small issues when importing old archives]
  2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen
  2020-02-25  9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
@ 2020-02-25  9:28 ` Eric Wong
  2020-02-26 10:21   ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong
  1 sibling, 1 reply; 5+ messages in thread
From: Eric Wong @ 2020-02-25  9:28 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> 2) Weird From: lines crash the whole import
> 
> From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de
> 
> This funny line broke import_maildir:
> 
> fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100
> fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402
> EOF from fast-import:  at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681.
> 
> I fixed it manually.  (But I think it's actually a valid mail address,
> even in this botched state.)  I'm not sure what added the ">", it's
> not in the original mail.
> 
> (I use public-inbox-1.3.0/git-2.25.0 on Void Linux.)

Gah, this looks like it's because Email::Address::XS leaves a
"<" in the name...   Perhaps Import should delete all [<>]
characters unconditionally? (or swap in appropriate Unicode
homographs and assume users have the necessary glyphs...)

---------8<----------
Subject: [RFC] t/address.t: dump failing case

"PublicInbox::Address" (w/o "PP") is Email::Address::XS 1.04
from Debian 10:

PublicInbox::Address names: $VAR1 = [
          '=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet'
        ];
PublicInbox::Address emails: $VAR1 = [
          '"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@example.de'
        ];
PublicInbox::AddressPP names: $VAR1 = [
          '=?iso-8859-1?Q?Jochen_K=FCpper?='
        ];
PublicInbox::AddressPP emails: $VAR1 = [
          'usenet"@example.de'
        ];
---
 t/address.t | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/t/address.t b/t/address.t
index 6f4bff6c..8c39f04b 100644
--- a/t/address.t
+++ b/t/address.t
@@ -14,6 +14,11 @@ sub test_pkg {
 		[$emails->('User <e@example.com>, e@example.org')],
 		'address extraction works as expected');
 
+	my $odd = '"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@example.de';
+	use Data::Dumper;
+	diag "$pkg names: " . Dumper([$names->($odd)]);
+	diag "$pkg emails: " . Dumper([$emails->($odd)]);
+
 	is_deeply(['user@example.com'],
 		[$emails->('<user@example.com (Comment)>')],
 		'comment after domain accepted before >');

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] import: drop '<' and '>' characters in addresses
  2020-02-25  9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
@ 2020-02-26 10:21   ` Eric Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2020-02-26 10:21 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Eric Wong <e@yhbt.net> wrote:
> Leah Neukirchen <leah@vuxu.org> wrote:
> > 2) Weird From: lines crash the whole import
> > 
> > From: "=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de
> > 
> > This funny line broke import_maildir:
> > 
> > fatal: Missing > in ident string: =?iso-8859-1?Q?Jochen_K=FCpper?= usenet <"=?iso-8859-1?Q?Jochen_K=FCpper?= <usenet"@jochen-kuepper.de> 1101853296 +0100
> > fast-import: dumping crash report to /var/lib/public-inbox/repositories/ding.git/fast_import_crash_31402
> > EOF from fast-import:  at /usr/share/perl5/vendor_perl/PublicInbox/Import.pm line 96, <$r> line 54681.
> > 
> > I fixed it manually.  (But I think it's actually a valid mail address,
> > even in this botched state.)  I'm not sure what added the ">", it's
> > not in the original mail.
> > 
> > (I use public-inbox-1.3.0/git-2.25.0 on Void Linux.)
> 
> Gah, this looks like it's because Email::Address::XS leaves a
> "<" in the name...   Perhaps Import should delete all [<>]
> characters unconditionally? (or swap in appropriate Unicode
> homographs and assume users have the necessary glyphs...)

So we already do `$name =~ tr/<>//d', so I think doing the same
with `$email' is appropiate for fast-import.  The "correct"
address featuring '<' will still be indexed in Xapian, at least.

-------------8<-------------
Subject: [PATCH] import: drop '<' and '>' characters in addresses

Some strange "From:" lines will cause Email::Address::XS to
leave '<' (and presumably '>') in the address which
git-fast-import won't accept even if quoted.  Workaround this
problem by deleting '<' and '>' the same way we delete them for
the ident name.

Reported-by: Leah Neukirchen <leah@vuxu.org>
Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
---
 lib/PublicInbox/Import.pm | 4 ++++
 t/import.t                | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index d8dc49b8..68dc0c7e 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -293,6 +293,10 @@ sub extract_cmt_info ($) {
 		}
 	}
 	if (defined $email) {
+		# Email::Address::XS may leave quoted '<' in addresses,
+		# which git-fast-import doesn't like
+		$email =~ tr/<>//d;
+
 		# quiet down wide character warnings with utf8::encode
 		utf8::encode($email);
 	} else {
diff --git a/t/import.t b/t/import.t
index e71dd714..b88d308e 100644
--- a/t/import.t
+++ b/t/import.t
@@ -55,6 +55,8 @@ $im->done;
 my @revs = $git->qx(qw(rev-list HEAD));
 is(scalar @revs, 1, 'one revision created');
 
+my $odd = '"=?iso-8859-1?Q?J_K=FCpper?= <usenet"@example.de';
+$mime->header_set('From', $odd);
 $mime->header_set('Message-ID', '<b@example.com>');
 $mime->header_set('Subject', 'msg2');
 like($im->add($mime, sub { $mime }), qr/\A:\d+\z/, 'added 2nd message');

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse
  2020-02-25  9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
@ 2020-03-01 23:31   ` Eric Wong
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Wong @ 2020-03-01 23:31 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta, Eric W. Biederman

Eric Wong <e@yhbt.net> wrote:
> Leah Neukirchen <leah@vuxu.org> wrote:
> > 1) RFC5322/822 invalid Date: headers should be parsed more gracefully
> >
> > Some old mails had Date: headers without time zones, e.g.
> > Date: Sat, 27 Sep 1997 10:02:32
> >
> > This results in public-inbox asserting this is the current date.
> > But this assumption makes no sense (literally every other guess
> > would be more likely), and also results in these messages showing up
> > on the first page of the archive.  Furthermore, sorting is then not
> > stable, pressing F5 make the threads jump around.  I'd recommend
> > falling back to +0000 instead.
>
> I think a fallback to +0000 makes sense, too.
> It's not a new bug in 1.3.0 (which makes Date::Parse optional).
>
> Looks like that regression was introduced a while ago in
> commit ae80a3fdb53d70142624f2691ed8ed84eddda66b
> ("MsgTime.pm: Use strptime to compute the time zone")
>
> Cc-ing Eric W. Biederman in case he has any input on this.

> ------------8<------------
> Subject: [RFC] msgtime: assume +0000 if TZ missing when using Date::Parse

Pushed as commit d857e7dc0d816b635a7ead09c3273f8c2d2434be
with a more descriptive commit message:

    msgtime: assume +0000 if TZ missing when using Date::Parse

    Some old emails don't have timezone offsets, since our
    Date::Parse code path takes a liberal interpretation of dates,
    fallback to using "+0000" as the timezone offset since it's
    closer to the actual date of the message than whatever the
    current date is.

    Reported-by: Leah Neukirchen <leah@vuxu.org>
    Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
    Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone")

Thanks

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, back to index

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-24 20:45 Two small issues when importing old archives Leah Neukirchen
2020-02-25  9:23 ` [RFC] msgtime: do not require tz offset with Date::Parse fallback Eric Wong
2020-03-01 23:31   ` [pushed] msgtime: assume +0000 if TZ missing when using Date::Parse Eric Wong
2020-02-25  9:28 ` weird From: lines [was: Two small issues when importing old archives] Eric Wong
2020-02-26 10:21   ` [PATCH] import: drop '<' and '>' characters in addresses Eric Wong

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror http://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git