Date: Tue, 25 Feb 2020 09:23:03 +0000
From: Eric Wong
To: Leah Neukirchen
Cc: meta@public-inbox.org, "Eric W. Biederman"
Subject: [RFC] msgtime: do not require tz offset with Date::Parse fallback
Message-ID: <20200225092303.GA382@dcvr>
References: <87h7zfemur.fsf@vuxu.org>
In-Reply-To: <87h7zfemur.fsf@vuxu.org>

Leah Neukirchen wrote:
> Hi,
>
> I've recently imported some sizable archives (~100k messages) of old
> mailing lists and noticed some slight inconveniences:

Thanks for the reports, will answer 2. separately.

> 1) RFC5322/822 invalid Date: headers should be parsed more gracefully
>
> Some old mails had Date: headers without time zones, e.g.
> Date: Sat, 27 Sep 1997 10:02:32
>
> This results in public-inbox asserting this is the current date.
> But this assumption makes no sense (literally every other guess
> would be more likely), and also results in these messages showing up
> on the first page of the archive. Furthermore, sorting is then not
> stable, pressing F5 make the threads jump around. I'd recommend
> falling back to +0000 instead.

I think a fallback to +0000 makes sense, too.

It's not a new bug in 1.3.0 (which makes Date::Parse optional).
Looks like that regression was introduced a while ago in
commit ae80a3fdb53d70142624f2691ed8ed84eddda66b
("MsgTime.pm: Use strptime to compute the time zone")

Cc-ing Eric W. Biederman in case he has any input on this.

Now, I'm not sure if this fallback is worth adding for users
without Date::Parse.
The non-Date::Parse path is a bit faster and optimized for
common (correct) dates...

------------8<------------
Subject: [RFC] msgtime: assume +0000 if TZ missing when using Date::Parse

Reported-by: Leah Neukirchen
Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone")
---
 lib/PublicInbox/MsgTime.pm | 3 ++-
 t/msgtime.t                | 7 +++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm
index 8eee9a75..8703d7bc 100644
--- a/lib/PublicInbox/MsgTime.pm
+++ b/lib/PublicInbox/MsgTime.pm
@@ -104,7 +104,8 @@ sub str2date_zone ($) {
 	# off is the time zone offset in seconds from GMT
 	my ($ss,$mm,$hh,$day,$month,$year,$off) = Date::Parse::strptime($date);
 
-	return undef unless(defined $off);
+	return unless defined($year);
+	$off //= 0;
 
 	# Compute the time zone from offset
 	my $sign = ($off < 0) ? '-' : '+';
diff --git a/t/msgtime.t b/t/msgtime.t
index 5c4636a2..7c95e547 100644
--- a/t/msgtime.t
+++ b/t/msgtime.t
@@ -5,6 +5,8 @@ use warnings;
 use Test::More;
 use PublicInbox::MIME;
 use PublicInbox::MsgTime;
+use PublicInbox::TestCommon;
+
 our $received_date = 'Mon, 22 Jan 2007 13:16:24 -0500';
 sub datestamp ($) {
 	my ($date) = @_;
@@ -102,6 +104,11 @@ is_datestamp('Thu, 14 Dec 2006 00:20:24 +0480', [1166036424, '+0520']);
 is_datestamp('Thu, 14 Dec 2006 00:20:24 -0480', [1166074824, '-0520']);
 is_datestamp('Mon, 14 Apr 2014 07:59:01 -0007', [1397462761, '-0007']);
 
+SKIP: {
+	require_mods('Date::Parse', 1);
+	is_datestamp('Sat, 27 Sep 1997 10:02:32', [875354552, '+0000']);
+}
+
 # obsolete formats described in RFC2822
 for (qw(UT GMT Z)) {
 	is_datestamp('Fri, 02 Oct 1993 00:00:00 '.$_, [ 749520000, '+0000']);
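
For anyone who wants to poke at the fallback outside of MsgTime.pm, here
is a rough standalone sketch of the idea. It is not the code in the tree:
fields_to_datestamp is a hypothetical helper name, it takes the fields in
the order shown in the strptime call above, and it assumes a 0-based month
and a year counted from 1900. It only needs Time::Local from core, so it
runs without Date::Parse installed.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper, not part of PublicInbox::MsgTime.
# Arguments follow the order ($ss,$mm,$hh,$day,$month,$year,$off);
# assumes $month is 0-based and $year is counted from 1900.
# Returns [ epoch_seconds, zone_string ], or nothing if there is no year.
sub fields_to_datestamp {
	my ($ss, $mm, $hh, $day, $month, $year, $off) = @_;

	# no year means the date itself could not be parsed, give up
	return unless defined($year);

	# zone missing from the header: assume +0000 rather than failing
	$off //= 0;

	# normalize the offset to [+-]HHMM, so that 4h80m becomes 0520
	my $sign = ($off < 0) ? '-' : '+';
	my $min = int(abs($off) / 60);
	my $zone = sprintf('%s%02d%02d', $sign, int($min / 60), $min % 60);

	# treat the fields as UTC, then subtract the offset
	my $ts = timegm(int($ss // 0), $mm // 0, $hh // 0,
			$day, $month, $year + 1900) - $off;
	return [ $ts, $zone ];
}

# "Sat, 27 Sep 1997 10:02:32" with no zone: the offset is undef
my $res = fields_to_datestamp(32, 2, 10, 27, 8, 97, undef);
print "$res->[0] $res->[1]\n"; # prints "875354552 +0000"
```

The defined($year) check is what keeps genuinely unparseable dates
rejected, while a missing zone alone no longer discards the result.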