user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* [PATCH 0/4] drop Date::Parse dependency
@ 2019-11-29 12:25 Eric Wong
  2019-11-29 12:25 ` [PATCH 1/4] git: async batch interface Eric Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Eric Wong @ 2019-11-29 12:25 UTC (permalink / raw)
  To: meta

I started working on the async batch interface for git many
months ago, but didn't have a good use case for it.  The
comparison test in msgtime_cmp gives me an excuse to start using
it :)

Spam email I don't care about, but there's some differences for
folks that send valid mails:

The new code gets tripped up by 4 digit dates from non-sensical
times (emails from 1904, really?), so maybe some adjustment
is necessary.

There's also some bogus dates from the year 71685, so that's
a case where falling back on the Received: header is a good
choice.

Maybe some more odd caes when checks are done running...

Eric Wong (4):
  git: async batch interface
  add msgtime_cmp maintainer test
  msgtime: drop Date::Parse for RFC2822
  Date::Parse is now optional

 INSTALL                    |   9 ++-
 MANIFEST                   |   2 +
 Makefile.PL                |   1 -
 TODO                       |   4 -
 ci/deps.perl               |   2 +-
 lib/PublicInbox/Admin.pm   |   2 +-
 lib/PublicInbox/Git.pm     |  94 +++++++++++++++++-----
 lib/PublicInbox/MDA.pm     |   5 +-
 lib/PublicInbox/MsgTime.pm | 118 +++++++++++++++++++++++----
 t/msgtime.t                |   7 ++
 xt/git_async_cmp.t         |  61 ++++++++++++++
 xt/msgtime_cmp.t           | 161 +++++++++++++++++++++++++++++++++++++
 12 files changed, 415 insertions(+), 51 deletions(-)
 create mode 100644 xt/git_async_cmp.t
 create mode 100644 xt/msgtime_cmp.t


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/4] git: async batch interface
  2019-11-29 12:25 [PATCH 0/4] drop Date::Parse dependency Eric Wong
@ 2019-11-29 12:25 ` Eric Wong
  2019-11-29 12:25 ` [PATCH 2/4] add msgtime_cmp maintainer test Eric Wong
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2019-11-29 12:25 UTC (permalink / raw)
  To: meta

This is a transitionary interface which does NOT require an
event loop.  It can be plugged into in current synchronous code
without major surgery.

It allows HTTP/1.1 pipelining-like functionality by taking
advantage of predictable and well-specified POSIX pipe semantics
by stuffing multiple git cat-file requests into the --batch pipe

With xt/git_async_cmp.t and GIANT_GIT_DIR=git.git, the async
interface is 10-25% faster than the synchronous interface since
it can keep the "git cat-file" process busier.

This is expected to improve performance on systems with slower
storage (but multiple cores).
---
 MANIFEST               |  1 +
 lib/PublicInbox/Git.pm | 94 ++++++++++++++++++++++++++++++++----------
 xt/git_async_cmp.t     | 61 +++++++++++++++++++++++++++
 3 files changed, 135 insertions(+), 21 deletions(-)
 create mode 100644 xt/git_async_cmp.t

diff --git a/MANIFEST b/MANIFEST
index 098e656d..ee7dd3e8 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -290,6 +290,7 @@ t/watch_maildir_v2.t
 t/www_listing.t
 t/xcpdb-reshard.t
 xt/git-http-backend.t
+xt/git_async_cmp.t
 xt/nntpd-validate.t
 xt/perf-msgview.t
 xt/perf-nntpd.t
diff --git a/lib/PublicInbox/Git.pm b/lib/PublicInbox/Git.pm
index 218846f3..90595840 100644
--- a/lib/PublicInbox/Git.pm
+++ b/lib/PublicInbox/Git.pm
@@ -16,6 +16,11 @@ use PublicInbox::Tmpfile;
 use base qw(Exporter);
 our @EXPORT_OK = qw(git_unquote git_quote);
 
+use constant MAX_INFLIGHT =>
+	($^O eq 'linux' ? 4096 : POSIX::_POSIX_PIPE_BUF())
+	/
+	65; # SHA-256 hex size + "\n" in preparation for git using non-SHA1
+
 my %GIT_ESC = (
 	a => "\a",
 	b => "\b",
@@ -124,10 +129,50 @@ sub _bidi_pipe {
 	$self->{$in} = $in_r;
 }
 
+sub read_cat_in_full ($$$) {
+	my ($self, $in, $left) = @_;
+	my $offset = 0;
+	my $buf = '';
+	while ($left > 0) {
+		my $r = read($in, $buf, $left, $offset);
+		defined($r) or fail($self, "read failed: $!");
+		$r == 0 and fail($self, 'exited unexpectedly');
+		$left -= $r;
+		$offset += $r;
+	}
+	my $r = read($in, my $lf, 1);
+	defined($r) or fail($self, "read failed: $!");
+	fail($self, 'newline missing after blob') if ($r != 1 || $lf ne "\n");
+	\$buf;
+}
+
+sub _cat_async_step ($$$) {
+	my ($self, $inflight, $in) = @_;
+	my $cb = shift @$inflight or die 'BUG: inflight empty';
+	local $/ = "\n";
+	my $head = $in->getline;
+	return eval { $cb->(undef) } if $head =~ / missing$/;
+
+	$head =~ /^([0-9a-f]{40}) (\S+) ([0-9]+)$/ or
+		fail($self, "Unexpected result from async git cat-file: $head");
+	my ($oid_hex, $type, $size) = ($1, $2, $3 + 0);
+	my $bref = read_cat_in_full($self, $in, $size);
+	eval { $cb->($bref, $oid_hex, $type, $size) };
+}
+
+sub cat_async_wait ($) {
+	my ($self) = @_;
+	my $inflight = delete $self->{inflight} or return;
+	my $in = $self->{in};
+	while (scalar(@$inflight)) {
+		_cat_async_step($self, $inflight, $in);
+	}
+}
+
 sub cat_file {
 	my ($self, $obj, $ref) = @_;
 	my ($retried, $in, $head);
-
+	cat_async_wait($self);
 again:
 	batch_prepare($self);
 	$self->{out}->print($obj, "\n") or fail($self, "write error: $!");
@@ -147,26 +192,8 @@ again:
 		fail($self, "Unexpected result from git cat-file: $head");
 
 	my $size = $1;
-	my $rv;
-	my $left = $size;
 	$$ref = $size if $ref;
-
-	my $offset = 0;
-	my $buf = '';
-	while ($left > 0) {
-		my $r = read($in, $buf, $left, $offset);
-		defined($r) or fail($self, "read failed: $!");
-		$r == 0 and fail($self, 'exited unexpectedly');
-		$left -= $r;
-		$offset += $r;
-	}
-	$rv = \$buf;
-
-	my $r = read($in, my $lf, 1);
-	defined($r) or fail($self, "read failed: $!");
-	fail($self, 'newline missing after blob') if ($r != 1 || $lf ne "\n");
-
-	$rv;
+	read_cat_in_full($self, $in, $size);
 }
 
 sub batch_prepare ($) { _bidi_pipe($_[0], qw(--batch in out pid)) }
@@ -206,9 +233,15 @@ sub _destroy {
 	waitpid($p, 0) if $@; # wait synchronously if not in event loop
 }
 
+sub cat_async_abort ($) {
+	my ($self) = @_;
+	my $inflight = delete $self->{inflight} or die 'BUG: not in async';
+	cleanup($self);
+}
+
 sub fail {
 	my ($self, $msg) = @_;
-	cleanup($self);
+	$self->{inflight} ? cat_async_abort($self) : cleanup($self);
 	die $msg;
 }
 
@@ -278,6 +311,25 @@ sub pub_urls {
 	local_nick($self);
 }
 
+sub cat_async_begin {
+	my ($self) = @_;
+	cleanup($self) if alternates_changed($self);
+	batch_prepare($self);
+	die 'BUG: already in async' if $self->{inflight};
+	$self->{inflight} = [];
+}
+
+sub cat_async ($$$) {
+	my ($self, $oid, $cb) = @_;
+	my $inflight = $self->{inflight} or die 'BUG: not in async';
+	if (scalar(@$inflight) >= MAX_INFLIGHT) {
+		_cat_async_step($self, $inflight, $self->{in});
+	}
+
+	$self->{out}->print($oid, "\n") or fail($self, "write error: $!");
+	push @$inflight, $cb;
+}
+
 sub commit_title ($$) {
 	my ($self, $oid) = @_; # PublicInbox::Git, $sha1hex
 	my $buf = cat_file($self, $oid) or return;
diff --git a/xt/git_async_cmp.t b/xt/git_async_cmp.t
new file mode 100644
index 00000000..f8ffe3d9
--- /dev/null
+++ b/xt/git_async_cmp.t
@@ -0,0 +1,61 @@
+#!perl -w
+# Copyright (C) 2019 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use Benchmark qw(:all);
+use Digest::SHA;
+use PublicInbox::TestCommon;
+my $git_dir = $ENV{GIANT_GIT_DIR};
+plan 'skip_all' => "GIANT_GIT_DIR not defined for $0" unless defined($git_dir);
+use_ok 'PublicInbox::Git';
+my $git = PublicInbox::Git->new($git_dir);
+my @cat = qw(cat-file --buffer --batch-check --batch-all-objects);
+if (require_git(2.19, 1)) {
+	push @cat, '--unordered';
+} else {
+	warn "git <2.19, cat-file lacks --unordered, locality suffers\n";
+}
+my @dig;
+my $nr = $ENV{NR} || 1;
+my $async = timeit($nr, sub {
+	my $dig = Digest::SHA->new(1);
+	my $cb = sub {
+		my ($bref) = @_;
+		$dig->add($$bref);
+	};
+	my $cat = $git->popen(@cat);
+	$git->cat_async_begin;
+
+	foreach (<$cat>) {
+		my ($oid, undef, undef) = split(/ /);
+		$git->cat_async($oid, $cb);
+	}
+	close $cat or die "cat: $?";
+	$git->cat_async_wait;
+	push @dig, ['async', $dig->hexdigest ];
+});
+
+my $sync = timeit($nr, sub {
+	my $dig = Digest::SHA->new(1);
+	my $cat = $git->popen(@cat);
+	foreach (<$cat>) {
+		my ($oid, undef, undef) = split(/ /);
+		my $bref = $git->cat_file($oid);
+		$dig->add($$bref);
+	}
+	close $cat or die "cat: $?";
+	push @dig, ['sync', $dig->hexdigest ];
+});
+
+ok(scalar(@dig) >= 2, 'got some digests');
+my $ref = shift @dig;
+my $exp = $ref->[1];
+isnt($exp, Digest::SHA->new(1)->hexdigest, 'not empty');
+foreach (@dig) {
+	is($_->[1], $exp, "digest matches $_->[0] <=> $ref->[0]");
+}
+diag "sync=".timestr($sync);
+diag "async=".timestr($async);
+done_testing;
+1;

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/4] add msgtime_cmp maintainer test
  2019-11-29 12:25 [PATCH 0/4] drop Date::Parse dependency Eric Wong
  2019-11-29 12:25 ` [PATCH 1/4] git: async batch interface Eric Wong
@ 2019-11-29 12:25 ` Eric Wong
  2019-11-29 12:25 ` [PATCH 3/4] msgtime: drop Date::Parse for RFC2822 Eric Wong
  2019-11-29 12:25 ` [PATCH 4/4] Date::Parse is now optional Eric Wong
  3 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2019-11-29 12:25 UTC (permalink / raw)
  To: meta

Changes will be coming for MsgTime to stop depending on
Date::Parse due to lack of package availability on OpenBSD
and suboptimal performance on RFC822 dates.
---
 MANIFEST         |   1 +
 xt/msgtime_cmp.t | 161 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 162 insertions(+)
 create mode 100644 xt/msgtime_cmp.t

diff --git a/MANIFEST b/MANIFEST
index ee7dd3e8..044bbfef 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -291,6 +291,7 @@ t/www_listing.t
 t/xcpdb-reshard.t
 xt/git-http-backend.t
 xt/git_async_cmp.t
+xt/msgtime_cmp.t
 xt/nntpd-validate.t
 xt/perf-msgview.t
 xt/perf-nntpd.t
diff --git a/xt/msgtime_cmp.t b/xt/msgtime_cmp.t
new file mode 100644
index 00000000..469756c6
--- /dev/null
+++ b/xt/msgtime_cmp.t
@@ -0,0 +1,161 @@
+#!perl -w
+# Copyright (C) 2019 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::MIME;
+use PublicInbox::Inbox;
+use PublicInbox::Git;
+use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
+use POSIX qw(strftime);
+eval { require Date::Parse } or plan skip_all => "Date::Parse missing for $0";
+my $git;
+my ($inboxdir, $git_dir) = @ENV{qw(GIANT_INBOX_DIR GIANT_GIT_DIR)};
+if (defined $inboxdir) {
+	my $ibx = { inboxdir => $inboxdir, name => 'name' };
+	$git = PublicInbox::Inbox->new($ibx)->git;
+} elsif (defined $git_dir) {
+	# sometimes just an old epoch is enough, since newer filters are nicer
+	$git = PublicInbox::Git->new($git_dir);
+} else {
+	plan skip_all => "GIANT_INBOX_DIR not set for $0";
+}
+my @cat = qw(cat-file --buffer --batch-check --batch-all-objects);
+if (require_git(2.19, 1)) {
+	push @cat, '--unordered';
+} else {
+	warn "git <2.19, cat-file lacks --unordered, locality suffers\n";
+}
+
+# millions of "ok" lines are noise, just show mismatches:
+sub quiet_is_deeply ($$$$$) {
+	my ($cur, $old, $func, $oid, $hdr) = @_;
+	if ((scalar(@$cur) != 2) ||
+		(scalar(@$old) == 2 &&
+			($old->[0] != $cur->[0]) ||
+			($old->[1] != $cur->[1]))) {
+		for ($cur, $old) {
+			$_->[2] = strftime('%Y-%m-%d %k:%M:%S', gmtime($_->[0]))
+		}
+		is_deeply($cur, $old, "$func $oid");
+		diag('got: ', explain($cur));
+		diag('exp: ', explain($old));
+		diag $hdr->as_string;
+	}
+}
+
+sub compare {
+	my ($bref, $oid, $type, $size) = @_;
+	local $SIG{__WARN__} = sub { diag "$oid: ", @_ };
+	my $mime = PublicInbox::MIME->new($$bref);
+	my $hdr = $mime->header_obj;
+	my @cur = msg_datestamp($hdr);
+	my @old = Old::msg_datestamp($hdr);
+	quiet_is_deeply(\@cur, \@old, 'datestamp', $oid, $hdr);
+	@cur = msg_timestamp($hdr);
+	@old = Old::msg_timestamp($hdr);
+	quiet_is_deeply(\@cur, \@old, 'timestamp', $oid, $hdr);
+}
+
+my $fh = $git->popen(@cat);
+$git->cat_async_begin;
+while (<$fh>) {
+	my ($oid, $type) = split / /;
+	next if $type ne 'blob';
+	$git->cat_async($oid, *compare);
+}
+$git->cat_async_wait;
+ok(1);
+done_testing;
+
+# old date/time-related functions
+package Old;
+use strict;
+use warnings;
+
+sub str2date_zone ($) {
+	my ($date) = @_;
+
+	my $ts = Date::Parse::str2time($date);
+	return undef unless(defined $ts);
+
+	# off is the time zone offset in seconds from GMT
+	my ($ss,$mm,$hh,$day,$month,$year,$off) = Date::Parse::strptime($date);
+	return undef unless(defined $off);
+
+	# Compute the time zone from offset
+	my $sign = ($off < 0) ? '-' : '+';
+	my $hour = abs(int($off / 3600));
+	my $min  = ($off / 60) % 60;
+
+	# deal with weird offsets like '-0420' properly
+	$min = 60 - $min if ($min && $off < 0);
+
+	my $zone = sprintf('%s%02d%02d', $sign, $hour, $min);
+
+	# "-1200" is the furthest westermost zone offset,
+	# but git fast-import is liberal so we use "-1400"
+	if ($zone >= 1400 || $zone <= -1400) {
+		warn "bogus TZ offset: $zone, ignoring and assuming +0000\n";
+		$zone = '+0000';
+	}
+	[$ts, $zone];
+}
+
+sub time_response ($) {
+	my ($ret) = @_;
+	wantarray ? @$ret : $ret->[0];
+}
+
+sub msg_received_at ($) {
+	my ($hdr) = @_; # Email::MIME::Header
+	my @recvd = $hdr->header_raw('Received');
+	my ($ts);
+	foreach my $r (@recvd) {
+		$r =~ /\s*([0-9]+\s+[a-zA-Z]+\s+[0-9]{2,4}\s+
+			[0-9]+[^0-9][0-9]+(?:[^0-9][0-9]+)
+			\s+([\+\-][0-9]+))/sx or next;
+		$ts = eval { str2date_zone($1) } and return $ts;
+		my $mid = $hdr->header_raw('Message-ID');
+		warn "no date in $mid Received: $r\n";
+	}
+	undef;
+}
+
+sub msg_date_only ($) {
+	my ($hdr) = @_; # Email::MIME::Header
+	my @date = $hdr->header_raw('Date');
+	my ($ts);
+	foreach my $d (@date) {
+		# Y2K problems: 3-digit years
+		$d =~ s!([A-Za-z]{3}) ([0-9]{3}) ([0-9]{2}:[0-9]{2}:[0-9]{2})!
+			my $yyyy = $2 + 1900; "$1 $yyyy $3"!e;
+		$ts = eval { str2date_zone($d) } and return $ts;
+		if ($@) {
+			my $mid = $hdr->header_raw('Message-ID');
+			warn "bad Date: $d in $mid: $@\n";
+		}
+	}
+	undef;
+}
+
+# Favors Received header for sorting globally
+sub msg_timestamp ($) {
+	my ($hdr) = @_; # Email::MIME::Header
+	my $ret;
+	$ret = msg_received_at($hdr) and return time_response($ret);
+	$ret = msg_date_only($hdr) and return time_response($ret);
+	wantarray ? (time, '+0000') : time;
+}
+
+# Favors the Date: header for display and sorting within a thread
+sub msg_datestamp ($) {
+	my ($hdr) = @_; # Email::MIME::Header
+	my $ret;
+	$ret = msg_date_only($hdr) and return time_response($ret);
+	$ret = msg_received_at($hdr) and return time_response($ret);
+	wantarray ? (time, '+0000') : time;
+}
+
+1;

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 3/4] msgtime: drop Date::Parse for RFC2822
  2019-11-29 12:25 [PATCH 0/4] drop Date::Parse dependency Eric Wong
  2019-11-29 12:25 ` [PATCH 1/4] git: async batch interface Eric Wong
  2019-11-29 12:25 ` [PATCH 2/4] add msgtime_cmp maintainer test Eric Wong
@ 2019-11-29 12:25 ` Eric Wong
  2019-11-29 12:25 ` [PATCH 4/4] Date::Parse is now optional Eric Wong
  3 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2019-11-29 12:25 UTC (permalink / raw)
  To: meta

Date::Parse is not optimized for RFC2822 dates and isn't
packaged on OpenBSD.  It's still useful for historical
email when email clients were less conformant, but is
less relevant for new emails.
---
 lib/PublicInbox/MsgTime.pm | 118 ++++++++++++++++++++++++++++++++-----
 t/msgtime.t                |   7 +++
 2 files changed, 109 insertions(+), 16 deletions(-)

diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm
index 7dec48ce..479aaa4e 100644
--- a/lib/PublicInbox/MsgTime.pm
+++ b/lib/PublicInbox/MsgTime.pm
@@ -7,28 +7,117 @@ use strict;
 use warnings;
 use base qw(Exporter);
 our @EXPORT_OK = qw(msg_timestamp msg_datestamp);
-use Date::Parse qw(str2time strptime);
+use Time::Local qw(timegm);
+my @MoY = qw(january february march april may june
+		july august september october november december);
+my %MoY;
+@MoY{@MoY} = (0..11);
+@MoY{map { substr($_, 0, 3) } @MoY} = (0..11);
+
+my %OBSOLETE_TZ = ( # RFC2822 4.3 (Obsolete Date and Time)
+	EST => '-0500', EDT => '-0400',
+	CST => '-0600', CDT => '-0500',
+	MST => '-0700', MDT => '-0600',
+	PST => '-0800', PDT => '-0700',
+	UT => '+0000', GMT => '+0000', Z => '+0000',
+
+	# RFC2822 states:
+	#   The 1 character military time zones were defined in a non-standard
+	#   way in [RFC822] and are therefore unpredictable in their meaning.
+);
+my $OBSOLETE_TZ = join('|', keys %OBSOLETE_TZ);
 
 sub str2date_zone ($) {
 	my ($date) = @_;
+	my ($ts, $zone);
 
-	my $ts = str2time($date);
-	return undef unless(defined $ts);
+	# RFC822 is most likely for email, but we can tolerate an extra comma
+	# or punctuation as long as all the data is there.
+	# We'll use '\s' since Unicode spaces won't affect our parsing.
+	# SpamAssassin ignores commas and redundant spaces, too.
+	if ($date =~ /(?:[A-Za-z]+,?\s+)? # day-of-week
+			([0-9]+),?\s+  # dd
+			([A-Za-z]+)\s+ # mon
+			([0-9]{2,})\s+ # YYYY or YY (or YYY :P)
+			([0-9]+)[:\.] # HH:
+				((?:[0-9]{2})|(?:\s?[0-9])) # MM
+				(?:[:\.]((?:[0-9]{2})|(?:\s?[0-9])))? # :SS
+			\s+	# a TZ offset is required:
+				([\+\-])? # TZ sign
+				[\+\-]* # I've seen extra "-" e.g. "--500"
+				([0-9]+|$OBSOLETE_TZ)(?:\s|$) # TZ offset
+			/xo) {
+		my ($dd, $m, $yyyy, $hh, $mm, $ss, $sign, $tz) =
+					($1, $2, $3, $4, $5, $6, $7, $8);
+		# don't accept non-English months
+		defined(my $mon = $MoY{lc($m)}) or return;
 
-	# off is the time zone offset in seconds from GMT
-	my ($ss,$mm,$hh,$day,$month,$year,$off) = strptime($date);
-	return undef unless(defined $off);
+		if (defined(my $off = $OBSOLETE_TZ{$tz})) {
+			$sign = substr($off, 0, 1);
+			$tz = substr($off, 1);
+		}
 
-	# Compute the time zone from offset
-	my $sign = ($off < 0) ? '-' : '+';
-	my $hour = abs(int($off / 3600));
-	my $min  = ($off / 60) % 60;
+		# Y2K problems: 3-digit years, follow RFC2822
+		if (length($yyyy) <= 3) {
+			$yyyy += 1900;
 
-	# deal with weird offsets like '-0420' properly
-	$min = 60 - $min if ($min && $off < 0);
+			# and 2-digit years from '09 (2009) (0..49)
+			$yyyy += 100 if $yyyy < 1950;
+		}
 
-	my $zone = sprintf('%s%02d%02d', $sign, $hour, $min);
+		$ts = timegm($ss // 0, $mm, $hh, $dd, $mon, $yyyy);
+
+		# Compute the time offset from [+-]HHMM
+		$tz //= 0;
+		my ($tz_hh, $tz_mm);
+		if (length($tz) == 1) {
+			$tz_hh = $tz;
+			$tz_mm = 0;
+		} elsif (length($tz) == 2) {
+			$tz_hh = 0;
+			$tz_mm = $tz;
+		} else {
+			$tz_hh = $tz;
+			$tz_hh =~ s/([0-9]{2})\z//;
+			$tz_mm = $1;
+		}
+		while ($tz_mm >= 60) {
+			$tz_mm -= 60;
+			$tz_hh += 1;
+		}
+		$sign //= '+';
+		my $off = $sign . ($tz_mm * 60 + ($tz_hh * 60 * 60));
+		$ts -= $off;
+		$sign = '+' if $off == 0;
+		$zone = sprintf('%s%02d%02d', $sign, $tz_hh, $tz_mm);
+
+	# Time::Zone and Date::Parse are part of the same distibution,
+	# and we need Time::Zone to deal with tz names like "EDT"
+	} elsif (eval { require Date::Parse }) {
+		$ts = Date::Parse::str2time($date);
+		return undef unless(defined $ts);
+
+		# off is the time zone offset in seconds from GMT
+		my ($ss,$mm,$hh,$day,$month,$year,$off) =
+					Date::Parse::strptime($date);
+		return undef unless(defined $off);
+
+		# Compute the time zone from offset
+		my $sign = ($off < 0) ? '-' : '+';
+		my $hour = abs(int($off / 3600));
+		my $min  = ($off / 60) % 60;
+
+		# deal with weird offsets like '-0420' properly
+		$min = 60 - $min if ($min && $off < 0);
+
+		$zone = sprintf('%s%02d%02d', $sign, $hour, $min);
+	} else {
+		warn "Date::Parse missing for non-RFC822 date: $date\n";
+		return undef;
+	}
 
+	# Note: we've already applied the offset to $ts at this point,
+	# but we want to keep "git fsck" happy.
 	# "-1200" is the furthest westermost zone offset,
 	# but git fast-import is liberal so we use "-1400"
 	if ($zone >= 1400 || $zone <= -1400) {
@@ -63,9 +152,6 @@ sub msg_date_only ($) {
 	my @date = $hdr->header_raw('Date');
 	my ($ts);
 	foreach my $d (@date) {
-		# Y2K problems: 3-digit years
-		$d =~ s!([A-Za-z]{3}) ([0-9]{3}) ([0-9]{2}:[0-9]{2}:[0-9]{2})!
-			my $yyyy = $2 + 1900; "$1 $yyyy $3"!e;
 		$ts = eval { str2date_zone($d) } and return $ts;
 		if ($@) {
 			my $mid = $hdr->header_raw('Message-ID');
diff --git a/t/msgtime.t b/t/msgtime.t
index cecbb921..1452dc97 100644
--- a/t/msgtime.t
+++ b/t/msgtime.t
@@ -97,4 +97,11 @@ is_datestamp('Wed, 20 Dec 2006 05:32:58 +0420', [1166577178, '+0420']);
 is_datestamp('Thu, 14 Dec 2006 00:20:24 +0480', [1166036424, '+0520']);
 is_datestamp('Thu, 14 Dec 2006 00:20:24 -0480', [1166074824, '-0520']);
 is_datestamp('Mon, 14 Apr 2014 07:59:01 -0007', [1397462761, '-0007']);
+
+# obsolete formats described in RFC2822
+for (qw(UT GMT Z)) {
+	is_datestamp('Fri, 02 Oct 1993 00:00:00 '.$_, [ 749520000, '+0000']);
+}
+is_datestamp('Fri, 02 Oct 1993 00:00:00 EDT', [ 749534400, '-0400']);
+
 done_testing();

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 4/4] Date::Parse is now optional
  2019-11-29 12:25 [PATCH 0/4] drop Date::Parse dependency Eric Wong
                   ` (2 preceding siblings ...)
  2019-11-29 12:25 ` [PATCH 3/4] msgtime: drop Date::Parse for RFC2822 Eric Wong
@ 2019-11-29 12:25 ` Eric Wong
  2019-12-01 22:04   ` [PATCH 5/4] msgtime: avoid obviously out-of-range dates (for now) Eric Wong
  3 siblings, 1 reply; 7+ messages in thread
From: Eric Wong @ 2019-11-29 12:25 UTC (permalink / raw)
  To: meta

-mda should not be dealing with broken Date: headers
nowadays, and deprioritize it in our documentation and
internal checks.
---
 INSTALL                  | 9 +++++----
 Makefile.PL              | 1 -
 TODO                     | 4 ----
 ci/deps.perl             | 2 +-
 lib/PublicInbox/Admin.pm | 2 +-
 lib/PublicInbox/MDA.pm   | 5 ++---
 6 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/INSTALL b/INSTALL
index 4d54e6a0..d41e513d 100644
--- a/INSTALL
+++ b/INSTALL
@@ -32,10 +32,6 @@ To accept incoming mail into a public inbox, you'll likely want:
 
 Beyond that, there is a long list of Perl modules required, starting with:
 
-* Date::Parse                      deb: libtimedate-perl
-                                   pkg: p5-TimeDate
-                                   rpm: perl-TimeDate
-
 * Digest::SHA                      typically installed with Perl
                                    rpm: perl-Digest-SHA
 
@@ -81,6 +77,11 @@ Numerous optional modules are likely to be useful as well:
                                    (speeds up process spawning on Linux,
                                     see public-inbox-daemon(8))
 
+- Date::Parse                      deb: libtimedate-perl
+                                   pkg: p5-TimeDate
+                                   rpm: perl-TimeDate
+                                   (for broken, mostly historical emails)
+
 - Plack::Middleware::ReverseProxy  deb: libplack-middleware-reverseproxy-perl
                                    pkg: p5-Plack-Middleware-ReverseProxy
                                    rpm: perl-Plack-Middleware-ReverseProxy
diff --git a/Makefile.PL b/Makefile.PL
index 4aa0caa7..c211bd5d 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -28,7 +28,6 @@ WriteMakefile(
 		# note: we use spamc(1), NOT the Perl modules
 		# We also depend on git.
 		# Keep this sorted and synced to the INSTALL document
-		'Date::Parse' => 0,
 
 		# libperl$PERL_VERSION,
 		# `perl5' on FreeBSD
diff --git a/TODO b/TODO
index 922163f8..592e306c 100644
--- a/TODO
+++ b/TODO
@@ -104,10 +104,6 @@ all need to be considered for everything we introduce)
 * imperfect scraper importers for obfuscated list archives
   (e.g. obfuscated Mailman stuff, Google Groups, etc...)
 
-* consider using HTTP::Date instead of Date::Parse, since we need the
-  former is capable of parsing RFC822-ish dates, used by Plack, and
-  the latter is missing from OpenBSD and maybe other distros.
-
 * improve performance and avoid head-of-line blocking on slow storage
 
 * share "git cat-file --batch" processes across inboxes to avoid
diff --git a/ci/deps.perl b/ci/deps.perl
index ae6083b9..330ba2f3 100755
--- a/ci/deps.perl
+++ b/ci/deps.perl
@@ -19,7 +19,6 @@ my $profiles = {
 	essential => [ qw(
 		git
 		perl
-		Date::Parse
 		Devel::Peek
 		Digest::SHA
 		Email::Simple
@@ -34,6 +33,7 @@ my $profiles = {
 
 	# everything optional for normal use
 	optional => [ qw(
+		Date::Parse
 		BSD::Resource
 		DBD::SQLite
 		DBI
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index dddeeae9..3d0d80b9 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -136,7 +136,7 @@ EOF
 }
 
 # TODO: make Devel::Peek optional, only used for daemon
-my @base_mod = qw(Email::MIME Date::Parse Devel::Peek);
+my @base_mod = qw(Email::MIME Devel::Peek);
 my @over_mod = qw(DBD::SQLite DBI);
 my %mod_groups = (
 	-index => [ @base_mod, @over_mod ],
diff --git a/lib/PublicInbox/MDA.pm b/lib/PublicInbox/MDA.pm
index b0dfac45..ef5e7dfa 100644
--- a/lib/PublicInbox/MDA.pm
+++ b/lib/PublicInbox/MDA.pm
@@ -6,7 +6,7 @@ package PublicInbox::MDA;
 use strict;
 use warnings;
 use Email::Simple;
-use Date::Parse qw(strptime);
+use PublicInbox::MsgTime;
 use constant MAX_SIZE => 1024 * 500; # same as spamc default, should be tunable
 use constant MAX_MID_SIZE => 244; # max term size - 1 in Xapian
 
@@ -51,8 +51,7 @@ sub usable_str {
 }
 
 sub usable_date {
-	my @t = eval { strptime(@_) };
-	scalar @t;
+	defined(eval { PublicInbox::MsgTime::str2date_zone($_[0]) });
 }
 
 sub alias_specified {

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 5/4] msgtime: avoid obviously out-of-range dates (for now)
  2019-11-29 12:25 ` [PATCH 4/4] Date::Parse is now optional Eric Wong
@ 2019-12-01 22:04   ` Eric Wong
  2019-12-12  3:42     ` Eric Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Wong @ 2019-12-01 22:04 UTC (permalink / raw)
  To: meta

Wacky dates show up in lore for valid messages.  Lets ignore
them and let future generations deal with Y10K and time-travel
problems.
---
 lib/PublicInbox/MsgTime.pm |  6 +++++-
 t/msgtime.t                | 14 ++++++++++++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm
index 479aaa4ecf132..9f4326442dd11 100644
--- a/lib/PublicInbox/MsgTime.pm
+++ b/lib/PublicInbox/MsgTime.pm
@@ -38,7 +38,7 @@ sub str2date_zone ($) {
 	if ($date =~ /(?:[A-Za-z]+,?\s+)? # day-of-week
 			([0-9]+),?\s+  # dd
 			([A-Za-z]+)\s+ # mon
-			([0-9]{2,})\s+ # YYYY or YY (or YYY :P)
+			([0-9]{2,4})\s+ # YYYY or YY (or YYY :P)
 			([0-9]+)[:\.] # HH:
 				((?:[0-9]{2})|(?:\s?[0-9])) # MM
 				(?:[:\.]((?:[0-9]{2})|(?:\s?[0-9])))? # :SS
@@ -67,6 +67,10 @@ sub str2date_zone ($) {
 
 		$ts = timegm($ss // 0, $mm, $hh, $dd, $mon, $yyyy);
 
+		# 4-digit dates in non-spam from 1900s and 1910s exist in
+		# lore archives
+		return if $ts < 0;
+
 		# Compute the time offset from [+-]HHMM
 		$tz //= 0;
 		my ($tz_hh, $tz_mm);
diff --git a/t/msgtime.t b/t/msgtime.t
index 1452dc97d5b0b..cecad775769e1 100644
--- a/t/msgtime.t
+++ b/t/msgtime.t
@@ -5,7 +5,7 @@ use warnings;
 use Test::More;
 use PublicInbox::MIME;
 use PublicInbox::MsgTime;
-
+our $received_date = 'Mon, 22 Jan 2007 13:16:24 -0500';
 sub datestamp ($) {
 	my ($date) = @_;
 	local $SIG{__WARN__} = sub {};  # Suppress warnings
@@ -17,7 +17,11 @@ sub datestamp ($) {
 			Subject => 'this is a subject',
 			'Message-ID' => '<a@example.com>',
 			Date => $date,
-			'Received' => '(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);\n\tMon, 22 Jan 2007 13:16:24 -0500',
+			'Received' => <<EOF,
+(majordomo\@vger.kernel.org) by vger.kernel.org via listexpand
+\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);
+\t$received_date
+EOF
 		],
 		body => "hello world\n",
 	    );
@@ -104,4 +108,10 @@ for (qw(UT GMT Z)) {
 }
 is_datestamp('Fri, 02 Oct 1993 00:00:00 EDT', [ 749534400, '-0400']);
 
+# fallback to Received: header if Date: is out-of-range:
+is_datestamp('Fri, 1 Jan 1904 10:12:31 +0100',
+	PublicInbox::MsgTime::str2date_zone($received_date));
+is_datestamp('Fri, 9 Mar 71685 18:45:56 +0000', # Y10K is not my problem :P
+	PublicInbox::MsgTime::str2date_zone($received_date));
+
 done_testing();

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 5/4] msgtime: avoid obviously out-of-range dates (for now)
  2019-12-01 22:04   ` [PATCH 5/4] msgtime: avoid obviously out-of-range dates (for now) Eric Wong
@ 2019-12-12  3:42     ` Eric Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2019-12-12  3:42 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> diff --git a/t/msgtime.t b/t/msgtime.t
> index 1452dc97d5b0b..cecad775769e1 100644
> --- a/t/msgtime.t
> +++ b/t/msgtime.t
<snip>
> @@ -17,7 +17,11 @@ sub datestamp ($) {
>  			Subject => 'this is a subject',
>  			'Message-ID' => '<a@example.com>',
>  			Date => $date,
> -			'Received' => '(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);\n\tMon, 22 Jan 2007 13:16:24 -0500',
> +			'Received' => <<EOF,
> +(majordomo\@vger.kernel.org) by vger.kernel.org via listexpand
> +\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);
> +\t$received_date
> +EOF

Oops, accidental interpolation of @1 :x  Will squash this before
pushing:

diff --git a/t/msgtime.t b/t/msgtime.t
index cecad775..98cf66e6 100644
--- a/t/msgtime.t
+++ b/t/msgtime.t
@@ -19,7 +19,7 @@ sub datestamp ($) {
 			Date => $date,
 			'Received' => <<EOF,
 (majordomo\@vger.kernel.org) by vger.kernel.org via listexpand
-\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);
+\tid S932173AbXAVSQY (ORCPT <rfc822;w\@1wt.eu>);
 \t$received_date
 EOF
 		],

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-12-12  3:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-29 12:25 [PATCH 0/4] drop Date::Parse dependency Eric Wong
2019-11-29 12:25 ` [PATCH 1/4] git: async batch interface Eric Wong
2019-11-29 12:25 ` [PATCH 2/4] add msgtime_cmp maintainer test Eric Wong
2019-11-29 12:25 ` [PATCH 3/4] msgtime: drop Date::Parse for RFC2822 Eric Wong
2019-11-29 12:25 ` [PATCH 4/4] Date::Parse is now optional Eric Wong
2019-12-01 22:04   ` [PATCH 5/4] msgtime: avoid obviously out-of-range dates (for now) Eric Wong
2019-12-12  3:42     ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).