user/dev discussion of public-inbox itself
 help / Atom feed
* Warnings from git fsck after lkml import
@ 2018-07-05  5:40 ebiederm
  2018-07-05 23:13 ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: ebiederm @ 2018-07-05  5:40 UTC (permalink / raw)
  To: meta; +Cc: Eric Wong


It looks like public-inbox has some challenges when importing some
questionable emails.  The import of lkml has resulted in several commits
with bad dates that git fsck complains about.  I have previously
reported this to Konstantin Ryabitsev who maintains kernel.org but since
I have not seen any discussion of this I thought I should report it
directly here as well.

At a practical level these errors initially preventing me from cloning
the repos as in .gitconfig I had:
> [transfer]
>         fsckobjects = true
> [fetch]
>         fsckobjects = true
> [receive]
>         fsckobjects = true


Beyond the cloning issue while I don't expect public-inbox to fix the
emails themselves it should be able to detect and prevent creating
buggy commits.

Importing a large repo like linux-kernel seems like a good test case for
finding these kinds of issues.

git fsck on lkml/git/0.git reports:
> Checking object directories: 100% (256/256), done.
> warning in commit 59173dc1fe67b113ace4ce83e7f522414b3e0404: badTimezone: invalid author/committer line - bad time zone
> warning in commit ff22aaff22eb4479e49e93f697e385f76db51c55: badTimezone: invalid author/committer line - bad time zone
> warning in commit 609b744909693f5f00aff5ed9928beeeee9ded2e: badTimezone: invalid author/committer line - bad time zone
> warning in commit 084572141db8e0d879428afb278bd338f2dbb053: badTimezone: invalid author/committer line - bad time zone
> warning in commit 789d204de27cd12c6da693d903390a241a1a4bca: badTimezone: invalid author/committer line - bad time zone
> warning in commit 0d9a65948b0c957007ca387cd56b690f9bab9c08: badTimezone: invalid author/committer line - bad time zone
> warning in commit f7468c42b4196ee6323afb373ab9323971c38d69: badTimezone: invalid author/committer line - bad time zone
> warning in commit 85e0cd6dd527cd55ad0440f14384529b83818228: badTimezone: invalid author/committer line - bad time zone
> warning in commit f31e19a2e772c9ed00728ef142af9c550ea5de6a: badTimezone: invalid author/committer line - bad time zone
> warning in commit 56eb7384443ef84e17e29504a304a071b189ae67: badTimezone: invalid author/committer line - bad time zone
> warning in commit e4470030471e6810414b9de5e3b52e16f2245d12: badTimezone: invalid author/committer line - bad time zone
> warning in commit f913b48caa097c3b2cb3f491707944f88d52d89f: badTimezone: invalid author/committer line - bad time zone
> warning in commit 4390f26923d572c6dab6cce8282c7cad5520d785: badTimezone: invalid author/committer line - bad time zone
> warning in commit 0f66db71a06bd7d651a0cd80877d8043b70fda20: badTimezone: invalid author/committer line - bad time zone
> warning in commit d71472c40b36dcdf0396afc9778f6137eea45887: badTimezone: invalid author/committer line - bad time zone
> warning in commit e8d3b19a91a2d86b6a91bd19dc811e851398b519: badTimezone: invalid author/committer line - bad time zone
> warning in commit afd9fc0cc87e56ed7736d633e17d0ef77817b3cc: badTimezone: invalid author/committer line - bad time zone
> warning in commit 811b3217708358cf1b75fba4602a64a426fce0f5: badTimezone: invalid author/committer line - bad time zone
> warning in commit e7a751a597c6f5e4770c61bdee6220d55a37cba9: badTimezone: invalid author/committer line - bad time zone
> warning in commit 3e32ad6192fe093e03e6b9346c3a90b16d9905c0: badTimezone: invalid author/committer line - bad time zone
> warning in commit 5e66b47528e79d3bbb769e137f036a1fa99cccf9: badTimezone: invalid author/committer line - bad time zone
> warning in commit d90d67d94ca47142670dff13fcb81ab7afab07bb: badTimezone: invalid author/committer line - bad time zone
> Checking objects: 100% (1711464/1711464), done.
> Checking connectivity: 1711464, done.

git fsck on lkml/git/1.git reports:
> Checking object directories: 100% (256/256), done.
> warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NUL byte in the commit object body
> Checking objects: 100% (1427013/1427013), done.

git fsck on lkml/git/2.git reports:
> Checking object directories: 100% (256/256), done.
> warning in commit 696b4d4e94dfc2c4a8518f642879ed74bc415ecd: badTimezone: invalid author/committer line - bad time zone
> Checking objects: 100% (1366098/1366098), done.

Eric




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Warnings from git fsck after lkml import
  2018-07-05  5:40 Warnings from git fsck after lkml import ebiederm
@ 2018-07-05 23:13 ` Eric Wong
  2018-07-06  0:36   ` ebiederm
  2018-07-12 18:31   ` Warnings from git fsck after lkml import Konstantin Ryabitsev
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Wong @ 2018-07-05 23:13 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> It looks like public-inbox has some challenges when importing some
> questionable emails.  The import of lkml has resulted in several commits
> with bad dates that git fsck complains about.  I have previously
> reported this to Konstantin Ryabitsev who maintains kernel.org but since
> I have not seen any discussion of this I thought I should report it
> directly here as well.

Thanks for bringing this up publically.

Yes, I early during v2 development I noticed old mails had some
-1400 timezone values (but the furthest is -1200).  I opted to
attempt to preserve the wonky timezones since fast-import
happily accepts -1400 and I didn't anticipate problems...

> At a practical level these errors initially preventing me from cloning
> the repos as in .gitconfig I had:
> > [transfer]
> >         fsckobjects = true
> > [fetch]
> >         fsckobjects = true
> > [receive]
> >         fsckobjects = true

...But I didn't know people cared to set those :x

Now I wonder if git should only warn for bad-but-still-usable
objects on clone, as I wouldn't consider a malformed date to be
on the level as actual FS corruption.  Or at least complete
the clone and fail with a special exit code.

> Beyond the cloning issue while I don't expect public-inbox to fix the
> emails themselves it should be able to detect and prevent creating
> buggy commits.

Right, the emails themselves have wonky dates.  I got public-inbox
to massage the dates into the bare minimum of what fast-import
finds acceptable(*).  fast-import is rather liberal.

> Importing a large repo like linux-kernel seems like a good test case for
> finding these kinds of issues.

Fwiw, linux.git and git.git both warn about missingTaggerEntry
on fsck, yet clone fine with fsckObjects=true.  Maybe clone
should not abort on badTimeZone, either.  *shrug*



(*) In retrospect, especially with v2 which requires SQLite/Xapian,
    I'm thinking it's not even worth the trouble to parse out
    authorship information for git commit headers.  Not sure if
    people would still use things like "git log --author=" for
    v2...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Warnings from git fsck after lkml import
  2018-07-05 23:13 ` Eric Wong
@ 2018-07-06  0:36   ` ebiederm
  2018-07-06  3:47     ` ebiederm
  2018-07-12 18:31   ` Warnings from git fsck after lkml import Konstantin Ryabitsev
  1 sibling, 1 reply; 13+ messages in thread
From: ebiederm @ 2018-07-06  0:36 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> It looks like public-inbox has some challenges when importing some
>> questionable emails.  The import of lkml has resulted in several commits
>> with bad dates that git fsck complains about.  I have previously
>> reported this to Konstantin Ryabitsev who maintains kernel.org but since
>> I have not seen any discussion of this I thought I should report it
>> directly here as well.
>
> Thanks for bringing this up publically.
>
> Yes, I early during v2 development I noticed old mails had some
> -1400 timezone values (but the furthest is -1200).  I opted to
> attempt to preserve the wonky timezones since fast-import
> happily accepts -1400 and I didn't anticipate problems...

I think 0.git was generated after your earlier fix.

Looking at the commits in question this is a different issue.
On some of the later ones I am really not certain what it is
but here is a representative sample you can look at.

email: Date: Wed, 13 Dec 2006 10:26:38 +1 
git:   Date:   Wed Dec 13 09:27:38 2006 +0001

email: Date: Sun, 10 Sep 2006 23:18:30 +1
git:   Date:   Sun Sep 10 22:19:30 2006 +0001

email: Date: Fri, 3 Feb 2006 18:11:22 -00
git:   Date:   Fri Feb 3 18:11:22 2006 +0000

email: Date: Mon, 06 Jun 2005 13:59:56 +1
git:   Date:   Mon Jun 6 13:00:56 2005 +0001

email: Date: Thursday, 20 Feb 2003 01:14:34 +000
git:   Date:   Thu Feb 20 01:14:34 2003 +0000

# The error in this one does not immediate stand out
commit: 0f66db71a06bd7d651a0cd80877d8043b70fda20
email: Date: Fri, 28 Jun 2002 12:54:40 -700
git:   Date:   Fri Jun 28 12:54:40 2002 -0700

commit e8d3b19a91a2d86b6a91bd19dc811e851398b519
email: Date: Sat, 12 Jan 2002 12:52:57 -200
git:   Date:   Sat Jan 12 12:52:57 2002 -0200

commit afd9fc0cc87e56ed7736d633e17d0ef77817b3cc
email: Date: Thu, 10 Jan 2002 09:56:29 -200
git:   Date:   Thu Jan 10 09:56:29 2002 -0200

commit 811b3217708358cf1b75fba4602a64a426fce0f5
email: Date: Mon, 05 Nov 2001 10:36:16 -800
git:   Date:   Mon Nov 5 10:36:16 2001 -0800

commit e7a751a597c6f5e4770c61bdee6220d55a37cba9
email: Date: Sun, 09 Sep 2001 09:59:05 -800
git:   Date:   Sun Sep 9 09:59:05 2001 -0800

commit 3e32ad6192fe093e03e6b9346c3a90b16d9905c0
email: Date: Monday, 09 Apr 2001 23:50:10 +000
git:   Date:   Mon Apr 9 23:50:10 2001 +0000



>>
>> At a practical level these errors initially preventing me from cloning
>> the repos as in .gitconfig I had:
>> > [transfer]
>> >         fsckobjects = true
>> > [fetch]
>> >         fsckobjects = true
>> > [receive]
>> >         fsckobjects = true
>
> ...But I didn't know people cared to set those :x
>
> Now I wonder if git should only warn for bad-but-still-usable
> objects on clone, as I wouldn't consider a malformed date to be
> on the level as actual FS corruption.  Or at least complete
> the clone and fail with a special exit code.

I have added:
[fsck]
        badTimezone = warn

To hopefully achieve that.


Eric



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Warnings from git fsck after lkml import
  2018-07-06  0:36   ` ebiederm
@ 2018-07-06  3:47     ` ebiederm
  2018-07-06 21:32       ` [PATCH] MsgTime.pm: Use strptime to compute the time zone ebiederm
  0 siblings, 1 reply; 13+ messages in thread
From: ebiederm @ 2018-07-06  3:47 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

ebiederm@xmission.com (Eric W. Biederman) writes:

> Eric Wong <e@80x24.org> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>> It looks like public-inbox has some challenges when importing some
>>> questionable emails.  The import of lkml has resulted in several commits
>>> with bad dates that git fsck complains about.  I have previously
>>> reported this to Konstantin Ryabitsev who maintains kernel.org but since
>>> I have not seen any discussion of this I thought I should report it
>>> directly here as well.
>>
>> Thanks for bringing this up publically.
>>
>> Yes, I early during v2 development I noticed old mails had some
>> -1400 timezone values (but the furthest is -1200).  I opted to
>> attempt to preserve the wonky timezones since fast-import
>> happily accepts -1400 and I didn't anticipate problems...
>
> I think 0.git was generated after your earlier fix.
>
> Looking at the commits in question this is a different issue.
> On some of the later ones I am really not certain what it is
> but here is a representative sample you can look at.

Except below is looking at the pretty output of git show.
To actually see the problem git show --format=raw is needed.

Which for commit 59173dc1fe67b113ace4ce83e7f522414b3e0404
shows me:
author Dieter Ferdinand <dieter.ferdinand@gmx.de> 1166001998 +1

Which makes it clear the ``timezone'' was passed straight through
without modification.  The date in the email was: "Date: Wed, 13 Dec 2006 10:26:38 +1"

And the problem is the timezone is not a 4 byte number.  I see the same
pattern with the rest of the bad time zone warnings.

So it should be straight forward if the timezone is not 4 digits to just not pass
the time zone through.

Eric

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] MsgTime.pm: Use strptime to compute the time zone
  2018-07-06  3:47     ` ebiederm
@ 2018-07-06 21:32       ` ebiederm
  2018-07-06 22:22         ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: ebiederm @ 2018-07-06 21:32 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta


Recently I had trouble cloning lkml/git/0.git because
git fsck on receive was failing.  The output of git fsck was:
> Checking object directories: 100% (256/256), done.
> warning in commit 59173dc1fe67b113ace4ce83e7f522414b3e0404: badTimezone: invalid author/committer line - bad time zone
> warning in commit ff22aaff22eb4479e49e93f697e385f76db51c55: badTimezone: invalid author/committer line - bad time zone
> warning in commit 609b744909693f5f00aff5ed9928beeeee9ded2e: badTimezone: invalid author/committer line - bad time zone
> warning in commit 084572141db8e0d879428afb278bd338f2dbb053: badTimezone: invalid author/committer line - bad time zone
> warning in commit 789d204de27cd12c6da693d903390a241a1a4bca: badTimezone: invalid author/committer line - bad time zone
> warning in commit 0d9a65948b0c957007ca387cd56b690f9bab9c08: badTimezone: invalid author/committer line - bad time zone
> warning in commit f7468c42b4196ee6323afb373ab9323971c38d69: badTimezone: invalid author/committer line - bad time zone
> warning in commit 85e0cd6dd527cd55ad0440f14384529b83818228: badTimezone: invalid author/committer line - bad time zone
> warning in commit f31e19a2e772c9ed00728ef142af9c550ea5de6a: badTimezone: invalid author/committer line - bad time zone
> warning in commit 56eb7384443ef84e17e29504a304a071b189ae67: badTimezone: invalid author/committer line - bad time zone
> warning in commit e4470030471e6810414b9de5e3b52e16f2245d12: badTimezone: invalid author/committer line - bad time zone
> warning in commit f913b48caa097c3b2cb3f491707944f88d52d89f: badTimezone: invalid author/committer line - bad time zone
> warning in commit 4390f26923d572c6dab6cce8282c7cad5520d785: badTimezone: invalid author/committer line - bad time zone
> warning in commit 0f66db71a06bd7d651a0cd80877d8043b70fda20: badTimezone: invalid author/committer line - bad time zone
> warning in commit d71472c40b36dcdf0396afc9778f6137eea45887: badTimezone: invalid author/committer line - bad time zone
> warning in commit e8d3b19a91a2d86b6a91bd19dc811e851398b519: badTimezone: invalid author/committer line - bad time zone
> warning in commit afd9fc0cc87e56ed7736d633e17d0ef77817b3cc: badTimezone: invalid author/committer line - bad time zone
> warning in commit 811b3217708358cf1b75fba4602a64a426fce0f5: badTimezone: invalid author/committer line - bad time zone
> warning in commit e7a751a597c6f5e4770c61bdee6220d55a37cba9: badTimezone: invalid author/committer line - bad time zone
> warning in commit 3e32ad6192fe093e03e6b9346c3a90b16d9905c0: badTimezone: invalid author/committer line - bad time zone
> warning in commit 5e66b47528e79d3bbb769e137f036a1fa99cccf9: badTimezone: invalid author/committer line - bad time zone
> warning in commit d90d67d94ca47142670dff13fcb81ab7afab07bb: badTimezone: invalid author/committer line - bad time zone
> Checking objects: 100% (1711464/1711464), done.
> Checking connectivity: 1711464, done.

Upon examination with git show --pretty=raw all of the problem commits
had a time zone that was not 4 digits long.  This time zone had been
passed straight from the Date line in the email into the author line
of the commit.

Looking into that I discovered that str2time takes into account the
time zone, and was actually able to process these weird time zones.

So get the normalized time zone with strptime and convert it from
seconds from gmt to hours and minutes from gmt.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

 lib/PublicInbox/MsgTime.pm | 41 ++++++++++--------
 t/msgtime.t                | 87 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+), 17 deletions(-)
 create mode 100644 t/msgtime.t

diff --git a/lib/PublicInbox/MsgTime.pm b/lib/PublicInbox/MsgTime.pm
index c67a41fff2ef..f3ebb6447a9c 100644
--- a/lib/PublicInbox/MsgTime.pm
+++ b/lib/PublicInbox/MsgTime.pm
@@ -5,19 +5,31 @@ use strict;
 use warnings;
 use base qw(Exporter);
 our @EXPORT_OK = qw(msg_timestamp msg_datestamp);
-use Date::Parse qw(str2time);
-use Time::Zone qw(tz_offset);
+use Date::Parse qw(str2time strptime);
+
+sub str2date_zone ($) {
+	my ($date) = @_;
+
+	my $ts = str2time($date);
+	return undef unless(defined $ts);
+
+	# off is the time zone offset in seconds from GMT
+	my ($ss,$mm,$hh,$day,$month,$year,$off) = strptime($date);
+	return undef unless(defined $off);
+
+	# Compute the time zone from offset
+	my $sign = ($off < 0) ? '-' : '+';
+	my $hour = abs(int($off / 3600));
+	my $min  = ($off / 60) % 60;
+	my $zone = sprintf('%s%02d%02d', $sign, $hour, $min);
 
-sub zone_clamp ($) {
-	my ($zone) = @_;
-	$zone ||= '+0000';
 	# "-1200" is the furthest westermost zone offset,
 	# but git fast-import is liberal so we use "-1400"
 	if ($zone >= 1400 || $zone <= -1400) {
 		warn "bogus TZ offset: $zone, ignoring and assuming +0000\n";
 		$zone = '+0000';
 	}
-	$zone;
+	[$ts, $zone];
 }
 
 sub time_response ($) {
@@ -28,37 +40,32 @@ sub time_response ($) {
 sub msg_received_at ($) {
 	my ($hdr) = @_; # Email::MIME::Header
 	my @recvd = $hdr->header_raw('Received');
-	my ($ts, $zone);
+	my ($ts);
 	foreach my $r (@recvd) {
-		$zone = undef;
 		$r =~ /\s*(\d+\s+[[:alpha:]]+\s+\d{2,4}\s+
 			\d+\D\d+(?:\D\d+)\s+([\+\-]\d+))/sx or next;
-		$zone = $2;
-		$ts = eval { str2time($1) } and last;
+		$ts = eval { str2date_zone($1) } and return $ts;
 		my $mid = $hdr->header_raw('Message-ID');
 		warn "no date in $mid Received: $r\n";
 	}
-	defined $ts ? [ $ts, zone_clamp($zone) ] : undef;
+	undef;
 }
 
 sub msg_date_only ($) {
 	my ($hdr) = @_; # Email::MIME::Header
 	my @date = $hdr->header_raw('Date');
-	my ($ts, $zone);
+	my ($ts);
 	foreach my $d (@date) {
-		$zone = undef;
 		# Y2K problems: 3-digit years
 		$d =~ s!([A-Za-z]{3}) (\d{3}) (\d\d:\d\d:\d\d)!
 			my $yyyy = $2 + 1900; "$1 $yyyy $3"!e;
-		$ts = eval { str2time($d) };
+		$ts = eval { str2date_zone($d) } and return $ts;
 		if ($@) {
 			my $mid = $hdr->header_raw('Message-ID');
 			warn "bad Date: $d in $mid: $@\n";
-		} elsif ($d =~ /\s+([\+\-]\d+)\s*\z/) {
-			$zone = $1;
 		}
 	}
-	defined $ts ? [ $ts, zone_clamp($zone) ] : undef;
+	undef;
 }
 
 # Favors Received header for sorting globally
diff --git a/t/msgtime.t b/t/msgtime.t
new file mode 100644
index 000000000000..c390670ae01f
--- /dev/null
+++ b/t/msgtime.t
@@ -0,0 +1,87 @@
+# Copyright (C) 2016-2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use PublicInbox::MIME;
+use PublicInbox::MsgTime;
+
+sub datestamp ($) {
+	my ($date) = @_;
+	local $SIG{__WARN__} = sub {};  # Suppress warnings
+	my $mime = PublicInbox::MIME->create(
+		header => [
+			From => 'a@example.com',
+			To => 'b@example.com',
+			'Content-Type' => 'text/plain',
+			Subject => 'this is a subject',
+			'Message-ID' => '<a@example.com>',
+			Date => $date,
+			'Received' => '(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);\n\tMon, 22 Jan 2007 13:16:24 -0500',
+		],
+		body => "hello world\n",
+	    );
+	my @ts = PublicInbox::MsgTime::msg_datestamp($mime->header_obj);
+	return \@ts;
+}
+
+sub timestamp ($) {
+	my ($received) = @_;
+	local $SIG{__WARN__} = sub {};  # Suppress warnings
+	my $mime = PublicInbox::MIME->create(
+		header => [
+			From => 'a@example.com',
+			To => 'b@example.com',
+			'Content-Type' => 'text/plain',
+			Subject => 'this is a subject',
+			'Message-ID' => '<a@example.com>',
+			Date => 'Fri, 02 Oct 1993 00:00:00 +0000',
+			'Received' => '(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S932173AbXAVSQY (ORCPT <rfc822;w@1wt.eu>);\n\t' . $received,
+		],
+		body => "hello world\n",
+	    );
+	my @ts = PublicInbox::MsgTime::msg_timestamp($mime->header_obj);
+	return \@ts;
+}
+
+# Verify that the parser sucks up the timezone for dates
+for (my $min = -1440; $min <= 1440; $min += 30) {
+	my $sign = ($min < 0) ? '-': '+';
+	my $h = abs(int($min / 60));
+	my $m = $min % 60;
+
+	my $ts_expect = 749520000 - ($min * 60);
+	my $tz_expect = sprintf('%s%02d%02d', $sign, $h, $m);
+	if ($tz_expect >= 1400 || $tz_expect <= -1400) {
+		$tz_expect = '+0000';
+	}
+	my $date = sprintf("Fri, 02 Oct 1993 00:00:00 %s%02d%02d",
+			   $sign, $h, $m);
+	my $result = datestamp($date);
+	is_deeply($result, [ $ts_expect, $tz_expect ]);
+}
+
+# Verify that the parser sucks up the timezone and for received timestamps
+for (my $min = -1440; $min <= 1440; $min += 30) {
+	my $sign = ($min < 0) ? '-' : '+';
+	my $h = abs(int($min / 60));
+	my $m = $min %60;
+
+	my $ts_expect = 1169471784 - ($min * 60);
+	my $tz_expect = sprintf('%s%02d%02d', $sign, $h, $m);
+	if ($tz_expect >= 1400 || $tz_expect <= -1400) {
+		$tz_expect = '+0000';
+	}
+	my $received = sprintf('Mon, 22 Jan 2007 13:16:24 %s%02d%02d',
+			       $sign, $h, $m);
+	is_deeply(timestamp($received), [ $ts_expect, $tz_expect ]);
+}
+
+is_deeply(datestamp('Wed, 13 Dec 2006 10:26:38 +1'), [1166001998, '+0100']);
+is_deeply(datestamp('Fri, 3 Feb 2006 18:11:22 -00'), [1138990282, '+0000']);
+is_deeply(datestamp('Thursday, 20 Feb 2003 01:14:34 +000'), [1045703674, '+0000']);
+is_deeply(datestamp('Fri, 28 Jun 2002 12:54:40 -700'), [1025294080, '-0700']);
+is_deeply(datestamp('Sat, 12 Jan 2002 12:52:57 -200'), [1010847177, '-0200']);
+is_deeply(datestamp('Mon, 05 Nov 2001 10:36:16 -800'), [1004985376, '-0800']);
+
+done_testing();
-- 
2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] MsgTime.pm: Use strptime to compute the time zone
  2018-07-06 21:32       ` [PATCH] MsgTime.pm: Use strptime to compute the time zone ebiederm
@ 2018-07-06 22:22         ` Eric Wong
  2018-07-07 18:18           ` ebiederm
  2018-07-07 18:22           ` [PATCH] Import: Don't copy nulls from emails into git ebiederm
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Wong @ 2018-07-06 22:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

Thanks, applied.  I wasn't convinced the problem was worth
the effort on my part, but clearly you were, so thank you :)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] MsgTime.pm: Use strptime to compute the time zone
  2018-07-06 22:22         ` Eric Wong
@ 2018-07-07 18:18           ` ebiederm
  2018-07-07 18:22           ` [PATCH] Import: Don't copy nulls from emails into git ebiederm
  1 sibling, 0 replies; 13+ messages in thread
From: ebiederm @ 2018-07-07 18:18 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> Thanks, applied.  I wasn't convinced the problem was worth
> the effort on my part, but clearly you were, so thank you :)

git fsck errors and generally generating bad git trees just rubs me the
wrong way.

Eric


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] Import: Don't copy nulls from emails into git
  2018-07-06 22:22         ` Eric Wong
  2018-07-07 18:18           ` ebiederm
@ 2018-07-07 18:22           ` ebiederm
  2018-07-08  0:07             ` Eric Wong
  1 sibling, 1 reply; 13+ messages in thread
From: ebiederm @ 2018-07-07 18:22 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta


Recently I ran git --git-dir=lkml/git/1.git fsck
and it reported:
> warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte inthe commit object body

Which I found quite scary.  Nulls in the wrong place have a bad tendency
to make programs misbehave.

It turns out someone had placed "=?iso-8859-1?q?=00?=" at the end of
their subject line.  Which is the mime encoding for NULL.  Email::Mime
had correctly decoded the header, and then public-inbox had simply
copied the contents of the header into the subject line of the git
commit.

To prevent that from causing problems replace nulls in such subject
lines with spaces.

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
---
 lib/PublicInbox/Import.pm |  2 ++
 t/nulsubject.t            | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)
 create mode 100644 t/nulsubject.t

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 250a2db31e97..8c1819209e58 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -405,6 +405,8 @@ sub add {
 		print $w "reset $ref\n" or wfail;
 	}
 
+	# Mime decoding can create nulls replace them with spaces to protect git
+	$subject =~ s/\0/ /;
 	utf8::encode($subject);
 	print $w "commit $ref\nmark :$commit\n",
 		"author $name <$email> $author_time_raw\n",
diff --git a/t/nulsubject.t b/t/nulsubject.t
new file mode 100644
index 000000000000..bb05be8589e7
--- /dev/null
+++ b/t/nulsubject.t
@@ -0,0 +1,33 @@
+# Copyright (C) 2016-2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use File::Temp qw/tempdir/;
+
+use_ok 'PublicInbox::Import';
+use_ok 'PublicInbox::Git';
+my $tmpdir = tempdir('pi-nulsubject-XXXXXX', TMPDIR => 1, CLEANUP => 1);
+my $git_dir = "$tmpdir/a.git";
+
+{
+	is(system(qw(git init -q --bare), $git_dir), 0, 'git init ok');
+	my $git = PublicInbox::Git->new($git_dir);
+	my $im = PublicInbox::Import->new($git, 'testbox', 'test@example');
+	$im->add(Email::MIME->create(
+		header => [
+			From => 'a@example.com',
+			To => 'b@example.com',
+			'Content-Type' => 'text/plain',
+			Subject => ' A subject line with a null =?iso-8859-1?q?=00?= see!',
+			'Message-ID' => '<null-test.a@example.com>',
+		],
+		body => "hello world\n",
+	));
+	$im->done;
+	is(system(qw(git --git-dir), $git_dir, 'fsck', '--strict'), 0, 'git fsck ok');
+}
+
+done_testing();
+
+1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Import: Don't copy nulls from emails into git
  2018-07-07 18:22           ` [PATCH] Import: Don't copy nulls from emails into git ebiederm
@ 2018-07-08  0:07             ` Eric Wong
  2018-07-08  1:52               ` ebiederm
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2018-07-08  0:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> 
> Recently I ran git --git-dir=lkml/git/1.git fsck
> and it reported:
> > warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte inthe commit object body
> 
> Which I found quite scary.  Nulls in the wrong place have a bad tendency
> to make programs misbehave.

Thanks.  In this case, it's a bit weird for fast-import to
accept this...

> +	# Mime decoding can create nulls replace them with spaces to protect git
> +	$subject =~ s/\0/ /;

\0 could appear more than once, so it needs a 'g', at least.
I squashed in a change to use "tr/\0/ /" instead
(tr// should be faster)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Import: Don't copy nulls from emails into git
  2018-07-08  0:07             ` Eric Wong
@ 2018-07-08  1:52               ` ebiederm
  0 siblings, 0 replies; 13+ messages in thread
From: ebiederm @ 2018-07-08  1:52 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> 
>> Recently I ran git --git-dir=lkml/git/1.git fsck
>> and it reported:
>> > warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte inthe commit object body
>> 
>> Which I found quite scary.  Nulls in the wrong place have a bad tendency
>> to make programs misbehave.
>
> Thanks.  In this case, it's a bit weird for fast-import to
> accept this...
>
>> +	# Mime decoding can create nulls replace them with spaces to protect git
>> +	$subject =~ s/\0/ /;
>
> \0 could appear more than once, so it needs a 'g', at least.
> I squashed in a change to use "tr/\0/ /" instead
> (tr// should be faster)

Thank you.

I had thought of doing that but I forgot to come back and see if there
were any tweaks like that I needed to do.

Sigh. My perl is currently quite rusty.

Eric


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Warnings from git fsck after lkml import
  2018-07-05 23:13 ` Eric Wong
  2018-07-06  0:36   ` ebiederm
@ 2018-07-12 18:31   ` Konstantin Ryabitsev
  2018-07-12 22:19     ` ebiederm
  2018-07-12 22:29     ` Eric Wong
  1 sibling, 2 replies; 13+ messages in thread
From: Konstantin Ryabitsev @ 2018-07-12 18:31 UTC (permalink / raw)
  To: Eric Wong; +Cc: Eric W. Biederman, meta

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On Thu, Jul 05, 2018 at 11:13:46PM +0000, Eric Wong wrote:
>"Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> It looks like public-inbox has some challenges when importing some
>> questionable emails.  The import of lkml has resulted in several commits
>> with bad dates that git fsck complains about.  I have previously
>> reported this to Konstantin Ryabitsev who maintains kernel.org but since
>> I have not seen any discussion of this I thought I should report it
>> directly here as well.
>
>Thanks for bringing this up publically.
>
>Yes, I early during v2 development I noticed old mails had some
>-1400 timezone values (but the furthest is -1200).  I opted to
>attempt to preserve the wonky timezones since fast-import
>happily accepts -1400 and I didn't anticipate problems...

So, I can fix those in the archives, but this obviously requires 
rebasing the whole repo, and I'm not sure what kind of impact that would 
have. I'm assuming it's not sufficient to just fix the git repo, as all 
commit IDs after the modified commit are going to be different -- so 
additional changes to sqlite and xapian dbs would be required?

-K

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Warnings from git fsck after lkml import
  2018-07-12 18:31   ` Warnings from git fsck after lkml import Konstantin Ryabitsev
@ 2018-07-12 22:19     ` ebiederm
  2018-07-12 22:29     ` Eric Wong
  1 sibling, 0 replies; 13+ messages in thread
From: ebiederm @ 2018-07-12 22:19 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Eric Wong, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

> On Thu, Jul 05, 2018 at 11:13:46PM +0000, Eric Wong wrote:
>>"Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>> It looks like public-inbox has some challenges when importing some
>>> questionable emails.  The import of lkml has resulted in several commits
>>> with bad dates that git fsck complains about.  I have previously
>>> reported this to Konstantin Ryabitsev who maintains kernel.org but since
>>> I have not seen any discussion of this I thought I should report it
>>> directly here as well.
>>
>>Thanks for bringing this up publically.
>>
>>Yes, I early during v2 development I noticed old mails had some
>>-1400 timezone values (but the furthest is -1200).  I opted to
>>attempt to preserve the wonky timezones since fast-import
>>happily accepts -1400 and I didn't anticipate problems...
>
> So, I can fix those in the archives, but this obviously requires
> rebasing the whole repo, and I'm not sure what kind of impact that
> would have. I'm assuming it's not sufficient to just fix the git repo,
> as all commit IDs after the modified commit are going to be different
> -- so additional changes to sqlite and xapian dbs would be required?

Unless I am mistaken the cheap/clever version is to
- Rebuild the 3 .git trees.
- Notice that the object id's (aka sha1 hashes) of the emails remains
  the same.
- Use sqlite3 to update the meta table of msgmap.sqlite3

My currently msgmap.sqlite3 contains:
CREATE TABLE meta (key VARCHAR(32) PRIMARY KEY, val VARCHAR(255) NOT NULL);
/* No STAT tables available */
sqlite> select * from meta;
created_at|1530525399
last_xap15-6|c8f95c6728579303c200adbfb5469215da7e7836
last_xap15-5|31ed379430c456f90bdd172b223020c0e6d7cb8d
last_xap15-4|88294f6d487193f5984791ee81213a25130d0559
last_xap15-3|93d9eace2721494d8457c7f5f6de803c0d648172
last_xap15-2|d48078ceeec1f51313253a56ed3ba0eae7fde909
last_xap15-1|6b67b9f5e0cd82d3c734e6cdc44c1f722ab6fb6a
last_xap15-0|b67bf7f62c8125d67461cc6e7d1736ddc8844a18

Which matches the HEAD commits the lkml git trees.
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/0.git show --pretty=oneline | head  -1
b67bf7f62c8125d67461cc6e7d1736ddc8844a18 [-mm patch] drivers/firewire/: cleanups
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/1.git show --pretty=oneline | head  -1
6b67b9f5e0cd82d3c734e6cdc44c1f722ab6fb6a Re: [git patches] libata updates for 2.6.34
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/2.git show --pretty=oneline | head  -1
d48078ceeec1f51313253a56ed3ba0eae7fde909 Re: linux-next: Tree for Jan 10 (staging/sb105x)
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/3.git show --pretty=oneline | head  -1
93d9eace2721494d8457c7f5f6de803c0d648172 Re: randconfig bug: ARM/KVM link error in hyp_idmap section
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/4.git show --pretty=oneline | head  -1
88294f6d487193f5984791ee81213a25130d0559 Re: [PATCH 2/2] sdhci-of-arasan: Set controller to test mode when fails-without-test-cd is present
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/5.git show --pretty=oneline | head  -1
31ed379430c456f90bdd172b223020c0e6d7cb8d Re: [PATCH 0/2] of: change overlay apply input data from EDT to FDT
eric@x220:~/public-inbox/vger.kernel.org/linux-kernel-good$ git --git-dir=git/6.git show --pretty=oneline | head  -1
c8f95c6728579303c200adbfb5469215da7e7836 [PATCH] slimbus: stream: Fix htmldocs warnings

However all you have to do is ensure you preserve msgmap.sqlite3 and
public-inbox-index is capable of rebuilding everything else.


Eric


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Warnings from git fsck after lkml import
  2018-07-12 18:31   ` Warnings from git fsck after lkml import Konstantin Ryabitsev
  2018-07-12 22:19     ` ebiederm
@ 2018-07-12 22:29     ` Eric Wong
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Wong @ 2018-07-12 22:29 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Eric W. Biederman, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Jul 05, 2018 at 11:13:46PM +0000, Eric Wong wrote:
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> > > It looks like public-inbox has some challenges when importing some
> > > questionable emails.  The import of lkml has resulted in several commits
> > > with bad dates that git fsck complains about.  I have previously
> > > reported this to Konstantin Ryabitsev who maintains kernel.org but since
> > > I have not seen any discussion of this I thought I should report it
> > > directly here as well.
> > 
> > Thanks for bringing this up publically.
> > 
> > Yes, I early during v2 development I noticed old mails had some
> > -1400 timezone values (but the furthest is -1200).  I opted to
> > attempt to preserve the wonky timezones since fast-import
> > happily accepts -1400 and I didn't anticipate problems...
> 
> So, I can fix those in the archives, but this obviously requires rebasing
> the whole repo, and I'm not sure what kind of impact that would have. I'm
> assuming it's not sufficient to just fix the git repo, as all commit IDs
> after the modified commit are going to be different -- so additional changes
> to sqlite and xapian dbs would be required?

Yes, I think the internal "purge" and normal add operation
should take care of Xapian/SQLite changes.  NNTP serial numbers
will change and readers will redownload a few messages, though.

Personally, I wouldn't bother since it'd be disruptive to
existing clones and I don't consider them to be big enough
problems worth breaking changes if git itself doesn't complain
by default.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-05  5:40 Warnings from git fsck after lkml import ebiederm
2018-07-05 23:13 ` Eric Wong
2018-07-06  0:36   ` ebiederm
2018-07-06  3:47     ` ebiederm
2018-07-06 21:32       ` [PATCH] MsgTime.pm: Use strptime to compute the time zone ebiederm
2018-07-06 22:22         ` Eric Wong
2018-07-07 18:18           ` ebiederm
2018-07-07 18:22           ` [PATCH] Import: Don't copy nulls from emails into git ebiederm
2018-07-08  0:07             ` Eric Wong
2018-07-08  1:52               ` ebiederm
2018-07-12 18:31   ` Warnings from git fsck after lkml import Konstantin Ryabitsev
2018-07-12 22:19     ` ebiederm
2018-07-12 22:29     ` Eric Wong

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.org/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox