git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Patching Git to handle dates before the Unix epoch
@ 2019-09-10 14:14 Diomidis Spinellis
  2019-09-10 16:08 ` Randall S. Becker
  2019-09-10 16:26 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Diomidis Spinellis @ 2019-09-10 14:14 UTC (permalink / raw)
  To: git

As people use Git to create synthetic commits of code written in the 
past [1,2] it becomes important to handle dates before the Unix epoch 
(1/1/1970).  I see that modern C libraries, Unix kernels, and tools can 
handle such dates.  However Git seems to mishandle such dates in several 
places, such as in date.c [3,4].  I'm planning to work on a fix, but 
before I embark on this I have a few questions.

- Do you see any reasons that may prevent the acceptance of such a patch?
- Can you think of any non-obvious issues (e.g. backward compatibility, 
switch to the Gregorian calendar) I should be aware of?
- Should I submit changes as a bug fix on the maint branch or as a new 
feature on master?

[1] https://github.com/dspinellis/unix-history-repo (original dates)
[2] https://github.com/chrislgarry/Apollo-11 (wrong dates)
[3] https://github.com/git/git/blob/master/date.c#L21
[4] https://github.com/git/git/blob/master/date.c#L776

Diomidis Spinellis - https://www.spinellis.gr

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: Patching Git to handle dates before the Unix epoch
  2019-09-10 14:14 Patching Git to handle dates before the Unix epoch Diomidis Spinellis
@ 2019-09-10 16:08 ` Randall S. Becker
  2019-09-10 16:31   ` Jeff King
  2019-09-10 16:26 ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Randall S. Becker @ 2019-09-10 16:08 UTC (permalink / raw)
  To: 'Diomidis Spinellis', git

On September 10, 2019 10:15 AM, Diomidis Spinellis wrote:
> As people use Git to create synthetic commits of code written in the past
> [1,2] it becomes important to handle dates before the Unix epoch
> (1/1/1970).  I see that modern C libraries, Unix kernels, and tools can handle
> such dates.  However Git seems to mishandle such dates in several places,
> such as in date.c [3,4].  I'm planning to work on a fix, but before I embark on
> this I have a few questions.
> 
> - Do you see any reasons that may prevent the acceptance of such a patch?
> - Can you think of any non-obvious issues (e.g. backward compatibility,
> switch to the Gregorian calendar) I should be aware of?
> - Should I submit changes as a bug fix on the maint branch or as a new
> feature on master?

My suggestion is a new feature as a patch. See other contributions. While you're at this, especially given how extensive this may be given the time_t references, it might be useful to examine the end of epoch concerns as well. 2036 is not that far off and some of the repositories I am managing will likely last beyond the Unix date rollover. There are no time64_t to get us to 2106, but even that is probably not sufficient (thinking beyond my own expiry date). The concept of unlimited date ranges is intriguing if we are going to store broader range artifacts in git, like signatures of physical core samples and use the carbon dating in git, or some Sci-Fi conceptual commit on old 1701D. So if I had a preference, it would be to store an extensible date range going from the Big Bang to the Big Crunch (or beyond), but that is excessive.

Sadly my platform's own date ranges are 64-bit microsecond that range from year 1 (AD) - 10000 (AD) and I can't map easily to Unix dates.

Just my $0.02,
Randall


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Patching Git to handle dates before the Unix epoch
  2019-09-10 14:14 Patching Git to handle dates before the Unix epoch Diomidis Spinellis
  2019-09-10 16:08 ` Randall S. Becker
@ 2019-09-10 16:26 ` Jeff King
  2019-09-10 17:19   ` Diomidis Spinellis
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2019-09-10 16:26 UTC (permalink / raw)
  To: Diomidis Spinellis; +Cc: git

On Tue, Sep 10, 2019 at 05:14:53PM +0300, Diomidis Spinellis wrote:

> As people use Git to create synthetic commits of code written in the past
> [1,2] it becomes important to handle dates before the Unix epoch (1/1/1970).
> I see that modern C libraries, Unix kernels, and tools can handle such
> dates.  However Git seems to mishandle such dates in several places, such as
> in date.c [3,4].  I'm planning to work on a fix, but before I embark on this
> I have a few questions.

Coincidentally I was looking into this yesterday. See below.

> - Do you see any reasons that may prevent the acceptance of such a patch?

No, I think this is where we want to go. One of the long-standing
blockers was that timestamps were treated as "unsigned long" all over
the code base. That was cleaned up into a single date type around
dddbad728c (timestamp_t: a new data type for timestamps, 2017-04-26).

So there's definitely still work to be done before switching that to a
signed type, but most of the far-reaching changes have already been
done.

> - Can you think of any non-obvious issues (e.g. backward compatibility,
> switch to the Gregorian calendar) I should be aware of?

The big question is: what will existing implementations do with a commit
object that contains a negative timestamp?

There are a few answers I found (at least for existing versions of Git;
I didn't check other tools):

  - git-log and friends will realize they can't parse it and quietly
    rewrite it to "0 +0000" (e.g., when formatting human-readable author
    dates). That's pretty reasonable.

  - fsck will complain, which may cause servers to reject objects if
    they use transfer.fsckObjects. Once there are patches, presumably
    most hosting sites would apply them relatively quickly, but it's
    likely to cause some friction. OTOH, you only see that friction if
    you're working with a history that has such negative timestamps.

  - I'm not sure what we'd do for commit timestamps, which are still
    stored as 32-bit unsigned integers for the purposes of traversal
    (and ditto in commit graphs). IMHO the correct thing would be having
    modern commit timestamps and "real" author timestamps. I'm not sure
    if fsck should enforce that or not.

> - Should I submit changes as a bug fix on the maint branch or as a new
> feature on master?

While it is a bug fix, I think it's a big enough change that it should
be part of a major release. So base it on master.

Here's how far I got working on this yesterday. It's quite messy, but
seems to work at least for basic tests like:

  $ git commit --date='@-6105998402 -0500' --allow-empty -m foo
  [master f83702e] foo
   Date: Thu Jul 4 12:19:58 1776 -0500

  $ git show
  commit f83702e850366c0a9a6dc5a5fb70d21072cf40e5 (HEAD -> master)
  Author: Jeff King <peff@peff.net>
  Date:   Thu Jul 4 12:19:58 1776 -0500

      foo

It obviously needs split up into individual patches with each bit
explained. The most glaring problem is that it switches our custom
tm_to_time_t() to use timegm(), which isn't portable. We can probably
pull a fallback implementation from glibc or elsewhere (I've seen a few
floating around the net, and they're less horrible than you might think
because they don't have to deal with timezones).

There are probably more corner cases to be poked at and found. Notably
this _doesn't_ work:

  $ git commit --date='1776-07-04 12:19:58' ...

because we're too picky in our parsing. There might be some improvement
possible there, at least for unambiguous iso8601-looking cases
(approxidate will probably always have use heuristics that assume a
recent-ish date).

---
 Makefile          |  5 ++---
 date.c            | 42 ++++++++++++++++++------------------------
 git-compat-util.h |  9 +++++----
 ident.c           |  4 +++-
 4 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/Makefile b/Makefile
index ad71ae1219..41cdb4bd8c 100644
--- a/Makefile
+++ b/Makefile
@@ -2525,8 +2525,7 @@ endif
 ## Git repository (not a tarball extract), (2) any local modifications
 ## will be lost.
 ## Gettext tools cannot work with our own custom PRItime type, so
-## we replace PRItime with PRIuMAX.  We need to update this to
-## PRIdMAX if we switch to a signed type later.
+## we replace PRItime with PRIdMAX.
 
 po/git.pot: $(GENERATED_H) FORCE
 	# All modifications will be reverted at the end, so we do not
@@ -2535,7 +2534,7 @@ po/git.pot: $(GENERATED_H) FORCE
 
 	@for s in $(LOCALIZED_C) $(LOCALIZED_SH) $(LOCALIZED_PERL); \
 	do \
-		sed -e 's|PRItime|PRIuMAX|g' <"$$s" >"$$s+" && \
+		sed -e 's|PRItime|PRIdMAX|g' <"$$s" >"$$s+" && \
 		cat "$$s+" >"$$s" && rm "$$s+"; \
 	done
 
diff --git a/date.c b/date.c
index 8126146c50..52e54e3e2a 100644
--- a/date.c
+++ b/date.c
@@ -7,27 +7,14 @@
 #include "cache.h"
 
 /*
- * This is like mktime, but without normalization of tm_wday and tm_yday.
+ * This is like mktime, but without normalization of tm_wday and tm_yday,
+ * and always operating in UTC, not the local timezone.
  */
-static time_t tm_to_time_t(const struct tm *tm)
+static time_t tm_to_time_t(const struct tm *tm_const)
 {
-	static const int mdays[] = {
-	    0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
-	};
-	int year = tm->tm_year - 70;
-	int month = tm->tm_mon;
-	int day = tm->tm_mday;
-
-	if (year < 0 || year > 129) /* algo only works for 1970-2099 */
-		return -1;
-	if (month < 0 || month > 11) /* array bounds */
-		return -1;
-	if (month < 2 || (year + 2) % 4)
-		day--;
-	if (tm->tm_hour < 0 || tm->tm_min < 0 || tm->tm_sec < 0)
-		return -1;
-	return (year * 365 + (year + 1) / 4 + mdays[month] + day) * 24*60*60UL +
-		tm->tm_hour * 60*60 + tm->tm_min * 60 + tm->tm_sec;
+	struct tm tm = *tm_const;
+	/* this is a GNUism; we need to provide a fallback */
+	return timegm(&tm);
 }
 
 static const char *month_names[] = {
@@ -51,8 +38,7 @@ static time_t gm_time_t(timestamp_t time, int tz)
 		if (unsigned_add_overflows(time, minutes * 60))
 			die("Timestamp+tz too large: %"PRItime" +%04d",
 			    time, tz);
-	} else if (time < -minutes * 60)
-		die("Timestamp before Unix epoch: %"PRItime" %04d", time, tz);
+	}
 	time += minutes * 60;
 	if (date_overflows(time))
 		die("Timestamp too large for this system: %"PRItime, time);
@@ -773,10 +759,11 @@ static int match_object_header_date(const char *date, timestamp_t *timestamp, in
 	timestamp_t stamp;
 	int ofs;
 
-	if (*date < '0' || '9' < *date)
+	if (*date != '-' && (*date < '0' || '9' < *date))
 		return -1;
+	errno = 0;
 	stamp = parse_timestamp(date, &end, 10);
-	if (*end != ' ' || stamp == TIME_MAX || (end[1] != '+' && end[1] != '-'))
+	if (*end != ' ' || errno == ERANGE || (end[1] != '+' && end[1] != '-'))
 		return -1;
 	date = end + 2;
 	ofs = strtol(date, &end, 10);
@@ -841,6 +828,13 @@ int parse_date_basic(const char *date, timestamp_t *timestamp, int *offset)
 		date += match;
 	}
 
+	if (tm.tm_year < 0 ||
+	    tm.tm_mon < 0 || tm.tm_mon > 11 ||
+	    tm.tm_hour < 0 || tm.tm_hour > 23 ||
+	    tm.tm_min < 0 || tm.tm_min > 59 ||
+	    tm.tm_sec < 0 || tm.tm_sec > 59)
+		return -1;
+
 	/* do not use mktime(), which uses local timezone, here */
 	*timestamp = tm_to_time_t(&tm);
 	if (*timestamp == -1)
@@ -1322,7 +1316,7 @@ int date_overflows(timestamp_t t)
 	time_t sys;
 
 	/* If we overflowed our timestamp data type, that's bad... */
-	if ((uintmax_t)t >= TIME_MAX)
+	if (t >= TIME_MAX || t <= TIME_MIN)
 		return 1;
 
 	/*
diff --git a/git-compat-util.h b/git-compat-util.h
index 83be89de0a..9a93d43429 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -340,10 +340,11 @@ char *gitdirname(char *);
 #define PRIo32 "o"
 #endif
 
-typedef uintmax_t timestamp_t;
-#define PRItime PRIuMAX
-#define parse_timestamp strtoumax
-#define TIME_MAX UINTMAX_MAX
+typedef intmax_t timestamp_t;
+#define PRItime PRIdMAX
+#define parse_timestamp strtoimax
+#define TIME_MAX INTMAX_MAX
+#define TIME_MIN INTMAX_MIN
 
 #ifndef PATH_SEP
 #define PATH_SEP ':'
diff --git a/ident.c b/ident.c
index e666ee4e59..2734bf890d 100644
--- a/ident.c
+++ b/ident.c
@@ -322,10 +322,12 @@ int split_ident_line(struct ident_split *split, const char *line, int len)
 	if (line + len <= cp)
 		goto person_only;
 	split->date_begin = cp;
+	if (*cp == '-')
+		cp++;
 	span = strspn(cp, "0123456789");
 	if (!span)
 		goto person_only;
-	split->date_end = split->date_begin + span;
+	split->date_end = cp + span;
 	for (cp = split->date_end; cp < line + len && isspace(*cp); cp++)
 		;
 	if (line + len <= cp || (*cp != '+' && *cp != '-'))
-- 
2.23.0.663.gbe3d117559


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Patching Git to handle dates before the Unix epoch
  2019-09-10 16:08 ` Randall S. Becker
@ 2019-09-10 16:31   ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2019-09-10 16:31 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Diomidis Spinellis', git

On Tue, Sep 10, 2019 at 12:08:34PM -0400, Randall S. Becker wrote:

> My suggestion is a new feature as a patch. See other contributions.
> While you're at this, especially given how extensive this may be given
> the time_t references, it might be useful to examine the end of epoch
> concerns as well. 2036 is not that far off and some of the
> repositories I am managing will likely last beyond the Unix date
> rollover. There are no time64_t to get us to 2106, but even that is
> probably not sufficient (thinking beyond my own expiry date). The
> concept of unlimited date ranges is intriguing if we are going to
> store broader range artifacts in git, like signatures of physical core
> samples and use the carbon dating in git, or some Sci-Fi conceptual
> commit on old 1701D. So if I had a preference, it would be to store an
> extensible date range going from the Big Bang to the Big Crunch (or
> beyond), but that is excessive.

Internally we now use a uintmax_t (with a resolution of 1-second) for
parsing and storing most timestamps, which should presumably be at least
64-bit on most systems. We do convert to time_t at various points,
recognizing truncation and switching that to a sentinel value (so at
least you get "Jan 1 1970" instead of some random wrapped time).

So I think we should behave sanely in all cases (at least going forward
from 1970 for now), and it's a quality-of-implementation issue whether
any given system can pretty-print specific dates.

Switching that to intmax_t should get us pretty good coverage in the
opposite direction, and anybody with a 64-bit signed time_t on their
system will generally be happy.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Patching Git to handle dates before the Unix epoch
  2019-09-10 16:26 ` Jeff King
@ 2019-09-10 17:19   ` Diomidis Spinellis
  2019-09-13  5:12     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Diomidis Spinellis @ 2019-09-10 17:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 10-Sep-19 19:26, Jeff King wrote:
> On Tue, Sep 10, 2019 at 05:14:53PM +0300, Diomidis Spinellis wrote:
>> As people use Git to create synthetic commits of code written in the past
>> [1,2] it becomes important to handle dates before the Unix epoch (1/1/1970).
>> I see that modern C libraries, Unix kernels, and tools can handle such
>> dates.  However Git seems to mishandle such dates in several places, such as
>> in date.c [3,4].  I'm planning to work on a fix, but before I embark on this
>> I have a few questions.
[...]
>> - Can you think of any non-obvious issues (e.g. backward compatibility,
>> switch to the Gregorian calendar) I should be aware of?
> 
> The big question is: what will existing implementations do with a commit
> object that contains a negative timestamp?

Negative timestamps can already be created, because some Git libraries 
can create such objects, and one can also create them by hand; see 
http://git.661346.n2.nabble.com/Back-dating-commits-way-back-for-constitution-git-tp5365303p5367657.html

Consequently, I don't think that worrying about how existing clients 
will handle such timestamps should be a big issue, as long as we don't 
break the handling of modern dates.

 > Coincidentally I was looking into this yesterday. See below.
[...]
> Here's how far I got working on this yesterday. It's quite messy, but
> seems to work at least for basic tests like:

Amazing!  Given that you have made significant progress, I think it's 
best to leave it to you to complete it, right?

Diomidis

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Patching Git to handle dates before the Unix epoch
  2019-09-10 17:19   ` Diomidis Spinellis
@ 2019-09-13  5:12     ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2019-09-13  5:12 UTC (permalink / raw)
  To: Diomidis Spinellis; +Cc: git

On Tue, Sep 10, 2019 at 08:19:49PM +0300, Diomidis Spinellis wrote:

> > > - Can you think of any non-obvious issues (e.g. backward compatibility,
> > > switch to the Gregorian calendar) I should be aware of?
> > 
> > The big question is: what will existing implementations do with a commit
> > object that contains a negative timestamp?
> 
> Negative timestamps can already be created, because some Git libraries can
> create such objects, and one can also create them by hand; see http://git.661346.n2.nabble.com/Back-dating-commits-way-back-for-constitution-git-tp5365303p5367657.html
> 
> Consequently, I don't think that worrying about how existing clients will
> handle such timestamps should be a big issue, as long as we don't break the
> handling of modern dates.

Yes, it's easy to create lots of things with git-hash-object, but that
doesn't mean we should be encouraging people (yet). :) That said, I
think the handling here is pretty favorable: if you're not using very
old timestamps, you won't see any problem. And if you do use them, the
old presentation code handles it pretty well. We'd probably want to warn
people that older versions of Git may complain when fsck-ing.

> > Here's how far I got working on this yesterday. It's quite messy, but
> > seems to work at least for basic tests like:
> 
> Amazing!  Given that you have made significant progress, I think it's best
> to leave it to you to complete it, right?

I was afraid you'd say that. :)

I did spend a few more minutes looking into it after the last email.
There's a rabbit hole of handling timestamp parsing where many cases are
not actually checking appropriately for errors. So I think we'd want to
deal with those sites, too, while we're touching parse_timestamp().

It's not clear to me yet how deep the rabbit hole of little fixes goes
(or how deep we need to go for this particular feature). So I hope to
return to it when I have time, but if you or anybody else would like to
poke at it in the meantime, be my guest.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-09-13  5:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-10 14:14 Patching Git to handle dates before the Unix epoch Diomidis Spinellis
2019-09-10 16:08 ` Randall S. Becker
2019-09-10 16:31   ` Jeff King
2019-09-10 16:26 ` Jeff King
2019-09-10 17:19   ` Diomidis Spinellis
2019-09-13  5:12     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).