git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Thomas Bock <bockthom@cs.uni-saarland.de>
Cc: git@vger.kernel.org
Subject: Re: Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980
Date: Sat, 15 Apr 2023 04:52:07 -0400	[thread overview]
Message-ID: <20230415085207.GA656008@coredump.intra.peff.net> (raw)
In-Reply-To: <7728e059-d58d-cce7-c011-fbc16eb22fb9@cs.uni-saarland.de>

On Fri, Apr 14, 2023 at 01:37:57PM +0200, Thomas Bock wrote:

> Example to reproduce:
> 
> git clone git@github.com:LibreOffice/core.git libreoffice
> cd libreoffice
> git log --no-merges --before="1980-01-01T00:00:00+0000"
> --format=%H,%ct,%ci,%ad

Thanks for providing a clear example and reproduction recipe. It's an
interesting case. The commits _are_ malformed, but only slightly. For
example:

  $ git cat-file commit d3b03514dde317473db0d247f21405b5db6a727e
  tree c7526375c71327a195714e9e325b66a9ad013d74
  parent 61099481271709723469421181f65e6219cbc271
  author Andre Fischer<andre.f.fischer <Andre Fischer<andre.f.fischer@oracle.com>> 1297247749 +0100
  committer Andre Fischer<andre.f.fischer <Andre Fischer<andre.f.fischer@oracle.com>> 1297247749 +0100

There's an extra weird set of angle brackets in both the author and
committer lines.

But here's the really quirky part: there actually two parsers being used
in Git here. One is in parse_commit_buffer(), which does the minimum to
fill in the fields of a "struct commit": parents, tree, and committer
timestamp.  That parser calls parse_commit_date(), which finds the first
">" and then tries to parse "> 1297247749 +0100" as a timestamp, which
fails. So it uses the sentinel value "0" (aka 1970-01-01). And that's
what gets used for "--before" (and "--since", and things like queue
order for showing commits).

But then when we actually _display_ the commit, we have a whole
pretty-printing system. There we usually end up in split_ident_line(),
which is a bit more forgiving, due to 03818a4a94 (split_ident: parse
timestamp from end of line, 2013-10-14). It finds the trailing ">" from
the right-hand side, which in this case yields the correct timestamp.

We could probably use the same trick in parse_commit_date(), something
like:

diff --git a/commit.c b/commit.c
index 6d844da9a6..72d21a2bb8 100644
--- a/commit.c
+++ b/commit.c
@@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name)
 static timestamp_t parse_commit_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
+	const char *eol;
 
 	if (buf + 6 >= tail)
 		return 0;
@@ -106,16 +107,22 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 		return 0;
 	if (memcmp(buf, "committer", 9))
 		return 0;
-	while (buf < tail && *buf++ != '>')
+
+	/*
+	 * parse to end-of-line and then walk backwards, which
+	 * handles some malformed cases. Alternatively, once
+	 * we have eol, we could just call split_ident_line()
+	 */
+	eol = buf;
+	while (eol < tail && *eol++ != '\n')
 		/* nada */;
-	if (buf >= tail)
+	if (eol >= tail)
 		return 0;
-	dateptr = buf;
-	while (buf < tail && *buf++ != '\n')
+	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
 		/* nada */;
-	if (buf >= tail)
+	if (dateptr == buf || dateptr == eol)
 		return 0;
-	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
+	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
 	return parse_timestamp(dateptr, NULL, 10);
 }
 

which makes your case behave as expected. It may make sense to use
split_ident_line(), which I think has a few other rules (it actually
checks for something that looks like numbers, for instance).

There may be also cases where the two diverge. Obviously having two
parsers isn't ideal. I think sharing the code may involve a lot of work,
though. The pretty-print parser is interested in pulling out more
information, and is less focused on performance. Parsing commits is
traditionally a hot path, as we historically had to parse every commit,
even if we weren't showing it (including non-log operations like
computing merge bases, reachability, and so forth).

But that may not matter so much. One, we already inflate the whole
commit object, not just the header. So even if we spend a few extra
instructions on parsing, it may not be noticeable. And two, these days
we often cache commit metadata in the commit-graph files, which avoids
loading the commit message entirely (and thus this parsing) for most
operations.

Of course it may also be reasonable to consider this mystery solved and
leave the code as-is. These objects _are_ malformed. I note they're all
at least a decade old, and most of them are from a handful of committers
(who presumably had misconfigured idents for a while). These days I
think Git wouldn't allow them (I don't think it rejects the commit, but
it does screen out the syntactically bogus characters).

-Peff

  reply	other threads:[~2023-04-15  8:52 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-14 11:37 Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Thomas Bock
2023-04-15  8:52 ` Jeff King [this message]
2023-04-15  8:59   ` Jeff King
2023-04-15 14:10   ` Kristoffer Haugsbakk
2023-04-17  5:40     ` Jeff King
2023-04-17  6:20       ` Kristoffer Haugsbakk
2023-04-17  7:41         ` Jeff King
2023-04-27 22:32           ` Kristoffer Haugsbakk
2023-04-17  9:51   ` Junio C Hamano
2023-04-18  4:12     ` Jeff King
2023-04-18 14:02       ` Derrick Stolee
2023-04-21 14:51         ` Thomas Bock
2023-04-22 13:41           ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Jeff King
2023-04-22 13:42             ` [PATCH 1/3] t4212: avoid putting git on left-hand side of pipe Jeff King
2023-04-22 13:47             ` [PATCH 2/3] parse_commit(): parse timestamp from end of line Jeff King
2023-04-24 17:05               ` Junio C Hamano
2023-04-25  5:23                 ` Jeff King
2023-04-24 16:39             ` [PATCH 0/3] fixing some parse_commit() timestamp corner cases Junio C Hamano
2023-04-25  5:52             ` [PATCH v2 " Jeff King
2023-04-25  5:54               ` Jeff King
2023-04-25  5:54               ` [PATCH v2 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
2023-04-25  5:54               ` [PATCH v2 2/4] parse_commit(): parse timestamp from end of line Jeff King
2023-04-25  5:54               ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
2023-04-25 10:11                 ` Phillip Wood
2023-04-25 16:06                   ` Junio C Hamano
2023-04-26 11:36                     ` Jeff King
2023-04-26 15:32                       ` Junio C Hamano
2023-04-27  8:13                         ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
2023-04-27  8:14                           ` [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Jeff King
2023-04-27  8:14                           ` [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Jeff King
2023-04-27  8:17                           ` [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Jeff King
2023-04-27 10:11                             ` Phillip Wood
2023-04-27 11:55                               ` Phillip Wood
2023-04-27 16:46                                 ` Jeff King
2023-04-27 16:20                               ` Junio C Hamano
2023-04-27 16:55                                 ` Jeff King
2023-04-27 16:25                             ` Junio C Hamano
2023-04-27 16:57                               ` Jeff King
2023-04-27  8:17                           ` [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
2023-04-27  8:18                           ` [PATCH v3 0/4] fixing some parse_commit() timestamp corner cases Jeff King
2023-04-27 16:32                           ` Junio C Hamano
2023-04-26 14:06                     ` [PATCH v2 3/4] parse_commit(): handle broken whitespace-only timestamp Phillip Wood
2023-04-26 14:31                       ` Andreas Schwab
2023-04-26 14:44                         ` Phillip Wood
2023-04-25  5:55               ` [PATCH v2 4/4] parse_commit(): describe more date-parsing failure modes Jeff King
2023-04-22 13:52         ` Weird behavior of 'git log --before' or 'git log --date-order': Commits from 2011 are treated to be before 1980 Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230415085207.GA656008@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=bockthom@cs.uni-saarland.de \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).