git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Thomas Bock <bockthom@cs.uni-saarland.de>
Cc: Derrick Stolee <derrickstolee@github.com>,
	Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org
Subject: [PATCH 2/3] parse_commit(): parse timestamp from end of line
Date: Sat, 22 Apr 2023 09:47:03 -0400
Message-ID: <20230422134703.GB3942326@coredump.intra.peff.net>
In-Reply-To: <20230422134150.GA3516940@coredump.intra.peff.net>

To find the committer timestamp, we parse left-to-right looking for the
closing ">" of the email, and then expect the timestamp right after
that. We've seen some broken cases in the wild where this fails, even
though we _could_ find the timestamp with a little extra work. E.g.:

  Name <Name<email>> 123456789 -0500

This means that features that rely on the committer timestamp, like
--since or --until, will treat the commit as happening at time 0 (i.e.,
1970).
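
To make the failure concrete, here is a minimal sketch of the
left-to-right approach. It is not the code in commit.c:
date_after_first_gt() is a hypothetical name, and strtoul() stands in
for parse_timestamp(). It only shows why stopping at the first ">"
yields 0 for the broken ident above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the old scan: find the first '>' and
 * expect the timestamp right after it. "line" must be NUL-terminated.
 */
static unsigned long date_after_first_gt(const char *line)
{
	const char *p = strchr(line, '>');

	if (!p)
		return 0;
	/*
	 * strtoul() skips leading whitespace. If the next non-space
	 * character is not a digit (e.g. a second '>'), no conversion
	 * happens and it returns 0.
	 */
	return strtoul(p + 1, NULL, 10);
}
```

With "committer Name <email> 123456789 -0500\n" this returns 123456789,
but with "committer Name <Name<email>> 123456789 -0500\n" the first ">"
is followed by another ">", so the result is 0.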

This is doubly confusing because the pretty-print parser learned to
handle these in 03818a4a94 (split_ident: parse timestamp from end of
line, 2013-10-14). So printing them via "git show", etc., makes
everything look normal, but --until and friends are still broken
(despite the fact that that commit explicitly mentioned --until!).

So let's use the same trick as 03818a4a94: find the end of the line, and
parse back to the final ">". In theory we could use split_ident_line()
here, but it's actually a bit more strict. In particular, it requires a
valid time-zone token, too. That should be present, of course, but we
wouldn't want to break --until for malformed cases that are working
currently.

We might want to teach split_ident_line() to become more lenient there,
but it would require checking its many callers (since right now they can
assume that if date_start is non-NULL, so is tz_start).

So for now we'll just reimplement the same trick in the commit parser.
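
As a standalone illustration of the trick (a sketch only, separate from
the real change in the diff below), the backwards walk can be written
like this. date_after_last_gt() is a hypothetical name, strtoul()
stands in for parse_timestamp(), and the buffer is assumed to be
NUL-terminated so that strtoul() is safe to call.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find the end of the first line in buf[0..len),
 * walk back to the final '>' on that line, and parse the number that
 * follows it. Returns 0 if the line is malformed.
 */
static unsigned long date_after_last_gt(const char *buf, size_t len)
{
	const char *eol = memchr(buf, '\n', len);
	const char *p;

	if (!eol)
		return 0; /* no complete line */

	/* stop with p pointing just past the last '>' before eol */
	for (p = eol; p > buf && p[-1] != '>'; p--)
		; /* nothing */

	if (p == buf || p == eol)
		return 0; /* no '>' at all, or nothing after it */

	/* parsing stops at the first non-digit, at the latest at '\n' */
	return strtoul(p, NULL, 10);
}
```

For "committer Name <Name<email>> 123456789 -0500\n" this returns
123456789, since only the last ">" on the line matters. A line with no
time zone, such as "committer name <email> 1234567890\n", still parses,
because nothing here looks past the timestamp itself.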

The test is in t4212, which already covers similar cases, courtesy of
03818a4a94. We'll just adjust the broken commit to munge both the author
and committer timestamps. Note that we could match (author|committer)
here, but alternation can't be used portably in sed. Since we wouldn't
expect to see ">" except as part of an ident line, we can just match
that character on any line.

Signed-off-by: Jeff King <peff@peff.net>
---
This is more or less the same as what I posted earlier, but using
memchr() where appropriate (we could use memrchr(), too, but I don't
think it's portable enough).

Note that both before and after my series, there are cases where
parse_commit() is more lenient than split_ident_line(), because of the
time-zone thing. For example:

  committer name <email> 1234567890\n

will show as 1970, but --until, etc, will work as expected (so the
opposite of the case that started this thread). So I do think making
split_ident_line() more lenient may be worth doing, but I punted on that
for now. This series takes us in a strictly better direction with
respect to visible behavior, even if we might be able to clean up the
internals later by reusing code.

 commit.c               | 19 ++++++++++++-------
 t/t4212-log-corrupt.sh |  7 ++++++-
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/commit.c b/commit.c
index 6d844da9a6..ede810ac1c 100644
--- a/commit.c
+++ b/commit.c
@@ -95,6 +95,7 @@ struct commit *lookup_commit_reference_by_name(const char *name)
 static timestamp_t parse_commit_date(const char *buf, const char *tail)
 {
 	const char *dateptr;
+	const char *eol;
 
 	if (buf + 6 >= tail)
 		return 0;
@@ -106,16 +107,20 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
 		return 0;
 	if (memcmp(buf, "committer", 9))
 		return 0;
-	while (buf < tail && *buf++ != '>')
-		/* nada */;
-	if (buf >= tail)
+
+	/*
+	 * parse to end-of-line and then walk backwards, which
+	 * handles some malformed cases.
+	 */
+	eol = memchr(buf, '\n', tail - buf);
+	if (!eol)
 		return 0;
-	dateptr = buf;
-	while (buf < tail && *buf++ != '\n')
+	for (dateptr = eol; dateptr > buf && dateptr[-1] != '>'; dateptr--)
 		/* nada */;
-	if (buf >= tail)
+	if (dateptr == buf || dateptr == eol)
 		return 0;
-	/* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */
+
+	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
 	return parse_timestamp(dateptr, NULL, 10);
 }
 
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index 8b5433ea74..af4b35ff56 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -9,7 +9,7 @@ test_expect_success 'setup' '
 	test_commit foo &&
 
 	git cat-file commit HEAD >ok.commit &&
-	sed "/^author /s/>/>-<>/" <ok.commit >broken_email.commit &&
+	sed "s/>/>-<>/" <ok.commit >broken_email.commit &&
 
 	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
@@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' '
 	test_must_be_empty actual.err
 '
 
+test_expect_success '--until handles broken email' '
+	git rev-list --until=1980-01-01 broken_email >actual &&
+	test_must_be_empty actual
+'
+
 munge_author_date () {
 	git cat-file commit "$1" >commit.orig &&
 	sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge &&
-- 
2.40.0.653.g15ca972062

