[PATCH] split_ident: parse timestamp from end of line


* [PATCH] split_ident: parse timestamp from end of line
@ 2013-10-14 20:27 Jeff King
  2013-10-14 22:25 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2013-10-14 20:27 UTC (permalink / raw)
  To: git

Split_ident currently parses left to right. Given this
input:

  Your Name <email@example.com> 123456789 -0500\n

We assume the name starts the line and runs until the first
"<".  That starts the email address, which runs until the
first ">".  Everything after that is assumed to be the
timestamp.

This works fine in the normal case, but is easily broken by
corrupted ident lines that contain an extra ">". Some
examples seen in the wild are:

  1. Name <email>-<> 123456789 -0500\n

  2. Name <email> <Name<email>> 123456789 -0500\n

  3. Name1 <email1>, Name2 <email2> 123456789 -0500\n

Currently each of these produces some email address (which
is not necessarily the one the user intended) and ends up
with a NULL date (which is generally interpreted as the
epoch by "git log" and friends).

But in each case we could get the correct timestamp simply
by parsing from the right-hand side, looking backwards for
the final ">", and then reading the timestamp from there.

In general, it's a losing battle to try to automatically
guess what the user meant with their broken crud. But this
particular workaround is probably worth doing.  One, it's
dirt simple, and can't impact non-broken cases. Two, it
catches not just a single breakage we've seen, but rather a
large class of errors (i.e., any breakage inside the email
angle brackets may affect the email, but won't spill over
into the timestamp parsing). And three, the timestamp is
arguably more valuable to get right, because it can affect
correctness (e.g., in --until cutoffs).

This patch implements the right-to-left scheme described
above. We adjust the tests in t4212, which generate a commit
with such a broken ident and now expect the correct timestamp.
We also add a test that fsck continues to detect the
breakage.

For reference, here are pointers to the breakages seen (as
numbered above):

[1] http://article.gmane.org/gmane.comp.version-control.git/221441

[2] http://article.gmane.org/gmane.comp.version-control.git/222362

[3] http://perl5.git.perl.org/perl.git/commit/13b79730adea97e660de84bbe67f9d7cbe344302

Signed-off-by: Jeff King <peff@peff.net>
---
You could take this concept further and try to do something clever with
the email when we notice the extra ">". But I think that is where this
crosses from "easily and simply covers a class of errors" into "losing
proposition trying to tweak heuristics around various breakages".

The only thing that gives me pause here is that parsing from the right
would close the door to ever adding any new information on the end of an
ident line. I'd be surprised if that door wasn't already closed by the
existing parsers, but I feel like the topic might have come up sometime
in the past year or two (but I can't seem to find anything in the
archive).
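
To make the difference between the two scan directions concrete, here
is a minimal standalone sketch. It is not the code in ident.c; the
helper names after_first_ket() and after_last_ket() are hypothetical
and exist only for illustration. Each returns a pointer to the byte
just past a ">", which is where the timestamp parse would begin.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a pointer just past the FIRST '>' in
 * line[0..len), or NULL if there is none. This mimics where the old
 * left-to-right scheme starts looking for the timestamp.
 */
static const char *after_first_ket(const char *line, size_t len)
{
	size_t i;

	assert(line != NULL);
	for (i = 0; i < len; i++)
		if (line[i] == '>')
			return line + i + 1;
	return NULL;
}

/*
 * Hypothetical helper: return a pointer just past the LAST '>' in
 * line[0..len), or NULL if there is none. This mimics the
 * right-to-left scheme.
 */
static const char *after_last_ket(const char *line, size_t len)
{
	size_t i;

	assert(line != NULL);
	for (i = len; i > 0; i--)
		if (line[i - 1] == '>')
			return line + i;
	return NULL;
}
```

For breakage 1 above, after_first_ket() lands on "-<> 123456789 -0500"
(not a parseable date), while after_last_ket() lands on
" 123456789 -0500". For a well-formed ident with a single ">", both
return the same pointer, which is why non-broken cases are unaffected.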

 ident.c                | 14 +++++++++++++-
 t/t4212-log-corrupt.sh |  9 +++++++--
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/ident.c b/ident.c
index 1c123e6..1a4b3ad 100644
--- a/ident.c
+++ b/ident.c
@@ -233,7 +233,19 @@ int split_ident_line(struct ident_split *split, const char *line, int len)
 	if (!split->mail_end)
 		return status;

-	for (cp = split->mail_end + 1; cp < line + len && isspace(*cp); cp++)
+	/*
+	 * Look from the end-of-line to find the trailing ">" of the mail
+	 * address, even though we should already know it as split->mail_end.
+	 * This can help in cases of broken idents with an extra ">" somewhere
+	 * in the email address.  Note that we are assuming the timestamp will
+	 * never have a ">" in it.
+	 *
+	 * Note also that this memrchr can never return NULL, as we would
+	 * always find at least the split->mail_end closing bracket.
+	 */
+	cp = memrchr(split->mail_end, '>', len - (split->mail_end - line));
+
+	for (cp = cp + 1; cp < line + len && isspace(*cp); cp++)
 		;
 	if (line + len <= cp)
 		goto person_only;
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index ec5099b..93c7c36 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -13,11 +13,16 @@ test_expect_success 'git log with broken author email' '
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
 '

+test_expect_success 'fsck notices broken commit' '
+	git fsck 2>actual &&
+	test_i18ngrep invalid.author actual
+'
+
 test_expect_success 'git log with broken author email' '
 	{
 		echo commit $(cat broken_email.hash)
 		echo "Author: A U Thor <author@example.com>"
-		echo "Date:   Thu Jan 1 00:00:00 1970 +0000"
+		echo "Date:   Thu Apr 7 15:13:13 2005 -0700"
 		echo
 		echo "    foo"
 	} >expect.out &&
@@ -30,7 +35,7 @@ test_expect_success 'git log --format with broken author email' '
 '

 test_expect_success 'git log --format with broken author email' '
-	echo "A U Thor+author@example.com+" >expect.out &&
+	echo "A U Thor+author@example.com+Thu Apr 7 15:13:13 2005 -0700" >expect.out &&
 	: >expect.err &&

 	git log --format="%an+%ae+%ad" broken_email >actual.out 2>actual.err &&
-- 
1.8.4.1.4.gf327177


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-14 20:27 [PATCH] split_ident: parse timestamp from end of line Jeff King
@ 2013-10-14 22:25 ` Junio C Hamano
  2013-10-14 22:31   ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-10-14 22:25 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> You could take this concept further and try to do something clever with
> the email when we notice the extra ">". But I think that is where this
> crosses from "easily and simply covers a class of errors" into "losing
> proposition trying to tweak heuristics around various breakages".

True.

> The only thing that gives me pause here is that parsing from the right
> would close the door to ever adding any new information on the end of an
> ident line. I'd be surprised if that door wasn't already closed by the
> existing parsers, but I feel like the topic might have come up sometime
> in the past year or two (but I can't seem to find anything in the
> archive).

I do not recall any, either.

The approach to parse from the right-end feels like the simplest and
the clearest one to get the piece of information that matters in the
presence of breakages like the ones you mentioned.

> +	/*
> +	 * Look from the end-of-line to find the trailing ">" of the mail
> +	 * address, even though we should already know it as split->mail_end.
> +	 * This can help in cases of broken idents with an extra ">" somewhere
> +	 * in the email address.  Note that we are assuming the timestamp will
> +	 * never have a ">" in it.
> +	 *
> +	 * Note also that this memrchr can never return NULL, as we would
> +	 * always find at least the split->mail_end closing bracket.
> +	 */
> +	cp = memrchr(split->mail_end, '>', len - (split->mail_end - line));
> +	for (cp = cp + 1; cp < line + len && isspace(*cp); cp++)
>  		;

"git grep" tells me this is the first use of memrchr(), which,
unlike memchr(), is _GNU_SOURCE-only if I am not mistaken, so we may
need a fallback definition in the compat/ and NEEDS_MEMRCHR in the
Makefile, I think.
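
For illustration only, a compat fallback could look roughly like the
sketch below. The name fallback_memrchr() is hypothetical, and this is
not taken from any existing file under compat/; it merely shows the
semantics of GNU memrchr() (find the last occurrence of a byte in the
first n bytes) in portable C.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical portable stand-in for GNU memrchr(): scan the n bytes
 * starting at s from the end toward the front and return a pointer to
 * the last byte equal to (unsigned char)c, or NULL if none is found.
 */
static void *fallback_memrchr(const void *s, int c, size_t n)
{
	const unsigned char *p;

	assert(s != NULL);
	p = (const unsigned char *)s + n;
	while (n--) {
		if (*--p == (unsigned char)c)
			return (void *)p;
	}
	return NULL;
}
```

A build would then presumably pick this up only on platforms lacking
the native function, e.g. via the NEEDS_MEMRCHR knob suggested above.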


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-14 22:25 ` Junio C Hamano
@ 2013-10-14 22:31   ` Jeff King
  2013-10-14 22:45     ` Jeff King
  2013-10-14 22:45     ` Junio C Hamano
  0 siblings, 2 replies; 9+ messages in thread
From: Jeff King @ 2013-10-14 22:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Oct 14, 2013 at 03:25:29PM -0700, Junio C Hamano wrote:

> > +	/*
> > +	 * Look from the end-of-line to find the trailing ">" of the mail
> > +	 * address, even though we should already know it as split->mail_end.
> > +	 * This can help in cases of broken idents with an extra ">" somewhere
> > +	 * in the email address.  Note that we are assuming the timestamp will
> > +	 * never have a ">" in it.
> > +	 *
> > +	 * Note also that this memrchr can never return NULL, as we would
> > +	 * always find at least the split->mail_end closing bracket.
> > +	 */
> > +	cp = memrchr(split->mail_end, '>', len - (split->mail_end - line));
> > +	for (cp = cp + 1; cp < line + len && isspace(*cp); cp++)
> >  		;
> 
> "git grep" tells me this is the first use of memrchr(), which,
> unlike memchr(), is _GNU_SOURCE-only if I am not mistaken, so we may
> need a fallback definition in the compat/ and NEEDS_MEMRCHR in the
> Makefile, I think.

Yeah, you are right[1]. I'm happy to re-roll. I wonder if we even need
to worry about a compatibility wrapper. We are already doing pointer
manipulations, and it is probably just as readable to roll the loop by
hand.

-Peff

[1] I even looked at "man memrchr" on my glibc system and was surprised
    to see it mentioned above the "#define _GNU_SOURCE" fold. But that
    "fold" is used only sometimes (e.g., strchrnul), and not others (in
    memrchr, the portability bits are listed at the end of the
    synopsis). Grr.


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-14 22:31   ` Jeff King
@ 2013-10-14 22:45     ` Jeff King
  2013-10-14 22:45     ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2013-10-14 22:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Oct 14, 2013 at 06:31:37PM -0400, Jeff King wrote:

> > "git grep" tells me this is the first use of memrchr(), which,
> > unlike memchr(), is _GNU_SOURCE-only if I am not mistaken, so we may
> > need a fallback definition in the compat/ and NEEDS_MEMRCHR in the
> > Makefile, I think.
> 
> Yeah, you are right[1]. I'm happy to re-roll. I wonder if we even need
> to worry about a compatibility wrapper. We are already doing pointer
> manipulations, and it is probably just as readable to roll the loop by
> hand.

Here is that re-roll, which does:

  -	cp = memrchr(split->mail_end, '>', len - (split->mail_end - line));
  +	for (cp = line + len - 1; *cp != '>'; cp--)
  +		;

-- >8 --
Subject: split_ident: parse timestamp from end of line

Split_ident currently parses left to right. Given this
input:

  Your Name <email@example.com> 123456789 -0500\n

We assume the name starts the line and runs until the first
"<".  That starts the email address, which runs until the
first ">".  Everything after that is assumed to be the
timestamp.

This works fine in the normal case, but is easily broken by
corrupted ident lines that contain an extra ">". Some
examples seen in the wild are:

  1. Name <email>-<> 123456789 -0500\n

  2. Name <email> <Name<email>> 123456789 -0500\n

  3. Name1 <email1>, Name2 <email2> 123456789 -0500\n

Currently each of these produces some email address (which
is not necessarily the one the user intended) and ends up
with a NULL date (which is generally interpreted as the
epoch by "git log" and friends).

But in each case we could get the correct timestamp simply
by parsing from the right-hand side, looking backwards for
the final ">", and then reading the timestamp from there.

In general, it's a losing battle to try to automatically
guess what the user meant with their broken crud. But this
particular workaround is probably worth doing.  One, it's
dirt simple, and can't impact non-broken cases. Two, it
catches not just a single breakage we've seen, but rather a
large class of errors (i.e., any breakage inside the email
angle brackets may affect the email, but won't spill over
into the timestamp parsing). And three, the timestamp is
arguably more valuable to get right, because it can affect
correctness (e.g., in --until cutoffs).

This patch implements the right-to-left scheme described
above. We adjust the tests in t4212, which generate a commit
with such a broken ident and now expect the correct timestamp.
We also add a test that fsck continues to detect the
breakage.

For reference, here are pointers to the breakages seen (as
numbered above):

[1] http://article.gmane.org/gmane.comp.version-control.git/221441

[2] http://article.gmane.org/gmane.comp.version-control.git/222362

[3] http://perl5.git.perl.org/perl.git/commit/13b79730adea97e660de84bbe67f9d7cbe344302

Signed-off-by: Jeff King <peff@peff.net>
---
 ident.c                | 16 +++++++++++++++-
 t/t4212-log-corrupt.sh |  9 +++++++--
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/ident.c b/ident.c
index 1c123e6..7d1c79c 100644
--- a/ident.c
+++ b/ident.c
@@ -233,7 +233,21 @@ int split_ident_line(struct ident_split *split, const char *line, int len)
 	if (!split->mail_end)
 		return status;

-	for (cp = split->mail_end + 1; cp < line + len && isspace(*cp); cp++)
+	/*
+	 * Look from the end-of-line to find the trailing ">" of the mail
+	 * address, even though we should already know it as split->mail_end.
+	 * This can help in cases of broken idents with an extra ">" somewhere
+	 * in the email address.  Note that we are assuming the timestamp will
+	 * never have a ">" in it.
+	 *
+	 * Note that we will always find some ">" before going off the front of
+	 * the string, because we will always hit the split->mail_end closing
+	 * bracket.
+	 */
+	for (cp = line + len - 1; *cp != '>'; cp--)
+		;
+
+	for (cp = cp + 1; cp < line + len && isspace(*cp); cp++)
 		;
 	if (line + len <= cp)
 		goto person_only;
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index ec5099b..93c7c36 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -13,11 +13,16 @@ test_expect_success 'git log with broken author email' '
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
 '

+test_expect_success 'fsck notices broken commit' '
+	git fsck 2>actual &&
+	test_i18ngrep invalid.author actual
+'
+
 test_expect_success 'git log with broken author email' '
 	{
 		echo commit $(cat broken_email.hash)
 		echo "Author: A U Thor <author@example.com>"
-		echo "Date:   Thu Jan 1 00:00:00 1970 +0000"
+		echo "Date:   Thu Apr 7 15:13:13 2005 -0700"
 		echo
 		echo "    foo"
 	} >expect.out &&
@@ -30,7 +35,7 @@ test_expect_success 'git log --format with broken author email' '
 '

 test_expect_success 'git log --format with broken author email' '
-	echo "A U Thor+author@example.com+" >expect.out &&
+	echo "A U Thor+author@example.com+Thu Apr 7 15:13:13 2005 -0700" >expect.out &&
 	: >expect.err &&

 	git log --format="%an+%ae+%ad" broken_email >actual.out 2>actual.err &&
-- 
1.8.4.1.4.gf327177


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-14 22:31   ` Jeff King
  2013-10-14 22:45     ` Jeff King
@ 2013-10-14 22:45     ` Junio C Hamano
  2013-10-14 23:29       ` Jeff King
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-10-14 22:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> Yeah, you are right[1]. I'm happy to re-roll. I wonder if we even need
> to worry about a compatibility wrapper. We are already doing pointer
> manipulations, and it is probably just as readable to roll the loop by
> hand.

Yeah, unrolling the loop is probably better.  You may even be able
to do so in a single pass with an extra "last > seen" pointer
variable without too much additional code complexity, I would think.


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-14 22:45     ` Junio C Hamano
@ 2013-10-14 23:29       ` Jeff King
  2013-10-15 17:52         ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2013-10-14 23:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Oct 14, 2013 at 03:45:42PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Yeah, you are right[1]. I'm happy to re-roll. I wonder if we even need
> > to worry about a compatibility wrapper. We are already doing pointer
> > manipulations, and it is probably just as readable to roll the loop by
> > hand.
> 
> Yeah, unrolling the loop is probably better.  You may even be able
> to do so in a single pass with an extra "last > seen" pointer
> variable without too much additional code complexity, I would think.

I'm not sure what you mean here.

If you mean doing a single pass to find the final ">", that is easy,
because we know the length of the line already and can jump past and
start from the back.

If you mean rolling it into the loop directly below, where we jump past
the whitespace, I think it's a bit more complicated. We would not want
to stop when we see something date-like, because parsing:

  Name <<bogus.email> 5678> 1234 -0500

you would want to find "1234" as the date. You can, while you are
scanning right, keep track of the end of the whitespace after ">", but I
do not think the complication is worth much. There should typically only
be one space, so you are only saving looking at a single character.
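
For what it's worth, the "keep track of the end of the whitespace while
scanning right" variant might look something like the sketch below.
The helper name date_start_single_pass() is hypothetical and this is
not proposed for ident.c; it only illustrates why a date-like token
such as "5678" must not end the scan early.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: in one left-to-right pass over line[0..len),
 * remember the first non-space byte that follows the most recent '>'.
 * Every later '>' discards the previous candidate, so only text after
 * the final '>' can be returned. Returns NULL if there is no '>' or
 * nothing but whitespace after the last one.
 */
static const char *date_start_single_pass(const char *line, size_t len)
{
	const char *date = NULL;
	int skipping = 0; /* inside the whitespace run right after a '>' */
	size_t i;

	assert(line != NULL);
	for (i = 0; i < len; i++) {
		if (line[i] == '>') {
			skipping = 1;
			date = NULL; /* earlier candidate was inside the email */
		} else if (skipping && !isspace((unsigned char)line[i])) {
			date = line + i;
			skipping = 0;
		}
	}
	return date;
}
```

On the example above, "5678" is picked up as a candidate and then
dropped when the second ">" is seen, so the result points at "1234".
The extra state variable is the complication I mean.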

-Peff


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-14 23:29       ` Jeff King
@ 2013-10-15 17:52         ` Junio C Hamano
  2013-10-15 18:03           ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-10-15 17:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

>> Yeah, unrolling the loop is probably better.  You may even be able
>> to do so in a single pass with an extra "last > seen" pointer
>> variable without too much additional code complexity, I would think.
>
> I'm not sure what you mean here.

> If you mean doing a single pass to find the final ">", that is easy,
> because we know the length of the line already and can jump past and
> start from the back.

I meant a single forward pass, like this.

 ident.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/ident.c b/ident.c
index 7d1c79c..ff29779 100644
--- a/ident.c
+++ b/ident.c
@@ -200,7 +200,7 @@ static void strbuf_addstr_without_crud(struct strbuf *sb, const char *src)
  */
 int split_ident_line(struct ident_split *split, const char *line, int len)
 {
-	const char *cp;
+	const char *cp, *last_ket;
 	size_t span;
 	int status = -1;
 
@@ -225,29 +225,22 @@ int split_ident_line(struct ident_split *split, const char *line, int len)
 		split->name_end = split->name_begin;
 	}
 
-	for (cp = split->mail_begin; cp < line + len; cp++)
-		if (*cp == '>') {
+	for (cp = split->mail_begin, last_ket = NULL; cp < line + len; cp++) {
+		if (*cp != '>')
+			continue;
+		if (!last_ket)
 			split->mail_end = cp;
-			break;
-		}
+		last_ket = cp;
+	}
 	if (!split->mail_end)
 		return status;
 
 	/*
-	 * Look from the end-of-line to find the trailing ">" of the mail
-	 * address, even though we should already know it as split->mail_end.
-	 * This can help in cases of broken idents with an extra ">" somewhere
-	 * in the email address.  Note that we are assuming the timestamp will
-	 * never have a ">" in it.
-	 *
-	 * Note that we will always find some ">" before going off the front of
-	 * the string, because we will always hit the split->mail_end closing
-	 * bracket.
+	 * Typically, last_ket is the same as split->mail_end, but with
+	 * a broken identity line, there may be multiple closing kets '>';
+	 * read the timestamp after the last one.
 	 */
-	for (cp = line + len - 1; *cp != '>'; cp--)
-		;
-
-	for (cp = cp + 1; cp < line + len && isspace(*cp); cp++)
+	for (cp = last_ket + 1; cp < line + len && isspace(*cp); cp++)
 		;
 	if (line + len <= cp)
 		goto person_only;
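
Stripped of the surrounding function, the idea in the first hunk can be
shown as a small standalone sketch. The struct ket_span and the
function find_kets() are hypothetical names used only here; in the
patch itself the two results live in split->mail_end and last_ket.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: first and last '>' found, or NULL/NULL. */
struct ket_span {
	const char *first; /* plays the role of split->mail_end */
	const char *last;  /* plays the role of last_ket */
};

/*
 * Hypothetical helper: a single forward pass over [begin, end) that
 * records the first '>' (end of the email as parsed) and the last '>'
 * (where the timestamp search should start).
 */
static struct ket_span find_kets(const char *begin, const char *end)
{
	struct ket_span span = { NULL, NULL };
	const char *cp;

	assert(begin != NULL && end != NULL && begin <= end);
	for (cp = begin; cp < end; cp++) {
		if (*cp != '>')
			continue;
		if (!span.first)
			span.first = cp;
		span.last = cp;
	}
	return span;
}
```

With a well-formed ident, first and last are the same pointer; with a
broken one they differ, and the whitespace skip begins at last + 1.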


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-15 17:52         ` Junio C Hamano
@ 2013-10-15 18:03           ` Jeff King
  2013-10-15 18:48             ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2013-10-15 18:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Oct 15, 2013 at 10:52:55AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> Yeah, unrolling the loop is probably better.  You may even be able
> >> to do so in a single pass with an extra "last > seen" pointer
> >> variable without too much additional code complexity, I would think.
> >
> > I'm not sure what you mean here.
> 
> > If you mean doing a single pass to find the final ">", that is easy,
> > because we know the length of the line already and can jump past and
> > start from the back.
> 
> I meant a single forward pass, like this.

Ah, I see. You are combining with the pass before, not the pass after.

I do not think this is any more (nor less) efficient than what I posted.
We still pass over the space after split->mail_end one additional time
searching for the closing bracket. Mine is _slightly_ more efficient in
that by going backwards we can stop when we see the first '>', avoiding
looking at the space between "mail_end" and "last_ket". But that space
is 0 in the normal case, and even if it is not, we are talking about
tens of bytes at most. So I doubt it would ever matter.

My version seems a little clearer to me, but that is probably because I
wrote it. If you strongly prefer the other, feel free to mark up my
patch.

-Peff

PS I learned a new term, "ket". I always called it "closing angle
   bracket".


* Re: [PATCH] split_ident: parse timestamp from end of line
  2013-10-15 18:03           ` Jeff King
@ 2013-10-15 18:48             ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2013-10-15 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> My version seems a little clearer to me, but that is probably because I
> wrote it. If you strongly prefer the other, feel free to mark up my
> patch.

I do not have a strong preference either way. Just that I thought two
loops would be shorter and easier to understand than three, that's
all.
