git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond
@ 2012-05-21  8:01 Bryan Turner
  2012-05-22  4:32 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Bryan Turner @ 2012-05-21  8:01 UTC (permalink / raw)
  To: git

Hello all,

I believe I've found an issue in git v1.7.10.1 and v1.7.10.2 (and
master) where the output of git rev-list has changed for some commits.
Most commits do not appear to trigger the issue; it may be related to
Unicode characters being used in the author name. I've run a git
bisect and it appears the bug was introduced in
4b340cfab9c7a18e39bc531d6a6ffaffdf95f62d.

Built from master (as well as using the 1.7.10.1 and 1.7.10.2 release
tags), I get output like this:
================================ Output =====================================
aphrael:qa-resources.git bturner$ git rev-list
--format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1
5c1ccdec5f84aa149a4978f729fdda70769f942f
commit 5c1ccdec5f84aa149a4978f729fdda70769f942f
5c1ccdec5f84aa149a4978f729fdda70769f942f|5c1ccde|02c78bc39ac6192623bf080e3e2ac892a4f5764c|02c78bc|||
commit with unicode name

@@object_end@@
================================ End ========================================

Note that the author name, e-mail and timestamp values are all missing
(the three |'s in a row at the end).

Built from 0dbe6592ccbd1a394a69a52074e3729d546fe952, the parent of
4b340cf, and in previous versions of git (1.7.10 and earlier), I got
output like this:
================================ Output =====================================
aphrael:qa-resources.git bturner$ git rev-list
--format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1
5c1ccdec5f84aa149a4978f729fdda70769f942f
commit 5c1ccdec5f84aa149a4978f729fdda70769f942f
5c1ccdec5f84aa149a4978f729fdda70769f942f|5c1ccde|02c78bc39ac6192623bf080e3e2ac892a4f5764c|02c78bc|a|farmas@atlassian.com|1327876222
commit with unicode name

@@object_end@@
================================ End ========================================

Note that the author name, e-mail and timestamp are all present. The
"a" appears as ASCII here, but it's actually a UTF-16LE character (the
terminal on the Mac is being "helpful").

To try and verify whether the difference is a bug or a bugfix (because
I wasn't certain whether perhaps the output from earlier versions was
the result of a bug which was fixed in 1.7.10.1 and on), I compared
the git rev-list output with git cat-file -p (again, built from
master):
================================ Output =====================================
aphrael:qa-resources.git bturner$ git cat-file -p
5c1ccdec5f84aa149a4978f729fdda70769f942f
tree dd173cb70baaac07bdf405f4e3db110e7fafd180
parent 02c78bc39ac6192623bf080e3e2ac892a4f5764c
author a <farmas@atlassian.com> 1327876222 +1100
committer a <farmas@atlassian.com> 1327876222 +1100

commit with unicode name
================================ End ========================================

The git cat-file output is consistent across versions of git where I'm
seeing the rev-list issue and versions where I'm not.

I would be happy to provide a zip file containing the repository with
the commit shown in all the output above, if that will facilitate
testing/fixing the issue. Just let me know where to put it. I lack the
C chops to provide a patch myself; sorry about that.

Best regards,
Bryan Turner

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond
  2012-05-21  8:01 git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond Bryan Turner
@ 2012-05-22  4:32 ` Jeff King
  2012-05-22  4:35   ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2012-05-22  4:32 UTC (permalink / raw)
  To: Bryan Turner; +Cc: git

On Mon, May 21, 2012 at 06:01:50PM +1000, Bryan Turner wrote:

> Note that the author name, e-mail and timestamp values are all missing
> (the three |'s in a row at the end).
> [...]
> Built from 0dbe6592ccbd1a394a69a52074e3729d546fe952, the parent of
> 4b340cf, and in previous versions of git (1.7.10 and earlier), I got
> output like this:
> [...]
> Note that the author name, e-mail and timestamp are all present. The
> "a" appears as ASCII here, but it's actually a UTF-16LE character (the
> terminal on the Mac is being "helpful").

I'm not too surprised this is broken (in fact, I'm surprised it ever
really worked). UTF-16, especially representing pure ascii characters,
will have embedded NULs. Most of the code assumes that things like names
and emails are NUL-terminated and ascii-compatible (so ascii, or some
ascii-superset encoding like utf8, iso8859-1, etc). You can store a
commit message (and name) in utf-16 if you tell git that you are doing
so, but we should be re-encoding it before handling it.

> ================================ Output =====================================
> aphrael:qa-resources.git bturner$ git cat-file -p
> 5c1ccdec5f84aa149a4978f729fdda70769f942f
> tree dd173cb70baaac07bdf405f4e3db110e7fafd180
> parent 02c78bc39ac6192623bf080e3e2ac892a4f5764c
> author a <farmas@atlassian.com> 1327876222 +1100
> committer a <farmas@atlassian.com> 1327876222 +1100
> 
> commit with unicode name
> ================================ End ========================================

There's no encoding header, so having a utf-16 character there is wrong.
How did you make such a commit in the first place, though? I believe
that git-commit treats the name as a string and would terminate on a NUL
(or am I wrong in thinking that this "a" is really U+0061, and is
actually some other unicode character that _looks_ like "a", but doesn't
contain a NUL?).

Did you create it by piping straight to git-hash-object? What does
piping the above through "xxd" or "cat -A" show?

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond
  2012-05-22  4:32 ` Jeff King
@ 2012-05-22  4:35   ` Jeff King
  2012-05-22  5:13     ` Bryan Turner
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2012-05-22  4:35 UTC (permalink / raw)
  To: Bryan Turner; +Cc: git

On Tue, May 22, 2012 at 12:32:21AM -0400, Jeff King wrote:

> I'm not too surprised this is broken (in fact, I'm surprised it ever
> really worked). UTF-16, especially representing pure ascii characters,
> will have embedded NULs. Most of the code assumes that things like names
> and emails are NUL-terminated and ascii-compatible (so ascii, or some
> ascii-superset encoding like utf8, iso8859-1, etc). You can store a
> commit message (and name) in utf-16 if you tell git that you are doing
> so, but we should be re-encoding it before handling it.

Actually, I take that back. I think storing in utf-16 would probably
fail. We need to use ascii to even read the headers to get to the
encoding header to realize that it is in utf-16. So I believe we really
do only support ascii-superset encodings.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond
  2012-05-22  4:35   ` Jeff King
@ 2012-05-22  5:13     ` Bryan Turner
  2012-05-22  5:58       ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Bryan Turner @ 2012-05-22  5:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Peff,

Thanks for taking the time to respond! I appreciate it.

I went and spoke to the tester who made the original commit, and it
sounds like what he did was open .git/config with a text editor and
update the user.name directly. He then saved the file. His operating
system's code page was UTF-16LE, so I believe the file was saved in
that format. He then did the commit.

That sort of behavior is pretty far outside the scope of anything git
can realistically be expected to handle (and I told the tester that as
well). That said, I'm interested that it "suddenly" broke, and also
that cat-file still appears to not think there's any problem.

I piped the output through xxd, for cat-file and rev-list. The
cat-file output is the same in both versions of git, and looks like
this:
auri:qa-resources.git bturner$ git cat-file -p
5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd
0000000: 7472 6565 2064 6431 3733 6362 3730 6261  tree dd173cb70ba
0000010: 6161 6330 3762 6466 3430 3566 3465 3364  aac07bdf405f4e3d
0000020: 6231 3130 6537 6661 6664 3138 300a 7061  b110e7fafd180.pa
0000030: 7265 6e74 2030 3263 3738 6263 3339 6163  rent 02c78bc39ac
0000040: 3631 3932 3632 3362 6630 3830 6533 6532  6192623bf080e3e2
0000050: 6163 3839 3261 3466 3537 3634 630a 6175  ac892a4f5764c.au
0000060: 7468 6f72 2061 203c 6661 726d 6173 4061  thor a <farmas@a
0000070: 746c 6173 7369 616e 2e63 6f6d 3e20 3133  tlassian.com> 13
0000080: 3237 3837 3632 3232 202b 3131 3030 0a63  27876222 +1100.c
0000090: 6f6d 6d69 7474 6572 2061 203c 6661 726d  ommitter a <farm
00000a0: 6173 4061 746c 6173 7369 616e 2e63 6f6d  as@atlassian.com
00000b0: 3e20 3133 3237 3837 3632 3232 202b 3131  > 1327876222 +11
00000c0: 3030 0a0a 636f 6d6d 6974 2077 6974 6820  00..commit with
00000d0: 756e 6963 6f64 6520 6e61 6d65 0a         unicode name.

In 1.7.9.5, rev-list looks like this:
auri:qa-resources.git bturner$ git rev-list
--format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1
5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd
0000000: 636f 6d6d 6974 2035 6331 6363 6465 6335  commit 5c1ccdec5
0000010: 6638 3461 6131 3439 6134 3937 3866 3732  f84aa149a4978f72
0000020: 3966 6464 6137 3037 3639 6639 3432 660a  9fdda70769f942f.
0000030: 3563 3163 6364 6563 3566 3834 6161 3134  5c1ccdec5f84aa14
0000040: 3961 3439 3738 6637 3239 6664 6461 3730  9a4978f729fdda70
0000050: 3736 3966 3934 3266 7c35 6331 6363 6465  769f942f|5c1ccde
0000060: 7c30 3263 3738 6263 3339 6163 3631 3932  |02c78bc39ac6192
0000070: 3632 3362 6630 3830 6533 6532 6163 3839  623bf080e3e2ac89
0000080: 3261 3466 3537 3634 637c 3032 6337 3862  2a4f5764c|02c78b
0000090: 637c 617c 6661 726d 6173 4061 746c 6173  c|a|farmas@atlas
00000a0: 7369 616e 2e63 6f6d 7c31 3332 3738 3736  sian.com|1327876
00000b0: 3232 320a 636f 6d6d 6974 2077 6974 6820  222.commit with
00000c0: 756e 6963 6f64 6520 6e61 6d65 0a0a 4040  unicode name..@@
00000d0: 6f62 6a65 6374 5f65 6e64 4040 0a         object_end@@.

In 1.7.10.2, it looks like this:
auri:qa-resources.git bturner$ git rev-list
--format="%H|%h|%P|%p|%an|%ae|%at%n%B%n@@object_end@@" -1
5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd
0000000: 636f 6d6d 6974 2035 6331 6363 6465 6335  commit 5c1ccdec5
0000010: 6638 3461 6131 3439 6134 3937 3866 3732  f84aa149a4978f72
0000020: 3966 6464 6137 3037 3639 6639 3432 660a  9fdda70769f942f.
0000030: 3563 3163 6364 6563 3566 3834 6161 3134  5c1ccdec5f84aa14
0000040: 3961 3439 3738 6637 3239 6664 6461 3730  9a4978f729fdda70
0000050: 3736 3966 3934 3266 7c35 6331 6363 6465  769f942f|5c1ccde
0000060: 7c30 3263 3738 6263 3339 6163 3631 3932  |02c78bc39ac6192
0000070: 3632 3362 6630 3830 6533 6532 6163 3839  623bf080e3e2ac89
0000080: 3261 3466 3537 3634 637c 3032 6337 3862  2a4f5764c|02c78b
0000090: 637c 7c7c 0a63 6f6d 6d69 7420 7769 7468  c|||.commit with
00000a0: 2075 6e69 636f 6465 206e 616d 650a 0a40   unicode name..@
00000b0: 406f 626a 6563 745f 656e 6440 400a       @object_end@@.

None of the output indicates a NUL is present. I'm wondering if the
old code that rev-list was running (and maybe the code cat-file
appears to still be running) might be ignoring leading NULs until it
finds the first character, or something similar. If that were the
case, had the tester used a name like "ab" (or anything more than one
character), perhaps the commit would have caused unexpected behavior
all along.

I don't know how to dump the binary for the commit object itself; that
may give us better information on what the author segment actually
contains. Unfortunately, the repository has been packed, so there is
no loose object for the commit anymore.

If there's any other information I can provide, please let me know.

Best regards,
Bryan Turner

On 22 May 2012 14:35, Jeff King <peff@peff.net> wrote:
> On Tue, May 22, 2012 at 12:32:21AM -0400, Jeff King wrote:
>
>> I'm not too surprised this is broken (in fact, I'm surprised it ever
>> really worked). UTF-16, especially representing pure ascii characters,
>> will have embedded NULs. Most of the code assumes that things like names
>> and emails are NUL-terminated and ascii-compatible (so ascii, or some
>> ascii-superset encoding like utf8, iso8859-1, etc). You can store a
>> commit message (and name) in utf-16 if you tell git that you are doing
>> so, but we should be re-encoding it before handling it.
>
> Actually, I take that back. I think storing in utf-16 would probably
> fail. We need to use ascii to even read the headers to get to the
> encoding header to realize that it is in utf-16. So I believe we really
> do only support ascii-superset encodings.
>
> -Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond
  2012-05-22  5:13     ` Bryan Turner
@ 2012-05-22  5:58       ` Jeff King
  2012-05-22  6:13         ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2012-05-22  5:58 UTC (permalink / raw)
  To: Bryan Turner; +Cc: git

On Tue, May 22, 2012 at 03:13:09PM +1000, Bryan Turner wrote:

> I went and spoke to the tester who made the original commit, and it
> sounds like what he did was open .git/config with a text editor and
> update the user.name directly. He then saved the file. His operating
> system's code page was UTF-16LE, so I believe the file was saved in
> that format. He then did the commit.

That doesn't make sense. Git will not read a config file in utf16
(again, because of the NULs). Nor can you store NULs with "git config
user.name ...". So I'm not sure how utf16 might have made it there.

> That sort of behavior is pretty far outside the scope of anything git
> can realistically be expected to handle (and I told the tester that as
> well). That said, I'm interested that it "suddenly" broke, and also
> that cat-file still appears to not think there's any problem.

Yeah, I'm worried that it's indicative of a bug that might affect other
cases.

> I piped the output through xxd, for cat-file and rev-list. The
> cat-file output is the same in both versions of git, and looks like
> this:
> auri:qa-resources.git bturner$ git cat-file -p
> 5c1ccdec5f84aa149a4978f729fdda70769f942f | xxd
> 0000000: 7472 6565 2064 6431 3733 6362 3730 6261  tree dd173cb70ba
> 0000010: 6161 6330 3762 6466 3430 3566 3465 3364  aac07bdf405f4e3d
> 0000020: 6231 3130 6537 6661 6664 3138 300a 7061  b110e7fafd180.pa
> 0000030: 7265 6e74 2030 3263 3738 6263 3339 6163  rent 02c78bc39ac
> 0000040: 3631 3932 3632 3362 6630 3830 6533 6532  6192623bf080e3e2
> 0000050: 6163 3839 3261 3466 3537 3634 630a 6175  ac892a4f5764c.au
> 0000060: 7468 6f72 2061 203c 6661 726d 6173 4061  thor a <farmas@a
> 0000070: 746c 6173 7369 616e 2e63 6f6d 3e20 3133  tlassian.com> 13
> 0000080: 3237 3837 3632 3232 202b 3131 3030 0a63  27876222 +1100.c
> 0000090: 6f6d 6d69 7474 6572 2061 203c 6661 726d  ommitter a <farm
> 00000a0: 6173 4061 746c 6173 7369 616e 2e63 6f6d  as@atlassian.com
> 00000b0: 3e20 3133 3237 3837 3632 3232 202b 3131  > 1327876222 +11
> 00000c0: 3030 0a0a 636f 6d6d 6974 2077 6974 6820  00..commit with
> 00000d0: 756e 6963 6f64 6520 6e61 6d65 0a         unicode name.

But I don't see any unicode at all here. The author and committer names
are just ascii 'a'. However, that seems to be the real problem. If I
create a commit with a single-letter name, I see some weirdness:

  $ git init
  $ echo content >foo && git add foo
  $ GIT_AUTHOR_NAME=a git commit -m msg
   Author:  <>
   1 file changed, 1 insertion(+)
   create mode 100644 foo

Uh oh, that's odd. And worse:

  $ git log -1 --format='|%an <%ae>|'
  | <>|

But in v1.7.10:

  $ git log -1 --format='|%an <%ae>|'
  |a <peff@peff.net>|

So there is definitely a bug. The unicode thing is a red herring, and if
there was any unicode at some point, git stripped it out when making the
commit. The real regression seems to be in single-character names.

I'll see if I can find the bug.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond
  2012-05-22  5:58       ` Jeff King
@ 2012-05-22  6:13         ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2012-05-22  6:13 UTC (permalink / raw)
  To: Bryan Turner; +Cc: git

On Tue, May 22, 2012 at 01:58:11AM -0400, Jeff King wrote:

>   $ git init
>   $ echo content >foo && git add foo
>   $ GIT_AUTHOR_NAME=a git commit -m msg
>    Author:  <>
>    1 file changed, 1 insertion(+)
>    create mode 100644 foo
> 
> Uh oh, that's odd. And worse:
> 
>   $ git log -1 --format='|%an <%ae>|'
>   | <>|
> 
> So there is definitely a bug. The unicode thing is a red herring, and if
> there was any unicode at some point, git stripped it out when making the
> commit. The real regression seems to be in single-character names.
> 
> I'll see if I can find the bug.

Both are caused by an off-by-one error in split_ident_line. I just
posted a patch.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-05-22  6:13 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-21  8:01 git rev-list %an, %ae, %at bug in v1.7.10.1 and beyond Bryan Turner
2012-05-22  4:32 ` Jeff King
2012-05-22  4:35   ` Jeff King
2012-05-22  5:13     ` Bryan Turner
2012-05-22  5:58       ` Jeff King
2012-05-22  6:13         ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).