git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* segfault for git log --graph --no-walk --grep a
@ 2013-02-08 23:52 Thomas Haller
  2013-02-09  0:05 ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Haller @ 2013-02-08 23:52 UTC (permalink / raw)
  To: Git List

[-- Attachment #1: Type: Text/Plain, Size: 2062 bytes --]

Hallo,


I just found that git crashes with a segmentation fault when calling
$ git log --graph --no-walk --grep pattern

It happens both for version 1.7.10.4 from Debian (wheezy,amd64) and
a fresh compiled git from github (git.git repository, master).


For the error to occure, the pattern must match the commit message in HEAD.




How to reproduce:
  git init .
  git commit -m 'text1' --allow-empty
  git commit -m 'text2' --allow-empty
  git log --graph --no-walk --grep 'text2'



And here is coredump I got:

Core was generated by `git log --graph --no-walk --grep text2'.
Program terminated with signal 11, Segmentation fault.
#0  __strlen_sse42 () at ../sysdeps/x86_64/multiarch/strlen-sse4.S:32
#1  0x00000000004cc13e in commit_match (opt=0x7fffbd0ee500, commit=0x216d1a8) at revision.c:2306
#2  get_commit_action (revs=0x7fffbd0ee500, commit=0x216d1a8) at revision.c:2338
#3  0x00000000004984b4 in graph_is_interesting (commit=<optimized out>, graph=<error reading variable: Unhandled dwarf expression opcode 0xfa>) at graph.c:330
#4  0x0000000000498569 in first_interesting_parent (graph=graph@entry=0x21629c0) at graph.c:369
#5  0x000000000049965e in graph_update (graph=0x21629c0, commit=<optimized out>) at graph.c:593
#6  0x00000000004cc7a9 in get_revision (revs=revs@entry=0x7fffbd0ee500) at revision.c:2580
#7  0x000000000043988a in cmd_log_walk (rev=rev@entry=0x7fffbd0ee500) at builtin/log.c:309
#8  0x000000000043a398 in cmd_log (argc=9, argv=0x2162930, prefix=0x0) at builtin/log.c:582
#9  0x0000000000405988 in run_builtin (argv=0x2162930, argc=9, p=0x751438) at git.c:281
#10 handle_internal_command (argc=9, argv=0x2162930) at git.c:443
#11 0x0000000000404df2 in run_argv (argv=0x7fffbd0eec00, argcp=0x7fffbd0eec0c) at git.c:489
#12 main (argc=9, argv=0x2162930) at git.c:564


it happens in file revision.c:2306 because "commit->buffer" is zero:

                retval = grep_buffer(&opt->grep_filter,
                                     commit->buffer, strlen(commit->buffer));





thank you all, for this awesome software.
Thomas

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-08 23:52 segfault for git log --graph --no-walk --grep a Thomas Haller
@ 2013-02-09  0:05 ` Junio C Hamano
  2013-02-09  0:22   ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-02-09  0:05 UTC (permalink / raw)
  To: Thomas Haller; +Cc: Git List, Jeff King

Thomas Haller <thom311@gmail.com> writes:

> it happens in file revision.c:2306 because "commit->buffer" is zero:
>
>                 retval = grep_buffer(&opt->grep_filter,
>                                      commit->buffer, strlen(commit->buffer));

I think this has been fixed at be5c9fb9049e (logmsg_reencode: lazily
load missing commit buffers, 2013-01-26); I haven't merged it to any
of the maintenance releases, but the tip of 'master' should have it
already.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:05 ` Junio C Hamano
@ 2013-02-09  0:22   ` Junio C Hamano
  2013-02-09  0:27     ` Jeff King
  2013-02-09  0:29     ` segfault for git log --graph --no-walk --grep a Junio C Hamano
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2013-02-09  0:22 UTC (permalink / raw)
  To: Thomas Haller; +Cc: Git List, Jeff King

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Haller <thom311@gmail.com> writes:
>
>> it happens in file revision.c:2306 because "commit->buffer" is zero:
>>
>>                 retval = grep_buffer(&opt->grep_filter,
>>                                      commit->buffer, strlen(commit->buffer));
>
> I think this has been fixed at be5c9fb9049e (logmsg_reencode: lazily
> load missing commit buffers, 2013-01-26); I haven't merged it to any
> of the maintenance releases, but the tip of 'master' should have it
> already.

Ah, no, this shares the same roots as the breakage the said commit
fixed, and the solution should use the same idea, but not fixed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:22   ` Junio C Hamano
@ 2013-02-09  0:27     ` Jeff King
  2013-02-09  0:39       ` Junio C Hamano
  2013-02-09  0:29     ` segfault for git log --graph --no-walk --grep a Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2013-02-09  0:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Fri, Feb 08, 2013 at 04:22:15PM -0800, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Thomas Haller <thom311@gmail.com> writes:
> >
> >> it happens in file revision.c:2306 because "commit->buffer" is zero:
> >>
> >>                 retval = grep_buffer(&opt->grep_filter,
> >>                                      commit->buffer, strlen(commit->buffer));
> >
> > I think this has been fixed at be5c9fb9049e (logmsg_reencode: lazily
> > load missing commit buffers, 2013-01-26); I haven't merged it to any
> > of the maintenance releases, but the tip of 'master' should have it
> > already.
> 
> Ah, no, this shares the same roots as the breakage the said commit
> fixed, and the solution should use the same idea, but not fixed.

Yeah, I think this needs separate treatment. However, this is a perfect
example of the "case-by-case" I mention in the final two paragraphs
there.

What's the right encoding to be grepping in? The original, what we will
output in, or even something else? IOW, I wonder if this should be using
logmsg_reencode in the first place (the obvious reason not to want to do
so is speed, but logmsg_reencode is written to only have an impact in
the case that we do indeed need to reencode).

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:22   ` Junio C Hamano
  2013-02-09  0:27     ` Jeff King
@ 2013-02-09  0:29     ` Junio C Hamano
  2013-02-09  0:39       ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-02-09  0:29 UTC (permalink / raw)
  To: Thomas Haller; +Cc: Git List, Jeff King

Junio C Hamano <gitster@pobox.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> Thomas Haller <thom311@gmail.com> writes:
>>
>>> it happens in file revision.c:2306 because "commit->buffer" is zero:
>>>
>>>                 retval = grep_buffer(&opt->grep_filter,
>>>                                      commit->buffer, strlen(commit->buffer));
>>
>> I think this has been fixed at be5c9fb9049e (logmsg_reencode: lazily
>> load missing commit buffers, 2013-01-26); I haven't merged it to any
>> of the maintenance releases, but the tip of 'master' should have it
>> already.
>
> Ah, no, this shares the same roots as the breakage the said commit
> fixed, and the solution should use the same idea, but not fixed.

Perhaps something along this line...

-- >8 --
Subject: "log --grep": commit's buffer may already have been discarded

Following up on be5c9fb9049e (logmsg_reencode: lazily load missing
commit buffers, 2013-01-26), extract the part that reads the commit
buffer data into a separate helper function, and use it when we
apply the grep filter on the commit during the log walk.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 The reproduction recipe in original bug report looked like this:

  git commit -m 'text1' --allow-empty
  git commit -m 'text2' --allow-empty
  git log --graph --no-walk --grep 'text2'

 which does not make any sense to me.  We should simply forbid
 combination of --graph (which inherently wants a connected history)
 and --no-walk (which is a way to tell "This is not about history,
 this is about a single point").

 But that is a separate issue.

 commit.c   | 19 +++++++++++++++++++
 commit.h   |  1 +
 pretty.c   | 14 ++------------
 revision.c | 14 +++++++++++---
 4 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/commit.c b/commit.c
index e8eb0ae..c0acf0f 100644
--- a/commit.c
+++ b/commit.c
@@ -335,6 +335,25 @@ int parse_commit(struct commit *item)
 	return ret;
 }
 
+char *read_commit_object_data(const struct commit *commit, unsigned long *sizep)
+{
+	char *msg;
+	enum object_type type;
+	unsigned long size;
+
+	if (!sizep)
+		sizep = &size;
+
+	msg = read_sha1_file(commit->object.sha1, &type, sizep);
+	if (!msg)
+		die("Cannot read commit object %s",
+		    sha1_to_hex(commit->object.sha1));
+	if (type != OBJ_COMMIT)
+		die("Expected commit for '%s', got %s",
+		    sha1_to_hex(commit->object.sha1), typename(type));
+	return msg;
+}
+
 int find_commit_subject(const char *commit_buffer, const char **subject)
 {
 	const char *eol;
diff --git a/commit.h b/commit.h
index 4138bb4..e314149 100644
--- a/commit.h
+++ b/commit.h
@@ -102,6 +102,7 @@ struct rev_info; /* in revision.h, it circularly uses enum cmit_fmt */
 extern char *logmsg_reencode(const struct commit *commit,
 			     const char *output_encoding);
 extern void logmsg_free(char *msg, const struct commit *commit);
+extern char *read_commit_object_data(const struct commit *commit, unsigned long *size);
 extern void get_commit_format(const char *arg, struct rev_info *);
 extern const char *format_subject(struct strbuf *sb, const char *msg,
 				  const char *line_separator);
diff --git a/pretty.c b/pretty.c
index eae57ad..004d16d 100644
--- a/pretty.c
+++ b/pretty.c
@@ -592,18 +592,8 @@ char *logmsg_reencode(const struct commit *commit,
 	char *msg = commit->buffer;
 	char *out;
 
-	if (!msg) {
-		enum object_type type;
-		unsigned long size;
-
-		msg = read_sha1_file(commit->object.sha1, &type, &size);
-		if (!msg)
-			die("Cannot read commit object %s",
-			    sha1_to_hex(commit->object.sha1));
-		if (type != OBJ_COMMIT)
-			die("Expected commit for '%s', got %s",
-			    sha1_to_hex(commit->object.sha1), typename(type));
-	}
+	if (!msg)
+		msg = read_commit_object_data(commit, NULL);
 
 	if (!output_encoding || !*output_encoding)
 		return msg;
diff --git a/revision.c b/revision.c
index d7562ee..caf8ef3 100644
--- a/revision.c
+++ b/revision.c
@@ -2279,9 +2279,16 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 		strbuf_addch(&buf, '\n');
 	}
 
-	/* Copy the commit to temporary if we are using "fake" headers */
-	if (buf.len)
+	if (!commit->buffer) {
+		/* we may not have commit->buffer */
+		unsigned long size;
+		char *msg = read_commit_object_data(commit, &size);
+		strbuf_add(&buf, msg, size);
+		free(msg);
+	} else if (buf.len) {
+		/* Copy the commit to temporary if we are using "fake" headers */
 		strbuf_addstr(&buf, commit->buffer);
+	}
 
 	if (opt->grep_filter.header_list && opt->mailmap) {
 		if (!buf.len)
@@ -2302,9 +2309,10 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 	/* Find either in the commit object, or in the temporary */
 	if (buf.len)
 		retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
-	else
+	else {
 		retval = grep_buffer(&opt->grep_filter,
 				     commit->buffer, strlen(commit->buffer));
+	}
 	strbuf_release(&buf);
 	return retval;
 }

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:27     ` Jeff King
@ 2013-02-09  0:39       ` Junio C Hamano
  2013-02-09  0:47         ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-02-09  0:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Haller, Git List

Jeff King <peff@peff.net> writes:

> On Fri, Feb 08, 2013 at 04:22:15PM -0800, Junio C Hamano wrote:
>
>> Junio C Hamano <gitster@pobox.com> writes:
>> 
>> > Thomas Haller <thom311@gmail.com> writes:
>> >
>> >> it happens in file revision.c:2306 because "commit->buffer" is zero:
>> >>
>> >>                 retval = grep_buffer(&opt->grep_filter,
>> >>                                      commit->buffer, strlen(commit->buffer));
>> >
>> > I think this has been fixed at be5c9fb9049e (logmsg_reencode: lazily
>> > load missing commit buffers, 2013-01-26); I haven't merged it to any
>> > of the maintenance releases, but the tip of 'master' should have it
>> > already.
>> 
>> Ah, no, this shares the same roots as the breakage the said commit
>> fixed, and the solution should use the same idea, but not fixed.
>
> Yeah, I think this needs separate treatment. However, this is a perfect
> example of the "case-by-case" I mention in the final two paragraphs
> there.
>
> What's the right encoding to be grepping in? The original, what we will
> output in, or even something else? IOW, I wonder if this should be using
> logmsg_reencode in the first place (the obvious reason not to want to do
> so is speed, but logmsg_reencode is written to only have an impact in
> the case that we do indeed need to reencode).

Yeah, that actually is a good point.  We should be using logmsg_reencode
so that we look for strings in the user's encoding.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:29     ` segfault for git log --graph --no-walk --grep a Junio C Hamano
@ 2013-02-09  0:39       ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2013-02-09  0:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Fri, Feb 08, 2013 at 04:29:11PM -0800, Junio C Hamano wrote:

> Perhaps something along this line...
> 
> -- >8 --
> Subject: "log --grep": commit's buffer may already have been discarded
> 
> Following up on be5c9fb9049e (logmsg_reencode: lazily load missing
> commit buffers, 2013-01-26), extract the part that reads the commit
> buffer data into a separate helper function, and use it when we
> apply the grep filter on the commit during the log walk.

This obviously makes sense if we don't want to get the route of
re-encoding for grep. Re-encoding would be a user-visible change, but I
wonder if it is the right thing to be doing.

> diff --git a/revision.c b/revision.c
> index d7562ee..caf8ef3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -2279,9 +2279,16 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
>  		strbuf_addch(&buf, '\n');
>  	}
>  
> -	/* Copy the commit to temporary if we are using "fake" headers */
> -	if (buf.len)
> +	if (!commit->buffer) {
> +		/* we may not have commit->buffer */
> +		unsigned long size;
> +		char *msg = read_commit_object_data(commit, &size);
> +		strbuf_add(&buf, msg, size);
> +		free(msg);
> +	} else if (buf.len) {
> +		/* Copy the commit to temporary if we are using "fake" headers */
>  		strbuf_addstr(&buf, commit->buffer);
> +	}

Hmm. It would be nice to avoid the extra copy when we do not otherwise
need to use the strbuf. I would have expected something more like:

  const char *msg = commit->buffer;
  if (!msg)
          msg = read_commit_object_data(commit, NULL);

  [...]

  if (buf.len)
          retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
  else
          retval = grep_buffer(&opt->grep_filter, msg, strlen(msg));

  strbuf_release(&buf);
  if (msg != commit->buffer)
          free(msg);
  return retval;

You would also need to adjust the other uses of commit->buffer
throughout the function to refer to msg.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:39       ` Junio C Hamano
@ 2013-02-09  0:47         ` Junio C Hamano
  2013-02-09  1:05           ` Jeff King
  2013-02-11 19:16           ` Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2013-02-09  0:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Haller, Git List

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> On Fri, Feb 08, 2013 at 04:22:15PM -0800, Junio C Hamano wrote:
>>
>>> Junio C Hamano <gitster@pobox.com> writes:
>>> 
>>> > Thomas Haller <thom311@gmail.com> writes:
>>> >
>>> >> it happens in file revision.c:2306 because "commit->buffer" is zero:
>>> >>
>>> >>                 retval = grep_buffer(&opt->grep_filter,
>>> >>                                      commit->buffer, strlen(commit->buffer));
>>> >
>>> > I think this has been fixed at be5c9fb9049e (logmsg_reencode: lazily
>>> > load missing commit buffers, 2013-01-26); I haven't merged it to any
>>> > of the maintenance releases, but the tip of 'master' should have it
>>> > already.
>>> 
>>> Ah, no, this shares the same roots as the breakage the said commit
>>> fixed, and the solution should use the same idea, but not fixed.
>>
>> Yeah, I think this needs separate treatment. However, this is a perfect
>> example of the "case-by-case" I mention in the final two paragraphs
>> there.
>>
>> What's the right encoding to be grepping in? The original, what we will
>> output in, or even something else? IOW, I wonder if this should be using
>> logmsg_reencode in the first place (the obvious reason not to want to do
>> so is speed, but logmsg_reencode is written to only have an impact in
>> the case that we do indeed need to reencode).
>
> Yeah, that actually is a good point.  We should be using logmsg_reencode
> so that we look for strings in the user's encoding.

Perhaps like this.  Just like the previous one (which should be
discarded), this makes the function always use the temporary strbuf,
so doing this upfront actually loses more code than it adds ;-)

 revision.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/revision.c b/revision.c
index d7562ee..07bf728 100644
--- a/revision.c
+++ b/revision.c
@@ -2269,6 +2269,9 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 {
 	int retval;
 	struct strbuf buf = STRBUF_INIT;
+	char *message;
+	const char *encoding;
+
 	if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
 		return 1;
 
@@ -2279,32 +2282,24 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 		strbuf_addch(&buf, '\n');
 	}
 
-	/* Copy the commit to temporary if we are using "fake" headers */
-	if (buf.len)
-		strbuf_addstr(&buf, commit->buffer);
+	encoding = get_log_output_encoding();
+	message = logmsg_reencode(commit, encoding);
+	strbuf_addstr(&buf, message);
+	if (message != commit->buffer)
+		free(message);
 
 	if (opt->grep_filter.header_list && opt->mailmap) {
-		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
-
 		commit_rewrite_person(&buf, "\nauthor ", opt->mailmap);
 		commit_rewrite_person(&buf, "\ncommitter ", opt->mailmap);
 	}
 
 	/* Append "fake" message parts as needed */
-	if (opt->show_notes) {
-		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
+	if (opt->show_notes)
 		format_display_notes(commit->object.sha1, &buf,
-				     get_log_output_encoding(), 1);
-	}
+				     encoding, 1);
+
+	retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
 
-	/* Find either in the commit object, or in the temporary */
-	if (buf.len)
-		retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
-	else
-		retval = grep_buffer(&opt->grep_filter,
-				     commit->buffer, strlen(commit->buffer));
 	strbuf_release(&buf);
 	return retval;
 }

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:47         ` Junio C Hamano
@ 2013-02-09  1:05           ` Jeff King
  2013-02-09  1:08             ` Jeff King
  2013-02-11 19:16           ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2013-02-09  1:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:

> > Yeah, that actually is a good point.  We should be using logmsg_reencode
> > so that we look for strings in the user's encoding.
> 
> Perhaps like this.  Just like the previous one (which should be
> discarded), this makes the function always use the temporary strbuf,
> so doing this upfront actually loses more code than it adds ;-)

I like code simplification, but I worry a little about paying for the
extra copy in the common case. I did a best-of-five "git rev-list
--count --grep=foo HEAD" before and after your patch, though, and the
difference was well within the noise. So maybe it's not worth caring
about.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  1:05           ` Jeff King
@ 2013-02-09  1:08             ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2013-02-09  1:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Fri, Feb 08, 2013 at 08:05:24PM -0500, Jeff King wrote:

> On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:
> 
> > > Yeah, that actually is a good point.  We should be using logmsg_reencode
> > > so that we look for strings in the user's encoding.
> > 
> > Perhaps like this.  Just like the previous one (which should be
> > discarded), this makes the function always use the temporary strbuf,
> > so doing this upfront actually loses more code than it adds ;-)
> 
> I like code simplification, but I worry a little about paying for the
> extra copy in the common case. I did a best-of-five "git rev-list
> --count --grep=foo HEAD" before and after your patch, though, and the
> difference was well within the noise. So maybe it's not worth caring
> about.

Oh, hold on, I'm incompetent. I measured the wrong build of git. Here
are the timings for git.git:

  [before]
  $ best-of-five git rev-list --count --grep=foo HEAD
  Attempt 1: 0.503
  Attempt 2: 0.5
  Attempt 3: 0.501
  Attempt 4: 0.502
  Attempt 5: 0.5

  real    0m0.500s
  user    0m0.488s
  sys     0m0.008s

  [after]
  $ best-of-five git rev-list --count --grep=foo HEAD
  Attempt 1: 0.514
  Attempt 2: 0.525
  Attempt 3: 0.517
  Attempt 4: 0.523
  Attempt 5: 0.518

  real    0m0.514s
  user    0m0.480s
  sys     0m0.028s

So not huge, but measurable.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-09  0:47         ` Junio C Hamano
  2013-02-09  1:05           ` Jeff King
@ 2013-02-11 19:16           ` Jeff King
  2013-02-11 20:01             ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2013-02-11 19:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:

> > Yeah, that actually is a good point.  We should be using logmsg_reencode
> > so that we look for strings in the user's encoding.
> 
> Perhaps like this.  Just like the previous one (which should be
> discarded), this makes the function always use the temporary strbuf,
> so doing this upfront actually loses more code than it adds ;-)

I didn't see this in What's Cooking or pu. We should probably pick an
approach and go with it.

I think using logmsg_reencode makes sense. I'd be in favor of avoiding
the extra copy in the common case, so something like the patch below. If
you feel strongly about the code cleanup at the minor run-time expense,
I won't argue too much, though.

---
diff --git a/revision.c b/revision.c
index d7562ee..a08d0a5 100644
--- a/revision.c
+++ b/revision.c
@@ -2268,7 +2268,10 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 static int commit_match(struct commit *commit, struct rev_info *opt)
 {
 	int retval;
+	const char *encoding;
+	const char *message;
 	struct strbuf buf = STRBUF_INIT;
+
 	if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
 		return 1;
 
@@ -2279,13 +2282,23 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 		strbuf_addch(&buf, '\n');
 	}
 
+	/*
+	 * We grep in the user's output encoding, under the assumption that it
+	 * is the encoding they are most likely to write their grep pattern
+	 * for. In addition, it means we will match the "notes" encoding below,
+	 * so we will not end up with a buffer that has two different encodings
+	 * in it.
+	 */
+	encoding = get_log_output_encoding();
+	message = logmsg_reencode(commit, encoding);
+
 	/* Copy the commit to temporary if we are using "fake" headers */
 	if (buf.len)
-		strbuf_addstr(&buf, commit->buffer);
+		strbuf_addstr(&buf, message);
 
 	if (opt->grep_filter.header_list && opt->mailmap) {
 		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
+			strbuf_addstr(&buf, message);
 
 		commit_rewrite_person(&buf, "\nauthor ", opt->mailmap);
 		commit_rewrite_person(&buf, "\ncommitter ", opt->mailmap);
@@ -2294,18 +2307,18 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 	/* Append "fake" message parts as needed */
 	if (opt->show_notes) {
 		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
-		format_display_notes(commit->object.sha1, &buf,
-				     get_log_output_encoding(), 1);
+			strbuf_addstr(&buf, message);
+		format_display_notes(commit->object.sha1, &buf, encoding, 1);
 	}
 
-	/* Find either in the commit object, or in the temporary */
+	/* Find either in the original commit message, or in the temporary */
 	if (buf.len)
 		retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
 	else
 		retval = grep_buffer(&opt->grep_filter,
-				     commit->buffer, strlen(commit->buffer));
+				     message, strlen(message));
 	strbuf_release(&buf);
+	logmsg_free(message, commit);
 	return retval;
 }
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-11 19:16           ` Jeff King
@ 2013-02-11 20:01             ` Junio C Hamano
  2013-02-11 20:36               ` Junio C Hamano
  2013-02-11 20:59               ` [PATCH] log: re-encode commit messages before grepping Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2013-02-11 20:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Haller, Git List

Jeff King <peff@peff.net> writes:

> On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:
>
>> > Yeah, that actually is a good point.  We should be using logmsg_reencode
>> > so that we look for strings in the user's encoding.
>> 
>> Perhaps like this.  Just like the previous one (which should be
>> discarded), this makes the function always use the temporary strbuf,
>> so doing this upfront actually loses more code than it adds ;-)
>
> I didn't see this in What's Cooking or pu. We should probably pick an
> approach and go with it.
>
> I think using logmsg_reencode makes sense. I'd be in favor of avoiding
> the extra copy in the common case, so something like the patch below. If
> you feel strongly about the code cleanup at the minor run-time expense,
> I won't argue too much, though.

Sounds good to me.  Care to do the log message while at it?

> ---
> diff --git a/revision.c b/revision.c
> index d7562ee..a08d0a5 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -2268,7 +2268,10 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
>  static int commit_match(struct commit *commit, struct rev_info *opt)
>  {
>  	int retval;
> +	const char *encoding;
> +	const char *message;
>  	struct strbuf buf = STRBUF_INIT;
> +
>  	if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
>  		return 1;
>  
> @@ -2279,13 +2282,23 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
>  		strbuf_addch(&buf, '\n');
>  	}
>  
> +	/*
> +	 * We grep in the user's output encoding, under the assumption that it
> +	 * is the encoding they are most likely to write their grep pattern
> +	 * for. In addition, it means we will match the "notes" encoding below,
> +	 * so we will not end up with a buffer that has two different encodings
> +	 * in it.
> +	 */
> +	encoding = get_log_output_encoding();
> +	message = logmsg_reencode(commit, encoding);
> +
>  	/* Copy the commit to temporary if we are using "fake" headers */
>  	if (buf.len)
> -		strbuf_addstr(&buf, commit->buffer);
> +		strbuf_addstr(&buf, message);
>  
>  	if (opt->grep_filter.header_list && opt->mailmap) {
>  		if (!buf.len)
> -			strbuf_addstr(&buf, commit->buffer);
> +			strbuf_addstr(&buf, message);
>  
>  		commit_rewrite_person(&buf, "\nauthor ", opt->mailmap);
>  		commit_rewrite_person(&buf, "\ncommitter ", opt->mailmap);
> @@ -2294,18 +2307,18 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
>  	/* Append "fake" message parts as needed */
>  	if (opt->show_notes) {
>  		if (!buf.len)
> -			strbuf_addstr(&buf, commit->buffer);
> -		format_display_notes(commit->object.sha1, &buf,
> -				     get_log_output_encoding(), 1);
> +			strbuf_addstr(&buf, message);
> +		format_display_notes(commit->object.sha1, &buf, encoding, 1);
>  	}
>  
> -	/* Find either in the commit object, or in the temporary */
> +	/* Find either in the original commit message, or in the temporary */
>  	if (buf.len)
>  		retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
>  	else
>  		retval = grep_buffer(&opt->grep_filter,
> -				     commit->buffer, strlen(commit->buffer));
> +				     message, strlen(message));
>  	strbuf_release(&buf);
> +	logmsg_free(message, commit);
>  	return retval;
>  }
>  

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-11 20:01             ` Junio C Hamano
@ 2013-02-11 20:36               ` Junio C Hamano
  2013-02-11 20:41                 ` Jeff King
  2013-02-11 20:59               ` [PATCH] log: re-encode commit messages before grepping Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-02-11 20:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Haller, Git List

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:
>>
>>> > Yeah, that actually is a good point.  We should be using logmsg_reencode
>>> > so that we look for strings in the user's encoding.
>>> 
>>> Perhaps like this.  Just like the previous one (which should be
>>> discarded), this makes the function always use the temporary strbuf,
>>> so doing this upfront actually loses more code than it adds ;-)
>>
>> I didn't see this in What's Cooking or pu. We should probably pick an
>> approach and go with it.
>>
>> I think using logmsg_reencode makes sense. I'd be in favor of avoiding
>> the extra copy in the common case, so something like the patch below. If
>> you feel strongly about the code cleanup at the minor run-time expense,
>> I won't argue too much, though.
>
> Sounds good to me.  Care to do the log message while at it?

Heh, how about this?  I still need a sign-off from you.

Thanks.

commit 6a76814cd08cac15c1faff5bd97c9e94ac8b6a01
Author: Jeff King <peff@peff.net>
Date:   Mon Feb 11 14:16:07 2013 -0500

    log --grep: look for the given string in log output encoding
    
    We used to grep in the raw commit buffer contents, possibly pieces
    of notes encoded in log output encoding appended to it, which was
    insane.
    
    Convert the contents of the commit message also to log output
    encoding before looking for the string.  This incidentally fixes a
    possible NULL dereference that can happen when commit->buffer has
    already been freed, which can happen with
    
    	git commit -m 'text1' --allow-empty
    	git commit -m 'text2' --allow-empty
    	git log --graph --no-walk --grep 'text2'
    
    which arguably does not make any sense (--graph inherently wants a
    connected history, and by --no-walk the command line is telling us
    to show discrete points in history without connectivity), and we
    probably should forbid the combination, but that is a separate
    issue.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-11 20:36               ` Junio C Hamano
@ 2013-02-11 20:41                 ` Jeff King
  2013-02-11 20:55                   ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2013-02-11 20:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Mon, Feb 11, 2013 at 12:36:52PM -0800, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Jeff King <peff@peff.net> writes:
> >
> >> On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:
> >>
> >>> > Yeah, that actually is a good point.  We should be using logmsg_reencode
> >>> > so that we look for strings in the user's encoding.
> >>> 
> >>> Perhaps like this.  Just like the previous one (which should be
> >>> discarded), this makes the function always use the temporary strbuf,
> >>> so doing this upfront actually loses more code than it adds ;-)
> >>
> >> I didn't see this in What's Cooking or pu. We should probably pick an
> >> approach and go with it.
> >>
> >> I think using logmsg_reencode makes sense. I'd be in favor of avoiding
> >> the extra copy in the common case, so something like the patch below. If
> >> you feel strongly about the code cleanup at the minor run-time expense,
> >> I won't argue too much, though.
> >
> > Sounds good to me.  Care to do the log message while at it?
> 
> Heh, how about this?  I still need a sign-off from you.

I'm working on the log message and tests right now. There's also a minor
code fixup needed to compile with -Wall.

>     log --grep: look for the given string in log output encoding
>     
>     We used to grep in the raw commit buffer contents, possibly pieces
>     of notes encoded in log output encoding appended to it, which was
>     insane.
>     
>     Convert the contents of the commit message also to log output
>     encoding before looking for the string.  This incidentally fixes a
>     possible NULL dereference that can happen when commit->buffer has
>     already been freed, which can happen with
>     
>     	git commit -m 'text1' --allow-empty
>     	git commit -m 'text2' --allow-empty
>     	git log --graph --no-walk --grep 'text2'
>     
>     which arguably does not make any sense (--graph inherently wants a
>     connected history, and by --no-walk the command line is telling us
>     to show discrete points in history without connectivity), and we
>     probably should forbid the combination, but that is a separate
>     issue.

I'll use bits of that. I had sort of punted on the "how to reproduce the
segfault" issue entirely because you had noted that it was not a sane
thing to do. Still, I think it makes sense to mention it with the caveat
you give here.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: segfault for git log --graph --no-walk --grep a
  2013-02-11 20:41                 ` Jeff King
@ 2013-02-11 20:55                   ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2013-02-11 20:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Haller, Git List

Jeff King <peff@peff.net> writes:

> On Mon, Feb 11, 2013 at 12:36:52PM -0800, Junio C Hamano wrote:
>
>> Junio C Hamano <gitster@pobox.com> writes:
>> 
>> > Jeff King <peff@peff.net> writes:
>> >
>> >> On Fri, Feb 08, 2013 at 04:47:01PM -0800, Junio C Hamano wrote:
>> >>
>> >>> > Yeah, that actually is a good point.  We should be using logmsg_reencode
>> >>> > so that we look for strings in the user's encoding.
>> >>> 
>> >>> Perhaps like this.  Just like the previous one (which should be
>> >>> discarded), this makes the function always use the temporary strbuf,
>> >>> so doing this upfront actually loses more code than it adds ;-)
>> >>
>> >> I didn't see this in What's Cooking or pu. We should probably pick an
>> >> approach and go with it.
>> >>
>> >> I think using logmsg_reencode makes sense. I'd be in favor of avoiding
>> >> the extra copy in the common case, so something like the patch below. If
>> >> you feel strongly about the code cleanup at the minor run-time expense,
>> >> I won't argue too much, though.
>> >
>> > Sounds good to me.  Care to do the log message while at it?
>> 
>> Heh, how about this?  I still need a sign-off from you.
>
> I'm working on the log message and tests right now. There's also a minor
> code fixup needed to compile with -Wall.

Thanks; I noticed the constness issue around the message variable as well.

>
>>     log --grep: look for the given string in log output encoding
>>     
>>     We used to grep in the raw commit buffer contents, possibly pieces
>>     of notes encoded in log output encoding appended to it, which was
>>     insane.
>>     
>>     Convert the contents of the commit message also to log output
>>     encoding before looking for the string.  This incidentally fixes a
>>     possible NULL dereference that can happen when commit->buffer has
>>     already been freed, which can happen with
>>     
>>     	git commit -m 'text1' --allow-empty
>>     	git commit -m 'text2' --allow-empty
>>     	git log --graph --no-walk --grep 'text2'
>>     
>>     which arguably does not make any sense (--graph inherently wants a
>>     connected history, and by --no-walk the command line is telling us
>>     to show discrete points in history without connectivity), and we
>>     probably should forbid the combination, but that is a separate
>>     issue.
>
> I'll use bits of that. I had sort of punted on the "how to reproduce the
> segfault" issue entirely because you had noted that it was not a sane
> thing to do. Still, I think it makes sense to mention it with the caveat
> you give here.
>
> -Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] log: re-encode commit messages before grepping
  2013-02-11 20:01             ` Junio C Hamano
  2013-02-11 20:36               ` Junio C Hamano
@ 2013-02-11 20:59               ` Jeff King
  2013-02-11 21:11                 ` Junio C Hamano
  2013-02-25  8:37                 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
  1 sibling, 2 replies; 26+ messages in thread
From: Jeff King @ 2013-02-11 20:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.

Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.

As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:

        git commit -m 'text1' --allow-empty
        git commit -m 'text2' --allow-empty
        git log --graph --no-walk --grep 'text2'

which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.

Signed-off-by: Jeff King <peff@peff.net>
---
A few notes:

  1. I suppose we could also use $LANG or one of the $LC_* variables to
     guess at the encoding of the user's pattern. But I think using the
     output encoding makes the most sense, since then the pattern you
     searched for will actually be in the output.

  2. There are still problems with utf8 normalization. E.g., my tests
     represent utf-8 é with \xc3\xa9 (the code point for that glyph),
     but it could also be represented by \x65\xcc\x81 (e + combining
     acute). But that is not a new problem; it is an inherent issue with
     grepping utf8. We might in the future want to offer an option to
     normalize utf8 (or possibly the regex library can be taught to
     handle this).

  3. If the commit does need to be re-encoded, we end up doing so here,
     and then potentially again if we actually show the commit. So
     there may be some room to optimize that by stashing the re-encoded
     version somewhere (or it may not make a big difference, I haven't
     measured).

  4. I'm still not clear on why "--graph --no-walk" wants to look at
     commit_match after we have already cleared the commit buffer. I
     agree it's nonsensical, but I wonder if it might be a symptom of an
     underlying bug or inefficiency.

 revision.c          | 27 ++++++++++++++++++-------
 t/t4210-log-i18n.sh | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+), 7 deletions(-)
 create mode 100755 t/t4210-log-i18n.sh

diff --git a/revision.c b/revision.c
index d7562ee..ef60205 100644
--- a/revision.c
+++ b/revision.c
@@ -2268,7 +2268,10 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 static int commit_match(struct commit *commit, struct rev_info *opt)
 {
 	int retval;
+	const char *encoding;
+	char *message;
 	struct strbuf buf = STRBUF_INIT;
+
 	if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
 		return 1;
 
@@ -2279,13 +2282,23 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 		strbuf_addch(&buf, '\n');
 	}
 
+	/*
+	 * We grep in the user's output encoding, under the assumption that it
+	 * is the encoding they are most likely to write their grep pattern
+	 * for. In addition, it means we will match the "notes" encoding below,
+	 * so we will not end up with a buffer that has two different encodings
+	 * in it.
+	 */
+	encoding = get_log_output_encoding();
+	message = logmsg_reencode(commit, encoding);
+
 	/* Copy the commit to temporary if we are using "fake" headers */
 	if (buf.len)
-		strbuf_addstr(&buf, commit->buffer);
+		strbuf_addstr(&buf, message);
 
 	if (opt->grep_filter.header_list && opt->mailmap) {
 		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
+			strbuf_addstr(&buf, message);
 
 		commit_rewrite_person(&buf, "\nauthor ", opt->mailmap);
 		commit_rewrite_person(&buf, "\ncommitter ", opt->mailmap);
@@ -2294,18 +2307,18 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
 	/* Append "fake" message parts as needed */
 	if (opt->show_notes) {
 		if (!buf.len)
-			strbuf_addstr(&buf, commit->buffer);
-		format_display_notes(commit->object.sha1, &buf,
-				     get_log_output_encoding(), 1);
+			strbuf_addstr(&buf, message);
+		format_display_notes(commit->object.sha1, &buf, encoding, 1);
 	}
 
-	/* Find either in the commit object, or in the temporary */
+	/* Find either in the original commit message, or in the temporary */
 	if (buf.len)
 		retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
 	else
 		retval = grep_buffer(&opt->grep_filter,
-				     commit->buffer, strlen(commit->buffer));
+				     message, strlen(message));
 	strbuf_release(&buf);
+	logmsg_free(message, commit);
 	return retval;
 }
 
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
new file mode 100755
index 0000000..52a7472
--- /dev/null
+++ b/t/t4210-log-i18n.sh
@@ -0,0 +1,58 @@
+#!/bin/sh
+
+test_description='test log with i18n features'
+. ./test-lib.sh
+
+# two forms of é
+utf8_e=$(printf '\303\251')
+latin1_e=$(printf '\351')
+
+test_expect_success 'create commits in different encodings' '
+	test_tick &&
+	cat >msg <<-EOF &&
+	utf8
+
+	t${utf8_e}st
+	EOF
+	git add msg &&
+	git -c i18n.commitencoding=utf8 commit -F msg &&
+	cat >msg <<-EOF &&
+	latin1
+
+	t${latin1_e}st
+	EOF
+	git add msg &&
+	git -c i18n.commitencoding=ISO-8859-1 commit -F msg
+'
+
+test_expect_success 'log --grep searches in log output encoding (utf8)' '
+	cat >expect <<-\EOF &&
+	latin1
+	utf8
+	EOF
+	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'log --grep searches in log output encoding (latin1)' '
+	cat >expect <<-\EOF &&
+	latin1
+	utf8
+	EOF
+	git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
+	>expect &&
+	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
+	>expect &&
+	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
1.8.1.2.11.g1a2f572

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] log: re-encode commit messages before grepping
  2013-02-11 20:59               ` [PATCH] log: re-encode commit messages before grepping Jeff King
@ 2013-02-11 21:11                 ` Junio C Hamano
  2013-02-11 21:14                   ` Jeff King
  2013-02-25  8:37                 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-02-11 21:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Haller, Git List

Jeff King <peff@peff.net> writes:

>   1. I suppose we could also use $LANG or one of the $LC_* variables to
>      guess at the encoding of the user's pattern. But I think using the
>      output encoding makes the most sense, since then the pattern you
>      searched for will actually be in the output.

I agree.  In addition, if we were to do anything with LANG/LC_CTYPE,
it should be done at the layer that implements log-output-encoding
(e.g. lack of configured encoding with nonstandard LANG/LC_CTYPE
would use the locale, or something), I think.

>   2. There are still problems with utf8 normalization. E.g., my tests
>      represent utf-8 é with \xc3\xa9 (the code point for that glyph),
>      but it could also be represented by \x65\xcc\x81 (e + combining
>      acute). But that is not a new problem; it is an inherent issue with
>      grepping utf8. We might in the future want to offer an option to
>      normalize utf8 (or possibly the regex library can be taught to
>      handle this).

True; in either case, this caller (or any other callers) should
care.  Only grep_buffer() (actually, grep_source_1()) needs to be
taught about it.

>   4. I'm still not clear on why "--graph --no-walk" wants to look at
>      commit_match after we have already cleared the commit buffer. I
>      agree it's nonsensical, but I wonder if it might be a symptom of an
>      underlying bug or inefficiency.

Yeah, that may be something we may want to check, I agree.

The aded test is also nice.  Thanks.

> diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
> new file mode 100755
> index 0000000..52a7472
> --- /dev/null
> +++ b/t/t4210-log-i18n.sh
> @@ -0,0 +1,58 @@
> +#!/bin/sh
> +
> +test_description='test log with i18n features'
> +. ./test-lib.sh
> +
> +# two forms of é
> +utf8_e=$(printf '\303\251')
> +latin1_e=$(printf '\351')
> +
> +test_expect_success 'create commits in different encodings' '
> +	test_tick &&
> +	cat >msg <<-EOF &&
> +	utf8
> +
> +	t${utf8_e}st
> +	EOF
> +	git add msg &&
> +	git -c i18n.commitencoding=utf8 commit -F msg &&
> +	cat >msg <<-EOF &&
> +	latin1
> +
> +	t${latin1_e}st
> +	EOF
> +	git add msg &&
> +	git -c i18n.commitencoding=ISO-8859-1 commit -F msg
> +'
> +
> +test_expect_success 'log --grep searches in log output encoding (utf8)' '
> +	cat >expect <<-\EOF &&
> +	latin1
> +	utf8
> +	EOF
> +	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'log --grep searches in log output encoding (latin1)' '
> +	cat >expect <<-\EOF &&
> +	latin1
> +	utf8
> +	EOF
> +	git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
> +	>expect &&
> +	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
> +	>expect &&
> +	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] log: re-encode commit messages before grepping
  2013-02-11 21:11                 ` Junio C Hamano
@ 2013-02-11 21:14                   ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2013-02-11 21:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Thomas Haller, Git List

On Mon, Feb 11, 2013 at 01:11:24PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   1. I suppose we could also use $LANG or one of the $LC_* variables to
> >      guess at the encoding of the user's pattern. But I think using the
> >      output encoding makes the most sense, since then the pattern you
> >      searched for will actually be in the output.
> 
> I agree.  In addition, if we were to do anything with LANG/LC_CTYPE,
> it should be done at the layer that implements log-output-encoding
> (e.g. lack of configured encoding with nonstandard LANG/LC_CTYPE
> would use the locale, or something), I think.

Yeah, I had the same thought. I'll leave it to somebody who knows/cares
more about i18n. In my world view, utf8 is good enough for everyone. :)

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-11 20:59               ` [PATCH] log: re-encode commit messages before grepping Jeff King
  2013-02-11 21:11                 ` Junio C Hamano
@ 2013-02-25  8:37                 ` Johannes Sixt
  2013-02-25 15:19                   ` Jeff King
  2013-02-25 18:54                   ` Torsten Bögershausen
  1 sibling, 2 replies; 26+ messages in thread
From: Johannes Sixt @ 2013-02-25  8:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Thomas Haller, Git List

From: Johannes Sixt <j6t@kdbg.org>

iconv on Windows does not know the encoding name "utf8", and does not
re-encode log messages when this name is given. Request "UTF-8" encoding.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 I'm not sure whether I'm right to say that "UTF-8" is the correct
 spelling. Anyway, 'iconv -l' on my old Linux box lists "UTF8", but on
 Windows it does not.

 A more correct fix would probably be to use is_encoding_utf8() in more
 places, but it's outside my time budget look after it.

 -- Hannes

 t/t4210-log-i18n.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 52a7472..b1956e2 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -15,7 +15,7 @@ test_expect_success 'create commits in different encodings' '
 	t${utf8_e}st
 	EOF
 	git add msg &&
-	git -c i18n.commitencoding=utf8 commit -F msg &&
+	git -c i18n.commitencoding=UTF-8 commit -F msg &&
 	cat >msg <<-EOF &&
 	latin1
 
@@ -30,7 +30,7 @@ test_expect_success 'log --grep searches in log output encoding (utf8)' '
 	latin1
 	utf8
 	EOF
-	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
+	git log --encoding=UTF-8 --format=%s --grep=$utf8_e >actual &&
 	test_cmp expect actual
 '
 
@@ -45,7 +45,7 @@ test_expect_success 'log --grep searches in log output encoding (latin1)' '
 
 test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
 	>expect &&
-	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
+	git log --encoding=UTF-8 --format=%s --grep=$latin1_e >actual &&
 	test_cmp expect actual
 '
 
-- 
"Atomic objects are neither active nor radioactive." --
Programming Languages -- C++, Final Committee Draft (Doc.N3092)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25  8:37                 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
@ 2013-02-25 15:19                   ` Jeff King
  2013-02-25 19:06                     ` Junio C Hamano
  2013-02-25 21:00                     ` Torsten Bögershausen
  2013-02-25 18:54                   ` Torsten Bögershausen
  1 sibling, 2 replies; 26+ messages in thread
From: Jeff King @ 2013-02-25 15:19 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Junio C Hamano, Thomas Haller, Git List

On Mon, Feb 25, 2013 at 09:37:50AM +0100, Johannes Sixt wrote:

> From: Johannes Sixt <j6t@kdbg.org>
> 
> iconv on Windows does not know the encoding name "utf8", and does not
> re-encode log messages when this name is given. Request "UTF-8" encoding.
> 
> Signed-off-by: Johannes Sixt <j6t@kdbg.org>
> ---
>  I'm not sure whether I'm right to say that "UTF-8" is the correct
>  spelling. Anyway, 'iconv -l' on my old Linux box lists "UTF8", but on
>  Windows it does not.

UTF-8 is correct according to:

  https://en.wikipedia.org/wiki/Utf8#Official_name_and_variants

>  A more correct fix would probably be to use is_encoding_utf8() in more
>  places, but it's outside my time budget look after it.

Yeah, I wonder if this is a symptom of a deeper issue, which is that
utf-8 has many synonyms, and we would prefer to canonicalize the
encoding name before generating an object to avoid inconsistencies (of
course we cannot do so for every imaginable encoding, but utf-8 is a
pretty obvious one we handle already). We _should_ be generating commits
with no encoding header at all for utf-8, though.

And indeed, it looks like that is the case. commit_tree_extended has:

    /* Not having i18n.commitencoding is the same as having utf-8 */
    encoding_is_utf8 = is_encoding_utf8(git_commit_encoding);

    [...]

    if (!encoding_is_utf8)
            strbuf_addf(&buffer, "encoding %s\n", git_commit_encoding);


which makes me think that this first hunk...

> diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
> index 52a7472..b1956e2 100755
> --- a/t/t4210-log-i18n.sh
> +++ b/t/t4210-log-i18n.sh
> @@ -15,7 +15,7 @@ test_expect_success 'create commits in different encodings' '
>  	t${utf8_e}st
>  	EOF
>  	git add msg &&
> -	git -c i18n.commitencoding=utf8 commit -F msg &&
> +	git -c i18n.commitencoding=UTF-8 commit -F msg &&
>  	cat >msg <<-EOF &&
>  	latin1

...should be a no-op; the utf8 there should never be seen by anybody but
git. Can you confirm that is the case?

> @@ -30,7 +30,7 @@ test_expect_success 'log --grep searches in log output encoding (utf8)' '
>  	latin1
>  	utf8
>  	EOF
> -	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
> +	git log --encoding=UTF-8 --format=%s --grep=$utf8_e >actual &&
>  	test_cmp expect actual
>  '

This one will feed it to iconv, though, because the latin1 commit will
need to be re-encoded. I think the simplest thing would just be:

diff --git a/utf8.c b/utf8.c
index 1087870..8d42b50 100644
--- a/utf8.c
+++ b/utf8.c
@@ -507,6 +507,17 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 
 	if (!in_encoding)
 		return NULL;
+
+	/*
+	 * Some platforms do not have the variously spelled variants of
+	 * UTF-8, so let us feed iconv the most official spelling, which
+	 * should hopefully be accepted everywhere.
+	 */
+	if (is_encoding_utf8(in_encoding))
+		in_encoding = "UTF-8";
+	if (is_encoding_utf8(out_encoding))
+		out_encoding = "UTF-8";
+
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1)
 		return NULL;

Does that fix the tests for you? It's a larger change, but I think it
makes git friendlier all around for people on Windows.

-Peff

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25  8:37                 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
  2013-02-25 15:19                   ` Jeff King
@ 2013-02-25 18:54                   ` Torsten Bögershausen
  2013-02-25 20:36                     ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Torsten Bögershausen @ 2013-02-25 18:54 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Jeff King, Junio C Hamano, Thomas Haller, Git List,
	Torsten Bögershausen

On 25.02.13 09:37, Johannes Sixt wrote:
> From: Johannes Sixt <j6t@kdbg.org>
> 
> iconv on Windows does not know the encoding name "utf8", and does not
> re-encode log messages when this name is given. Request "UTF-8" encoding.
> 
> Signed-off-by: Johannes Sixt <j6t@kdbg.org>
> ---
>  I'm not sure whether I'm right to say that "UTF-8" is the correct
>  spelling. Anyway, 'iconv -l' on my old Linux box lists "UTF8", but on
>  Windows it does not.
> 
>  A more correct fix would probably be to use is_encoding_utf8() in more
>  places, but it's outside my time budget look after it.
> 
>  -- Hannes
> 
>  t/t4210-log-i18n.sh | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
> index 52a7472..b1956e2 100755
> --- a/t/t4210-log-i18n.sh
> +++ b/t/t4210-log-i18n.sh
> @@ -15,7 +15,7 @@ test_expect_success 'create commits in different encodings' '
>  	t${utf8_e}st
>  	EOF
>  	git add msg &&
> -	git -c i18n.commitencoding=utf8 commit -F msg &&
> +	git -c i18n.commitencoding=UTF-8 commit -F msg &&
>  	cat >msg <<-EOF &&
>  	latin1
>  
> @@ -30,7 +30,7 @@ test_expect_success 'log --grep searches in log output encoding (utf8)' '
>  	latin1
>  	utf8
>  	EOF
> -	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
> +	git log --encoding=UTF-8 --format=%s --grep=$utf8_e >actual &&
>  	test_cmp expect actual
>  '
>  
> @@ -45,7 +45,7 @@ test_expect_success 'log --grep searches in log output encoding (latin1)' '
>  
>  test_expect_success 'log --grep does not find non-reencoded values (utf8)' '
>  	>expect &&
> -	git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
> +	git log --encoding=UTF-8 --format=%s --grep=$latin1_e >actual &&
>  	test_cmp expect actual
>  '
>  
> 
Hej,
(beside that I couldn't find t4210 somewhere),
is it something like the following you are tinking of?

(Not sure if my cut-and-paste stuff applies, its's rather for review)

-- >8 --
[PATCH] iconv_open(): Use UTF-8 if UTF8 failes

When iconv_open() failes with EINVAL, it may be that "UTF-8"
is spelled wrong by mistake.
For example, "UTF8" is used instead of "UTF-8".
Some iconv implementations tolerate "UTF8" or "utf8".
If not, iconv_open() fails.
If is_encoding_utf8() is true change the string to the
offical string "UTF-8" with uppercase letters.

Reported-By: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 utf8.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/utf8.c b/utf8.c
index a4ee665..e9850d0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -487,6 +487,10 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 	if (!in_encoding)
 		return NULL;
 	conv = iconv_open(out_encoding, in_encoding);
+	if (conv == (iconv_t) -1 && errno == EINVAL) {
+		conv = iconv_open(is_encoding_utf8(out_encoding) ? "UTF-8" : out_encoding,
+											is_encoding_utf8(in_encoding) ? "UTF-8" : in_encoding);
+	}
 	if (conv == (iconv_t) -1)
 		return NULL;
 	out = reencode_string_iconv(in, strlen(in), conv);
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25 15:19                   ` Jeff King
@ 2013-02-25 19:06                     ` Junio C Hamano
  2013-02-25 20:31                       ` Jeff King
  2013-02-25 21:00                     ` Torsten Bögershausen
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-02-25 19:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Sixt, Thomas Haller, Git List

Jeff King <peff@peff.net> writes:

> ... I think the simplest thing would just be:
>
> diff --git a/utf8.c b/utf8.c
> index 1087870..8d42b50 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -507,6 +507,17 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
>  
>  	if (!in_encoding)
>  		return NULL;
> +
> +	/*
> +	 * Some platforms do not have the variously spelled variants of
> +	 * UTF-8, so let us feed iconv the most official spelling, which
> +	 * should hopefully be accepted everywhere.
> +	 */
> +	if (is_encoding_utf8(in_encoding))
> +		in_encoding = "UTF-8";
> +	if (is_encoding_utf8(out_encoding))
> +		out_encoding = "UTF-8";
> +
>  	conv = iconv_open(out_encoding, in_encoding);
>  	if (conv == (iconv_t) -1)
>  		return NULL;
>
> Does that fix the tests for you? It's a larger change, but I think it
> makes git friendlier all around for people on Windows.

Yeah, if this is confirmed to work OK (from eyeballing I do not see
a reason why not...) I agree this is the cleanest way forward.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25 19:06                     ` Junio C Hamano
@ 2013-02-25 20:31                       ` Jeff King
  2013-02-26  6:47                         ` Johannes Sixt
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2013-02-25 20:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Sixt, Thomas Haller, Git List

On Mon, Feb 25, 2013 at 11:06:37AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > ... I think the simplest thing would just be:
> >
> > diff --git a/utf8.c b/utf8.c
> > index 1087870..8d42b50 100644
> > --- a/utf8.c
> > +++ b/utf8.c
> > @@ -507,6 +507,17 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
> >  
> >  	if (!in_encoding)
> >  		return NULL;
> > +
> > +	/*
> > +	 * Some platforms do not have the variously spelled variants of
> > +	 * UTF-8, so let us feed iconv the most official spelling, which
> > +	 * should hopefully be accepted everywhere.
> > +	 */
> > +	if (is_encoding_utf8(in_encoding))
> > +		in_encoding = "UTF-8";
> > +	if (is_encoding_utf8(out_encoding))
> > +		out_encoding = "UTF-8";
> > +
> >  	conv = iconv_open(out_encoding, in_encoding);
> >  	if (conv == (iconv_t) -1)
> >  		return NULL;
> >
> > Does that fix the tests for you? It's a larger change, but I think it
> > makes git friendlier all around for people on Windows.
> 
> Yeah, if this is confirmed to work OK (from eyeballing I do not see
> a reason why not...) I agree this is the cleanest way forward.

The only reason I can think of is that you specify "utf8", your platform
understands "utf8" but not "UTF-8", and we rewrite it silently to
"UTF-8". That seems somewhat unlikely, but to be on the safe side, why
don't we do it as a fallback? That should be fine performance-wise, as
it only triggers on the error case.

Like this:

-- >8 --
Subject: [PATCH] utf8: accept alternate spellings of UTF-8

The iconv implementation on many platforms will accept
variants of UTF-8, including "UTF8", "utf-8", and "utf8",
but some do not. We make allowances in our code to treat
them all identically, but we sometimes hand the string from
the user directly to iconv. In this case, the platform iconv
may or may not work.

There are really four levels of platform iconv support for
these synonyms:

  1. All synonyms understood (e.g., glibc).

  2. Only the official "UTF-8" understood (e.g., Windows).

  3. Official "UTF-8" not understood, but some other synonym
     understood (it's not known whether such a platform exists).

  4. Neither "UTF-8" nor any synonym understood (e.g.,
     ancient systems, or ones without utf8 support
     installed).

This patch teaches git to fall back to using the official
"UTF-8" spelling when iconv_open fails (and the encoding was
one of the synonym spellings). This makes things more
convenient to users of type 2 systems, as they can now use
any of the synonyms for the log output encoding.

Type 1 systems are not affected, as iconv already works on
the first try.

Type 4 systems are not affected, as both attempts already
fail.

Type 3 systems will not benefit from the feature, but
because we only use "UTF-8" as a fallback, they will not be
regressed (i.e., you can continue to use "utf8" if your
platform supports it). We could try all the various
synonyms, but since such systems are not even known to
exist, it's not worth the effort.

Signed-off-by: Jeff King <peff@peff.net>
---
JSixt, can you double-check that this passes t4210 for you?

 utf8.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/utf8.c b/utf8.c
index 1087870..8f6e84b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -507,9 +507,25 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 
 	if (!in_encoding)
 		return NULL;
+
 	conv = iconv_open(out_encoding, in_encoding);
-	if (conv == (iconv_t) -1)
-		return NULL;
+	if (conv == (iconv_t) -1) {
+		/*
+		 * Some platforms do not have the variously spelled variants of
+		 * UTF-8, so let's fall back to trying the most official
+		 * spelling. We do so only as a fallback in case the platform
+		 * does understand the user's spelling, but not our official
+		 * one.
+		 */
+		if (is_encoding_utf8(in_encoding))
+			in_encoding = "UTF-8";
+		if (is_encoding_utf8(out_encoding))
+			out_encoding = "UTF-8";
+		conv = iconv_open(out_encoding, in_encoding);
+		if (conv == (iconv_t) -1)
+			return NULL;
+	}
+
 	out = reencode_string_iconv(in, strlen(in), conv);
 	iconv_close(conv);
 	return out;
-- 
1.8.1.4.4.g265d2fa

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25 18:54                   ` Torsten Bögershausen
@ 2013-02-25 20:36                     ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2013-02-25 20:36 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Johannes Sixt, Junio C Hamano, Thomas Haller, Git List

On Mon, Feb 25, 2013 at 07:54:47PM +0100, Torsten Bögershausen wrote:

> (beside that I couldn't find t4210 somewhere),

It's newly added in 04deccd.

> diff --git a/utf8.c b/utf8.c
> index a4ee665..e9850d0 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -487,6 +487,10 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
>  	if (!in_encoding)
>  		return NULL;
>  	conv = iconv_open(out_encoding, in_encoding);
> +	if (conv == (iconv_t) -1 && errno == EINVAL) {
> +		conv = iconv_open(is_encoding_utf8(out_encoding) ? "UTF-8" : out_encoding,
> +											is_encoding_utf8(in_encoding) ? "UTF-8" : in_encoding);
> +	}

Yeah, I think this is pretty close to the patch I just sent. You do the
fallback, which I think is sane to prevent any possible regression (for
"type 3" systems, as I called them in my commit). I don't know if it is
worth checking errno; we should get the same failure either way, and I'd
worry slightly about some odd iconv implementation does not use EINVAL.

And of course, I like my indentation and commit message better. :)

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25 15:19                   ` Jeff King
  2013-02-25 19:06                     ` Junio C Hamano
@ 2013-02-25 21:00                     ` Torsten Bögershausen
  1 sibling, 0 replies; 26+ messages in thread
From: Torsten Bögershausen @ 2013-02-25 21:00 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Sixt, Junio C Hamano, Thomas Haller, Git List

On 25.02.13 16:19, Jeff King wrote:
> On Mon, Feb 25, 2013 at 09:37:50AM +0100, Johannes Sixt wrote:
> 
>> From: Johannes Sixt <j6t@kdbg.org>
>>
>> iconv on Windows does not know the encoding name "utf8", and does not
>> re-encode log messages when this name is given. Request "UTF-8" encoding.
>>
>> Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>> ---
>>  I'm not sure whether I'm right to say that "UTF-8" is the correct
>>  spelling. Anyway, 'iconv -l' on my old Linux box lists "UTF8", but on
>>  Windows it does not.
> 
> UTF-8 is correct according to:
> 
>   https://en.wikipedia.org/wiki/Utf8#Official_name_and_variants
> 
>>  A more correct fix would probably be to use is_encoding_utf8() in more
>>  places, but it's outside my time budget look after it.
> 
> Yeah, I wonder if this is a symptom of a deeper issue, which is that
> utf-8 has many synonyms, and we would prefer to canonicalize the
> encoding name before generating an object to avoid inconsistencies (of
> course we cannot do so for every imaginable encoding, but utf-8 is a
> pretty obvious one we handle already). We _should_ be generating commits
> with no encoding header at all for utf-8, though.
> 
> And indeed, it looks like that is the case. commit_tree_extended has:
> 
>     /* Not having i18n.commitencoding is the same as having utf-8 */
>     encoding_is_utf8 = is_encoding_utf8(git_commit_encoding);
> 
>     [...]
> 
>     if (!encoding_is_utf8)
>             strbuf_addf(&buffer, "encoding %s\n", git_commit_encoding);
> 
> 
> which makes me think that this first hunk...
> 
>> diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
>> index 52a7472..b1956e2 100755
>> --- a/t/t4210-log-i18n.sh
>> +++ b/t/t4210-log-i18n.sh
>> @@ -15,7 +15,7 @@ test_expect_success 'create commits in different encodings' '
>>  	t${utf8_e}st
>>  	EOF
>>  	git add msg &&
>> -	git -c i18n.commitencoding=utf8 commit -F msg &&
>> +	git -c i18n.commitencoding=UTF-8 commit -F msg &&
>>  	cat >msg <<-EOF &&
>>  	latin1
> 
> ...should be a no-op; the utf8 there should never be seen by anybody but
> git. Can you confirm that is the case?
> 
>> @@ -30,7 +30,7 @@ test_expect_success 'log --grep searches in log output encoding (utf8)' '
>>  	latin1
>>  	utf8
>>  	EOF
>> -	git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
>> +	git log --encoding=UTF-8 --format=%s --grep=$utf8_e >actual &&
>>  	test_cmp expect actual
>>  '
> 
> This one will feed it to iconv, though, because the latin1 commit will
> need to be re-encoded. I think the simplest thing would just be:
> 
> diff --git a/utf8.c b/utf8.c
> index 1087870..8d42b50 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -507,6 +507,17 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
>  
>  	if (!in_encoding)
>  		return NULL;
> +
> +	/*
> +	 * Some platforms do not have the variously spelled variants of
> +	 * UTF-8, so let us feed iconv the most official spelling, which
> +	 * should hopefully be accepted everywhere.
> +	 */
> +	if (is_encoding_utf8(in_encoding))
> +		in_encoding = "UTF-8";
> +	if (is_encoding_utf8(out_encoding))
> +		out_encoding = "UTF-8";
> +
>  	conv = iconv_open(out_encoding, in_encoding);
>  	if (conv == (iconv_t) -1)
>  		return NULL;
> 
> Does that fix the tests for you? It's a larger change, but I think it
> makes git friendlier all around for people on Windows.
> 
> -Peff
> --
 
Thanks, I'm OK with your version.

And a test on cygwin was OK for the new t4210.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
  2013-02-25 20:31                       ` Jeff King
@ 2013-02-26  6:47                         ` Johannes Sixt
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Sixt @ 2013-02-26  6:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Thomas Haller, Git List

Am 2/25/2013 21:31, schrieb Jeff King:
> Subject: [PATCH] utf8: accept alternate spellings of UTF-8
> ...
> JSixt, can you double-check that this passes t4210 for you?

Yes, t4210 and all other *i18n* and *log* tests pass with this patch.

Thanks,
-- Hannes

> 
>  utf8.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/utf8.c b/utf8.c
> index 1087870..8f6e84b 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -507,9 +507,25 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
>  
>  	if (!in_encoding)
>  		return NULL;
> +
>  	conv = iconv_open(out_encoding, in_encoding);
> -	if (conv == (iconv_t) -1)
> -		return NULL;
> +	if (conv == (iconv_t) -1) {
> +		/*
> +		 * Some platforms do not have the variously spelled variants of
> +		 * UTF-8, so let's fall back to trying the most official
> +		 * spelling. We do so only as a fallback in case the platform
> +		 * does understand the user's spelling, but not our official
> +		 * one.
> +		 */
> +		if (is_encoding_utf8(in_encoding))
> +			in_encoding = "UTF-8";
> +		if (is_encoding_utf8(out_encoding))
> +			out_encoding = "UTF-8";
> +		conv = iconv_open(out_encoding, in_encoding);
> +		if (conv == (iconv_t) -1)
> +			return NULL;
> +	}
> +
>  	out = reencode_string_iconv(in, strlen(in), conv);
>  	iconv_close(conv);
>  	return out;

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-02-26  6:48 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-08 23:52 segfault for git log --graph --no-walk --grep a Thomas Haller
2013-02-09  0:05 ` Junio C Hamano
2013-02-09  0:22   ` Junio C Hamano
2013-02-09  0:27     ` Jeff King
2013-02-09  0:39       ` Junio C Hamano
2013-02-09  0:47         ` Junio C Hamano
2013-02-09  1:05           ` Jeff King
2013-02-09  1:08             ` Jeff King
2013-02-11 19:16           ` Jeff King
2013-02-11 20:01             ` Junio C Hamano
2013-02-11 20:36               ` Junio C Hamano
2013-02-11 20:41                 ` Jeff King
2013-02-11 20:55                   ` Junio C Hamano
2013-02-11 20:59               ` [PATCH] log: re-encode commit messages before grepping Jeff King
2013-02-11 21:11                 ` Junio C Hamano
2013-02-11 21:14                   ` Jeff King
2013-02-25  8:37                 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
2013-02-25 15:19                   ` Jeff King
2013-02-25 19:06                     ` Junio C Hamano
2013-02-25 20:31                       ` Jeff King
2013-02-26  6:47                         ` Johannes Sixt
2013-02-25 21:00                     ` Torsten Bögershausen
2013-02-25 18:54                   ` Torsten Bögershausen
2013-02-25 20:36                     ` Jeff King
2013-02-09  0:29     ` segfault for git log --graph --no-walk --grep a Junio C Hamano
2013-02-09  0:39       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).