git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH] commit: Warn about encodings unsupported by iconv.
@ 2008-10-20 21:25 Alexander Gavrilov
  2008-10-21  0:39 ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Gavrilov @ 2008-10-20 21:25 UTC (permalink / raw
  To: git; +Cc: Junio C Hamano, Paul Mackerras

Currently git-commit and git-commit-tree silently allow
usage of encodings that are unknown to iconv. This may
confuse the user, who then won't be able to use automatic
encoding conversion in git-log and friends without any
immediately obvious reason. Note that the difference
between a supported and an unsupported name may be as
small as CP1251 vs CP-1251, or Shift-JIS vs ShiftJIS.

This commit adds a check for such cases, which produces
a warning similar to the one issued when a message claims
to be utf-8, but actually is not.

Signed-off-by: Alexander Gavrilov <angavrilov@gmail.com>
---

	I wonder how common such situation may actually be, and 
	whether gitk & git-gui (or core git itself) should explicitly
	provide some way to deal with it in old commits. I personally
	hit it during experimenting, and wrongly concluded that gitk
	does not support using multiple encodings in one repository.
	
	Current gitk implementation generally allows working around
	it by setting i18n.commitencoding to a valid name of the
	encoding used in the mislabeled commits. However, the
	downside is that if the selected encoding cannot represent
	some characters of an otherwise completely valid commit,
	they come out as garbage. Always using --encoding=utf-8
	from GUI and relying on conversion done by git-log should
	fix this case, but it breaks the workaround.
	
	-- Alexander

 builtin-commit-tree.c |   25 +++++++++++++++++++++++--
 1 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
index 0453425..7f325a3 100644
--- a/builtin-commit-tree.c
+++ b/builtin-commit-tree.c
@@ -45,6 +45,28 @@ static const char commit_utf8_warn[] =
 "You may want to amend it after fixing the message, or set the config\n"
 "variable i18n.commitencoding to the encoding your project uses.\n";
 
+static const char commit_bad_encoding_warn[] =
+"Warning: commit encoding '%s' is not supported.\n"
+"You may want to change the value of the i18n.commitencoding config\n"
+"variable, and redo the commit. Use 'iconv --list' to see the list of\n"
+"available encoding names.\n";
+
+static void verify_commit_encoding(const char *text, const char *encoding)
+{
+	if (is_encoding_utf8(encoding)) {
+		if (!is_utf8(text))
+			fprintf(stderr, commit_utf8_warn);
+	}
+#ifndef NO_ICONV
+	else {
+		char *conv = reencode_string("", "utf-8", encoding);
+		if (!conv)
+			fprintf(stderr, commit_bad_encoding_warn, encoding);
+		free(conv);
+	}
+#endif
+}
+
 int commit_tree(const char *msg, unsigned char *tree,
 		struct commit_list *parents, unsigned char *ret,
 		const char *author)
@@ -87,8 +109,7 @@ int commit_tree(const char *msg, unsigned char *tree,
 	strbuf_addstr(&buffer, msg);
 
 	/* And check the encoding */
-	if (encoding_is_utf8 && !is_utf8(buffer.buf))
-		fprintf(stderr, commit_utf8_warn);
+	verify_commit_encoding(buffer.buf, git_commit_encoding);
 
 	result = write_sha1_file(buffer.buf, buffer.len, commit_type, ret);
 	strbuf_release(&buffer);
-- 
1.6.0.20.g6148bc

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] commit: Warn about encodings unsupported by iconv.
  2008-10-20 21:25 [RFC PATCH] commit: Warn about encodings unsupported by iconv Alexander Gavrilov
@ 2008-10-21  0:39 ` Junio C Hamano
  2008-10-21  6:17   ` Alex Riesen
  2008-10-21 10:43   ` Alexander Gavrilov
  0 siblings, 2 replies; 4+ messages in thread
From: Junio C Hamano @ 2008-10-21  0:39 UTC (permalink / raw
  To: Alexander Gavrilov; +Cc: git, Paul Mackerras

Alexander Gavrilov <angavrilov@gmail.com> writes:

> diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
> index 0453425..7f325a3 100644
> --- a/builtin-commit-tree.c
> +++ b/builtin-commit-tree.c
> @@ -45,6 +45,28 @@ static const char commit_utf8_warn[] =
>  "You may want to amend it after fixing the message, or set the config\n"
>  "variable i18n.commitencoding to the encoding your project uses.\n";
>  
> +static const char commit_bad_encoding_warn[] =
> +"Warning: commit encoding '%s' is not supported.\n"
> +"You may want to change the value of the i18n.commitencoding config\n"
> +"variable, and redo the commit. Use 'iconv --list' to see the list of\n"
> +"available encoding names.\n";
> +
> +static void verify_commit_encoding(const char *text, const char *encoding)
> +{
> +	if (is_encoding_utf8(encoding)) {
> +		if (!is_utf8(text))
> +			fprintf(stderr, commit_utf8_warn);
> +	}
> +#ifndef NO_ICONV
> +	else {
> +		char *conv = reencode_string("", "utf-8", encoding);
> +		if (!conv)
> +			fprintf(stderr, commit_bad_encoding_warn, encoding);
> +		free(conv);
> +	}
> +#endif
> +}
> +

I think the issue you are attempting to tackle is worth addressing, but I
am not sure if this is the best approach.

Commit_tree() does not re-encode the payload.

It just marks it with the encoding header.  Wouldn't that mean that it
should be possible for you to create a commit with its message encoded in
KOI-8, and mark the resulting commit as encoded as such, on a host that is
incapable of actually transcoding from KOI-8 to utf-8?  IOW, your being
able to encode from i18n.commitencoding to utf-8 does not have much to do
with the validity of the configuration variable.

It would clarify the issues to think about what this new code would do on
a host without iconv, if you do not have the above #ifndef/#endif pair.
The replacement reencode_string() implementation always returns NULL, so
the code will always warn.

I am guessing that the reason you added #ifndef/#endif is because what the
warning message says is incorrect.

 * "is not supported" is not correct.  "is not supported HERE" may be.

 * "is not supported" (nor "is not supported HERE") does not matter.  It
   is log-reading side that does the re-encoding, not the commit
   generating side.

 * what you would really want to say is "might be incorrectly spelled",
   but your problem is that you do not have a direct way to check that.

Another reason of your "#ifndef/#endif" would be that there is no way to
squelch the warning message after seeing it on a NO_ICONV platform.

But that suggests that the "#ifndef/#endif" is not a good way to squelch
the message.  What would you do, after seeing the warning message and
examining the situation, you know KOI-8 is a valid encoding name, your
editor is recording what you type in the commit log message in KOI-8, you
know you set i18n.commitencoding to KOI-8, and you know somehow your
system is incapable of reencode_string("", "utf-8" "KOI-8")?  There is no
way to squelch the message.

So perhaps you would need some "state" variable that says "I know this
i18n.commitencoding configuration is valid" if you go this route?  But the
reason for "I know" would be either (1) because we earlier tried
reencode_string() and it resulted in an Ok return, or (2) because the user
verified that the configuration is valid, even though on this particular
system it cannot be encoded to utf-8.  IOW, the latter one would be
"because the user tells us so" --- which would be the same as trusting
i18n.commitencoding from the beginning.  I dunno.

I actually have an alternative approach to suggest.

How about adding a new i18n.commit-reencode-logmessage option (boolean),
and when it is set, we actually re-encode from i18n.commitencoding to
"utf-8" before creating the commit object (and obviously we do not store
"encoding" header in the resulting commit)?  When the conversion fails, we
know it failed, so the warning you added does make sense in that context.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] commit: Warn about encodings unsupported by iconv.
  2008-10-21  0:39 ` Junio C Hamano
@ 2008-10-21  6:17   ` Alex Riesen
  2008-10-21 10:43   ` Alexander Gavrilov
  1 sibling, 0 replies; 4+ messages in thread
From: Alex Riesen @ 2008-10-21  6:17 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Alexander Gavrilov, git, Paul Mackerras

Junio C Hamano, Tue, Oct 21, 2008 02:39:36 +0200:
> I actually have an alternative approach to suggest.
> 
> How about adding a new i18n.commit-reencode-logmessage option (boolean),
> and when it is set, we actually re-encode from i18n.commitencoding to
> "utf-8" before creating the commit object (and obviously we do not store
> "encoding" header in the resulting commit)?  When the conversion fails, we
> know it failed, so the warning you added does make sense in that context.

Maybe make the option a string, and allow to choose the target
encoding (not only utf8, but anything user wishes)?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] commit: Warn about encodings unsupported by iconv.
  2008-10-21  0:39 ` Junio C Hamano
  2008-10-21  6:17   ` Alex Riesen
@ 2008-10-21 10:43   ` Alexander Gavrilov
  1 sibling, 0 replies; 4+ messages in thread
From: Alexander Gavrilov @ 2008-10-21 10:43 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Paul Mackerras

On Tue, Oct 21, 2008 at 4:39 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Alexander Gavrilov <angavrilov@gmail.com> writes:
>
>> diff --git a/builtin-commit-tree.c b/builtin-commit-tree.c
>> index 0453425..7f325a3 100644
>> --- a/builtin-commit-tree.c
>> +++ b/builtin-commit-tree.c
>> @@ -45,6 +45,28 @@ static const char commit_utf8_warn[] =
>>  "You may want to amend it after fixing the message, or set the config\n"
>>  "variable i18n.commitencoding to the encoding your project uses.\n";
>>
>> +static const char commit_bad_encoding_warn[] =
>> +"Warning: commit encoding '%s' is not supported.\n"
>> +"You may want to change the value of the i18n.commitencoding config\n"
>> +"variable, and redo the commit. Use 'iconv --list' to see the list of\n"
>> +"available encoding names.\n";
>> +
>> +static void verify_commit_encoding(const char *text, const char *encoding)
>> +{
>> +     if (is_encoding_utf8(encoding)) {
>> +             if (!is_utf8(text))
>> +                     fprintf(stderr, commit_utf8_warn);
>> +     }
>> +#ifndef NO_ICONV
>> +     else {
>> +             char *conv = reencode_string("", "utf-8", encoding);
>> +             if (!conv)
>> +                     fprintf(stderr, commit_bad_encoding_warn, encoding);
>> +             free(conv);
>> +     }
>> +#endif
>> +}
>> +

> It just marks it with the encoding header.  Wouldn't that mean that it
> should be possible for you to create a commit with its message encoded in
> KOI-8, and mark the resulting commit as encoded as such, on a host that is
> incapable of actually transcoding from KOI-8 to utf-8?  IOW, your being
> able to encode from i18n.commitencoding to utf-8 does not have much to do
> with the validity of the configuration variable.

It may be possible to check for reencode_string("", encoding,
encoding). IConv should be able to do the identity conversion for any
valid encoding. I checked with the iconv command-line executable on my
system, and it correctly errors out on invalid names, and agrees to do
the conversion for valid ones.

> It would clarify the issues to think about what this new code would do on
> a host without iconv, if you do not have the above #ifndef/#endif pair.
> The replacement reencode_string() implementation always returns NULL, so
> the code will always warn.
>
> I am guessing that the reason you added #ifndef/#endif is because what the
> warning message says is incorrect.
>
>  * "is not supported" is not correct.  "is not supported HERE" may be.
>
>  * "is not supported" (nor "is not supported HERE") does not matter.  It
>   is log-reading side that does the re-encoding, not the commit
>   generating side.

The trouble is, by the time it gets to the "log-reading" side, it may
be too late to do anything, because the commits have already been
created, hashed and pushed out somewhere. And the originating side
won't notice because they would actually expect to get the messages
without conversion from their git-log. It may be amended somewhat if
gitk insists on receiving all data from git-log in utf-8 unless
explicitly told otherwise; on the other hand it may make people think
that the problem is in gitk...

>  * what you would really want to say is "might be incorrectly spelled",
>   but your problem is that you do not have a direct way to check that.
>
> Another reason of your "#ifndef/#endif" would be that there is no way to
> squelch the warning message after seeing it on a NO_ICONV platform.
>
> But that suggests that the "#ifndef/#endif" is not a good way to squelch
> the message.  What would you do, after seeing the warning message and
> examining the situation, you know KOI-8 is a valid encoding name, your
> editor is recording what you type in the commit log message in KOI-8, you
> know you set i18n.commitencoding to KOI-8, and you know somehow your
> system is incapable of reencode_string("", "utf-8" "KOI-8")?  There is no
> way to squelch the message.

A bit of a problem is that unless git implements its own encoding
alias table, the only authority on valid encoding names is iconv. So
checking would still involve running reencode_string, or looking at
iconv --list, possibly on another machine.

> So perhaps you would need some "state" variable that says "I know this
> i18n.commitencoding configuration is valid" if you go this route?  But the
> reason for "I know" would be either (1) because we earlier tried
> reencode_string() and it resulted in an Ok return, or (2) because the user
> verified that the configuration is valid, even though on this particular
> system it cannot be encoded to utf-8.  IOW, the latter one would be
> "because the user tells us so" --- which would be the same as trusting
> i18n.commitencoding from the beginning.  I dunno.

Maybe if the option is called something like i18n.brokenIconv, people
would understand that it should not be enabled in normal
circumstances? It can also be used to tell GUI tools that git-log
cannot be trusted to do the conversion properly -- perhaps being
enabled system-wide by make install if NO_ICONV is set.

> I actually have an alternative approach to suggest.
>
> How about adding a new i18n.commit-reencode-logmessage option (boolean),
> and when it is set, we actually re-encode from i18n.commitencoding to
> "utf-8" before creating the commit object (and obviously we do not store
> "encoding" header in the resulting commit)?  When the conversion fails, we
> know it failed, so the warning you added does make sense in that context.

This option changes the way data is actually processed before storing,
so it probably cannot be turned on by default -- while the check is
nearly useless _unless_ it is turned on by default.

Alexander

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2008-10-21 10:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-20 21:25 [RFC PATCH] commit: Warn about encodings unsupported by iconv Alexander Gavrilov
2008-10-21  0:39 ` Junio C Hamano
2008-10-21  6:17   ` Alex Riesen
2008-10-21 10:43   ` Alexander Gavrilov

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).