[PATCH] utf8.c: print warning about disabled iconv


* [PATCH] utf8.c: print warning about disabled iconv
@ 2015-06-06 21:02 Max Kirillov
  2015-06-08 16:16 ` Junio C Hamano
  2015-08-14 21:55 ` [PATCH v2] utf8.c: print warning about iconv errors Max Kirillov
  0 siblings, 2 replies; 8+ messages in thread
From: Max Kirillov @ 2015-06-06 21:02 UTC (permalink / raw)
  To: Eric Sunshine, Jeff King, Junio C Hamano; +Cc: Max Kirillov, git

It is an allowed compile-time option to build git without iconv
support. Resulting build almost always functions correctly, and
never displays that it is missing anything, but reencode_string_len()
just never modifies its input. This gives undesirable result that
returned data or even data written into repository is incorrect
and user is not aware about it.

Show a warning when a non-trivial reencoding is requested. Show it only once
per program run.

Signed-off-by: Max Kirillov <max@max630.net>
---
Hi.

I have noticed there is some activity around the option, so it's not completely
abandoned. Maybe then this patch could be useful?
 utf8.c | 13 +++++++++++++
 utf8.h |  5 ++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/utf8.c b/utf8.c
index 520fbb4..b1b836e 100644
--- a/utf8.c
+++ b/utf8.c
@@ -521,6 +521,19 @@ char *reencode_string_len(const char *in, int insz,
 	iconv_close(conv);
 	return out;
 }
+#else
+static int noiconv_warning_shown = 0;
+
+char *reencode_string_len(const char *in, int insz,
+			  const char *out_encoding, const char *in_encoding,
+			  int *outsz)
+{
+	if (!same_encoding(in_encoding, out_encoding) && !noiconv_warning_shown) {
+		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
+		noiconv_warning_shown = 1;
+	}
+	return NULL;
+}
 #endif
 
 /*
diff --git a/utf8.h b/utf8.h
index e4d9183..3d900f1 100644
--- a/utf8.h
+++ b/utf8.h
@@ -23,13 +23,12 @@ void strbuf_utf8_replace(struct strbuf *sb, int pos, int width,
 #ifndef NO_ICONV
 char *reencode_string_iconv(const char *in, size_t insz,
 			    iconv_t conv, int *outsz);
+#endif
+
 char *reencode_string_len(const char *in, int insz,
 			  const char *out_encoding,
 			  const char *in_encoding,
 			  int *outsz);
-#else
-#define reencode_string_len(a,b,c,d,e) NULL
-#endif
 
 static inline char *reencode_string(const char *in,
 				    const char *out_encoding,
-- 
2.3.4.2801.g3d0809b


* Re: [PATCH] utf8.c: print warning about disabled iconv
  2015-06-06 21:02 [PATCH] utf8.c: print warning about disabled iconv Max Kirillov
@ 2015-06-08 16:16 ` Junio C Hamano
  2015-06-08 21:07   ` Max Kirillov
  2015-08-14 21:55 ` [PATCH v2] utf8.c: print warning about iconv errors Max Kirillov
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2015-06-08 16:16 UTC (permalink / raw)
  To: Max Kirillov; +Cc: Eric Sunshine, Jeff King, git

Max Kirillov <max@max630.net> writes:

> It is an allowed compile-time option to build git without iconv
> support. Resulting build almost always functions correctly, and
> never displays that it is missing anything, but reencode_string_len()
> just never modifies its input.

Correct.

> This gives undesirable result that
> returned data or even data written into repository is incorrect
> and user is not aware about it.

I do not necessarily agree with that.  The user knows what s/he is
doing, data written to or shown from the repository is correct as
far as the user is concerned, and the user takes the full
responsibility when compiling out certain features.

> +	if (!same_encoding(in_encoding, out_encoding) && !noiconv_warning_shown) {
> +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
> +		noiconv_warning_shown = 1;
> +	}

I actually am OK if the user gets exactly the same warning between
the two cases:

 - iconv failed to convert in the real reencode_string_len()

 - we compiled out iconv() and real conversion was asked.

and this patch is about the latter; I do not think it is reasonable
to give noise only for the latter but not for the former.  The
latter is expected by users who compile out the feature, but the
former is not and deserves the warnings more, so the patch is
backwards in that sense.

Thanks.


* Re: [PATCH] utf8.c: print warning about disabled iconv
  2015-06-08 16:16 ` Junio C Hamano
@ 2015-06-08 21:07   ` Max Kirillov
  2015-06-08 21:14     ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Max Kirillov @ 2015-06-08 21:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Eric Sunshine, Jeff King, git

Hi.

On Mon, Jun 08, 2015 at 09:16:16AM -0700, Junio C Hamano wrote:
> Max Kirillov <max@max630.net> writes:

>> This gives undesirable result that returned data or even
>> data written into repository is incorrect and user is not
>> aware about it.
> 
> I do not necessarily agree with that.  The user knows what
> s/he is doing, data written to or shown from the
> repository is correct as far as the user is concerned, and
> the user takes the full responsibility when compiling out
> certain features.

User, in theory, can be not the same person who builds, or
can be not aware that the case needs recoding. It actually
started when I compiled git without iconv support and got
about 10 failed tests, and only 2 of them mentioned i18n in
their name.

Compiling out other features is not exactly the same. If
user compiles out curl, for example, git will not be able to
push or fetch through http, but it is not going to pretend
to be working, it will fail visibly.

> I actually am OK if the user gets exactly the same warning between
> the two cases:
> 
>  - iconv failed to convert in the real reencode_string_len()
> 
>  - we compiled out iconv() and real conversion was asked.

Does 'exactly the same' mean the same text? Shouldn't it
describe the reason? I can see 2 possible failures in case
of real iconv: unknown or unsupported encoding and invalid
input. Wouldn't it be better to detail them in the warning?
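
To make the two failure modes named above concrete, here is a minimal sketch that is not part of the posted patch. The enum and the helper classify_conversion() are hypothetical names; only iconv_open(), iconv() and iconv_close() are real. It assumes a glibc-style iconv: other implementations may substitute unrepresentable characters instead of failing, or declare the input pointer as const.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical classification for illustration; not part of git. */
enum conv_result {
	CONV_OK,
	CONV_UNSUPPORTED,	/* iconv_open() does not know the encoding pair */
	CONV_BAD_INPUT,		/* iconv() rejected the input bytes */
	CONV_OTHER		/* anything else, e.g. E2BIG for this small buffer */
};

static enum conv_result classify_conversion(const char *to, const char *from,
					    const char *in, size_t insz)
{
	char buf[256];
	char *inp = (char *)in;	/* glibc's iconv() takes a non-const pointer */
	char *outp = buf;
	size_t outleft = sizeof(buf);
	enum conv_result ret = CONV_OK;
	iconv_t cd = iconv_open(to, from);

	if (cd == (iconv_t)-1)
		return errno == EINVAL ? CONV_UNSUPPORTED : CONV_OTHER;

	/* Inspect errno before iconv_close(), which may overwrite it. */
	if (iconv(cd, &inp, &insz, &outp, &outleft) == (size_t)-1)
		ret = (errno == EILSEQ || errno == EINVAL) ?
			CONV_BAD_INPUT : CONV_OTHER;

	iconv_close(cd);
	return ret;
}
```

For example, classify_conversion("UTF-8", "no-such-encoding", "a", 1) is expected to report the unsupported case, while a stray 0xff byte declared as UTF-8 input is expected to report bad input.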

-- 
Max


* Re: [PATCH] utf8.c: print warning about disabled iconv
  2015-06-08 21:07   ` Max Kirillov
@ 2015-06-08 21:14     ` Junio C Hamano
  0 siblings, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2015-06-08 21:14 UTC (permalink / raw)
  To: Max Kirillov; +Cc: Eric Sunshine, Jeff King, git

Max Kirillov <max@max630.net> writes:

> User, in theory, can be not the same person who builds, or
> can be not aware that the case needs recoding.

Because you can pretty much say the same for build with iconv
enabled, I think that line of argument is futile.  The users do not
have control over which charsets/encodings the platform's iconv (and
the sysadmin's choice of locale packages) can convert to which others.

>> I actually am OK if the user gets exactly the same warning between
>> the two cases:
>> 
>>  - iconv failed to convert in the real reencode_string_len()
>> 
>>  - we compiled out iconv() and real conversion was asked.
>
> Does 'exactly the same' mean the same text?

No, I was trying to point out the total lack of corresponding
warnings in the iconv-enabled build.

After all, if you had to convert between UTF-8 and ISO-2022-JP, the
latter of which your system does not support, whether you use
iconv-disabled build of Git or iconv-enabled build of Git, we pass
the bytestream through, right?  Your patch gives warning for the
former (which is a good starting point if we want to warn "user
expected them to be converted, we didn't" case) but does not do
anything to the latter, even though users of the iconv-disabled
build are more likely to be aware of the potential issue (and are
likely to be willing to accept that) than the ones with an
iconv-enabled build that runs on a system that cannot convert the
specific encoding.
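
The pass-through behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: reencode_unavailable() and text_for_output() are hypothetical names, and the stub stands in for either an iconv-disabled build or an iconv that lacks the requested encoding. In both cases the caller sees NULL and silently falls back to the original bytes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a reencoder that cannot handle the
 * requested pair, for whatever reason. It always reports failure.
 */
static char *reencode_unavailable(const char *in, const char *out_encoding,
				  const char *in_encoding)
{
	(void)in;
	(void)out_encoding;
	(void)in_encoding;
	return NULL;
}

/*
 * Caller-side fallback: when reencoding fails, hand back a verbatim
 * copy of the input. Nothing tells the user that no conversion
 * happened. The caller owns (and frees) the returned buffer.
 */
static char *text_for_output(const char *in, const char *out_encoding,
			     const char *in_encoding)
{
	char *converted = reencode_unavailable(in, out_encoding, in_encoding);
	size_t len;
	char *copy;

	if (converted)
		return converted;
	len = strlen(in) + 1;
	copy = malloc(len);
	if (copy)
		memcpy(copy, in, len);
	return copy;
}
```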


* [PATCH v2] utf8.c: print warning about iconv errors
  2015-06-06 21:02 [PATCH] utf8.c: print warning about disabled iconv Max Kirillov
  2015-06-08 16:16 ` Junio C Hamano
@ 2015-08-14 21:55 ` Max Kirillov
  2015-08-14 22:35   ` Junio C Hamano
  1 sibling, 1 reply; 8+ messages in thread
From: Max Kirillov @ 2015-08-14 21:55 UTC (permalink / raw)
  To: Eric Sunshine, Jeff King, Junio C Hamano; +Cc: Max Kirillov, git

If reencoding text data from one encoding to another fails, the original
version is used instead. Currently there is no warning about failed reencoding,
which can have the undesired outcome that the returned data is incorrect but the
user is not aware of it.

Print a warning when conversion fails.

Also add a test script to assert that the warning is actually printed and the
output is not changed, as expected.

Signed-off-by: Max Kirillov <max@max630.net>
---
Changes since v1:
* rebase to recent changes
* add handling runtime errors
* add test
* do not limit number of warnings - does not worth complicating the code
* noticed that an incomplete utf8 sequence in input is silently treated as latin1,
  so mark the testcase as expect_failure. Actually, it's quite surprising;
  it would be nice if somebody tried it in various environments
Actually, as far as I could grep, all uses of the reencoding happen
only for printing, so probably it is not that important.
 t/t3911-show-reencode.sh | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 utf8.c                   | 24 +++++++++++++++++++++++-
 utf8.h                   |  7 ++-----
 3 files changed, 71 insertions(+), 6 deletions(-)
 create mode 100755 t/t3911-show-reencode.sh

diff --git a/t/t3911-show-reencode.sh b/t/t3911-show-reencode.sh
new file mode 100755
index 0000000..061d820
--- /dev/null
+++ b/t/t3911-show-reencode.sh
@@ -0,0 +1,46 @@
+#!/bin/sh
+
+test_description='reencoding'
+
+. ./test-lib.sh
+
+printf '\304\201\n' >a_macron_utf8
+printf '\303\244\n' >a_diaeresis_utf8
+printf '\303\244\304\n' >incomplete_utf8
+printf '\344\n' >a_diaeresis_latin1
+
+test_expect_success 'setup' '
+	git commit --allow-empty -F a_diaeresis_utf8 &&
+	git tag latin1_utf8 &&
+	git commit --allow-empty -F a_macron_utf8 &&
+	git tag extended_utf8 &&
+	git commit --allow-empty -F incomplete_utf8 &&
+	git tag invalid_utf8
+'
+
+test_expect_success 'encoding to latin1' '
+	git log --encoding=latin1 --pretty=format:%B -1 latin1_utf8 >out 2>err &&
+	test_must_be_empty err &&
+	test_cmp out a_diaeresis_latin1
+'
+
+test_expect_success 'unknown encoding' '
+	git log --encoding=no-encoding --pretty=format:%B -1 latin1_utf8 >out 2>err &&
+	grep -q "not supported" err &&
+	test_cmp out a_diaeresis_utf8
+'
+
+# apparently incomplete UTF8 byte sequences silently treated as latin1
+test_expect_failure 'incomplete utf8' '
+	git log --encoding=latin1 --pretty=format:%B -1 invalid_utf8 >out 2>err &&
+	grep -q "Invalid input" err &&
+	test_cmp out incomplete_utf8
+'
+
+test_expect_success 'does not fit into latin1' '
+	git log --encoding=latin1 --pretty=format:%B -1 extended_utf8 >out 2>err &&
+	grep -q "Invalid input" err &&
+	test_cmp out a_macron_utf8
+'
+
+test_done
diff --git a/utf8.c b/utf8.c
index 28e6d76..d284bb0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -465,7 +465,9 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs
 		if (cnt == (size_t) -1) {
 			size_t sofar;
 			if (errno != E2BIG) {
+				int failure_errno = errno;
 				free(out);
+				errno = failure_errno;
 				return NULL;
 			}
 			/* insz has remaining number of bytes.
@@ -513,14 +515,34 @@ char *reencode_string_len(const char *in, int insz,
 		if (is_encoding_utf8(out_encoding))
 			out_encoding = "UTF-8";
 		conv = iconv_open(out_encoding, in_encoding);
-		if (conv == (iconv_t) -1)
+		if (conv == (iconv_t) -1) {
+			if (errno == EINVAL)
+				warning("Conversion from %s to %s not supported, falling back to verbatim copy", in_encoding, out_encoding);
+			else
+				warning("Conversion from %s to %s failed: %s, falling back to verbatim copy", in_encoding, out_encoding, strerror(errno));
 			return NULL;
+		}
 	}
 
 	out = reencode_string_iconv(in, insz, conv, outsz);
+	if (out == NULL) {
+		if (errno == EILSEQ || errno == EINVAL)
+			warning("Invalid input for conversion from %s to %s, falling back to verbatim copy", in_encoding, out_encoding);
+		else
+			warning("Conversion from %s to %s failed: %s, falling back to verbatim copy", in_encoding, out_encoding, strerror(errno));
+	}
 	iconv_close(conv);
 	return out;
 }
+#else
+char *reencode_string_len(const char *in, int insz,
+			  const char *out_encoding, const char *in_encoding,
+			  int *outsz)
+{
+	if (!same_encoding(in_encoding, out_encoding))
+		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
+	return NULL;
+}
 #endif
 
 /*
diff --git a/utf8.h b/utf8.h
index 5a9e94b..c72998b 100644
--- a/utf8.h
+++ b/utf8.h
@@ -26,15 +26,12 @@ void strbuf_utf8_replace(struct strbuf *sb, int pos, int width,
 #ifndef NO_ICONV
 char *reencode_string_iconv(const char *in, size_t insz,
 			    iconv_t conv, int *outsz);
+#endif
+
 char *reencode_string_len(const char *in, int insz,
 			  const char *out_encoding,
 			  const char *in_encoding,
 			  int *outsz);
-#else
-static inline char *reencode_string_len(const char *a, int b,
-					const char *c, const char *d, int *e)
-{ if (e) *e = 0; return NULL; }
-#endif
 
 static inline char *reencode_string(const char *in,
 				    const char *out_encoding,
-- 
2.3.4.2801.g3d0809b


* Re: [PATCH v2] utf8.c: print warning about iconv errors
  2015-08-14 21:55 ` [PATCH v2] utf8.c: print warning about iconv errors Max Kirillov
@ 2015-08-14 22:35   ` Junio C Hamano
  2015-08-17 19:02     ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2015-08-14 22:35 UTC (permalink / raw)
  To: Max Kirillov; +Cc: Eric Sunshine, Jeff King, git

Max Kirillov <max@max630.net> writes:

> * do not limit number of warnings - does not worth complicating the code

Unless the warning leads to a quick "die()", wouldn't this make Git
unusable by spewing a "falling back to verbatim copy" for each and
every line of the message of a commit that has 'encoding' element in
its header in the "git log" output, no?

I suspect that this may be a huge mistake.

> +char *reencode_string_len(const char *in, int insz,
> +			  const char *out_encoding, const char *in_encoding,
> +			  int *outsz)
> +{
> +	if (!same_encoding(in_encoding, out_encoding))
> +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
> +	return NULL;
> +}

Hmmm, I suspect this may be seen as regression by those who build
Git without ICONV for performance, knowing that there is nothing in
their data that requires character set conversion.

We'd call same_encoding() every time, which would involve a few
strcasecmp() calls.  Originally, we didn't even have a function call
overhead.


* Re: [PATCH v2] utf8.c: print warning about iconv errors
  2015-08-14 22:35   ` Junio C Hamano
@ 2015-08-17 19:02     ` Jeff King
  2015-08-17 19:49       ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2015-08-17 19:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Max Kirillov, Eric Sunshine, git

On Fri, Aug 14, 2015 at 03:35:58PM -0700, Junio C Hamano wrote:

> Max Kirillov <max@max630.net> writes:
> 
> > * do not limit number of warnings - does not worth complicating the code
> 
> Unless the warning leads to a quick "die()", wouldn't this make Git
> unusable by spewing a "falling back to verbatim copy" for each and
> every line of the message of a commit that has 'encoding' element in
> its header in the "git log" output, no?

We only do the reencode once per commit. So it would be once per commit
rather than once per line. Which still sounds kind of annoying, if you
are using "git log --oneline" or similar.

I think I'd favor a single warning in general, along the lines of
"some encodings could not be converted". But of course if you are trying
to figure out _which_ encodings your system doesn't have, that's not
very helpful. Maybe we could have an advice.encodingFailure config flag
with a tristate:

  - false: don't spew any warnings

  - true: give a generic warning once per program

  - all: give a specific warning for each case, like "unable to convert
    EUC-JP to UTF-8: iconv_open: Invalid argument". (Sadly EINVAL is
    what iconv_open seems to return when it doesn't know about a
    particular encoding; it may be nicer to translate to something more
    reasonable than what strerror() provides).
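
The tristate above could look roughly like the sketch below. It is an assumption-laden illustration, not an existing git API: advice.encodingFailure is only the name proposed in this message, and the enum, the static flag and warn_encoding_failure() are hypothetical. Config parsing is left out; the function only decides whether to print and what.

```c
#include <stdio.h>

/* Hypothetical values mirroring the proposed false / true / all settings. */
enum encoding_advice {
	ENCODING_ADVICE_FALSE,	/* never warn */
	ENCODING_ADVICE_TRUE,	/* one generic warning per program run */
	ENCODING_ADVICE_ALL	/* a specific warning for every failure */
};

static int generic_warning_shown;

/* Returns 1 if a warning was written to `out`, 0 if it stayed silent. */
static int warn_encoding_failure(FILE *out, enum encoding_advice mode,
				 const char *from, const char *to,
				 const char *reason)
{
	switch (mode) {
	case ENCODING_ADVICE_ALL:
		fprintf(out, "warning: unable to convert %s to %s: %s\n",
			from, to, reason);
		return 1;
	case ENCODING_ADVICE_TRUE:
		if (generic_warning_shown)
			return 0;
		generic_warning_shown = 1;
		fputs("warning: some encodings could not be converted\n", out);
		return 1;
	case ENCODING_ADVICE_FALSE:
	default:
		return 0;
	}
}
```

With the "true" setting, a "git log --oneline" run over many commits would print the generic line once; with "all", it would print one line per failed commit.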

> > +char *reencode_string_len(const char *in, int insz,
> > +			  const char *out_encoding, const char *in_encoding,
> > +			  int *outsz)
> > +{
> > +	if (!same_encoding(in_encoding, out_encoding))
> > +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
> > +	return NULL;
> > +}
> 
> Hmmm, I suspect this may be seen as regression by those who build
> Git without ICONV for performance, knowing that there is nothing in
> their data that requires character set conversion.

I don't think it matters that much. The obvious tight loop is
logmsg_reencode, and it already checks same_encoding (because it really
wants to avoid reallocation in the first place if it can). So anybody
who cares about the performance of reencode_string_len would do better
to optimize out any calls to it. :)

If anything, we could make same_encoding faster by memo-izing its ptrs,
like:

diff --git a/utf8.c b/utf8.c
index 28e6d76..50a8ac0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -409,13 +409,26 @@ int is_encoding_utf8(const char *name)
 	return 0;
 }

-int same_encoding(const char *src, const char *dst)
+static int same_encoding_1(const char *src, const char *dst)
 {
+	warning("actually checking same_encoding(%s, %s)", src, dst);
 	if (is_encoding_utf8(src) && is_encoding_utf8(dst))
 		return 1;
 	return !strcasecmp(src, dst);
 }

+int same_encoding(const char *src, const char *dst)
+{
+	static const char *cached_src, *cached_dst;
+	static int cached_ret = -1;
+
+	if (src == cached_src && dst == cached_dst && cached_ret >= 0)
+		return cached_ret;
+	cached_src = src;
+	cached_dst = dst;
+	return cached_ret = same_encoding_1(src, dst);
+}
+
 /*
  * Wrapper for fprintf and returns the total number of columns required
  * for the printed string, assuming that the string is utf8.

But I couldn't measure any real speedup on "git log --oneline" from
doing so. It's also kind of gross (it will yield the wrong answer if you
write a different encoding to the same buffer; I don't think we do that,
but it's quite a gotcha).
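
The gotcha can be reproduced in isolation. The sketch below is a simplified stand-in, not the code from the diff: names_equal_nocase() and same_name_cached() are hypothetical names, and the UTF-8 alias handling of is_encoding_utf8() is left out. The cache is keyed on pointer values only, so rewriting the buffer behind an already-seen pointer returns the stale answer.

```c
#include <ctype.h>

/* Plain ASCII case-insensitive comparison; returns 1 when names match. */
static int names_equal_nocase(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;
}

/*
 * Memoized wrapper keyed on the pointers, not on the string contents.
 * If a caller reuses the same buffer for a different name, the cached
 * result is returned even though the contents changed.
 */
static int same_name_cached(const char *src, const char *dst)
{
	static const char *cached_src, *cached_dst;
	static int cached_ret = -1;

	if (src == cached_src && dst == cached_dst && cached_ret >= 0)
		return cached_ret;
	cached_src = src;
	cached_dst = dst;
	return cached_ret = names_equal_nocase(src, dst);
}
```

Calling same_name_cached(buf, "latin1") with buf holding "latin1", then overwriting buf in place and calling again with the same pointers, still reports a match, while the uncached comparison does not.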

Another approach would be to preserve NULL encodings (which we treat as
utf8) through the code base more. The common case of utf8 should be a
quick check for two NULLs, then. Unfortunately just teaching
get_commit_encoding and get_log_output_encoding to return NULL isn't
enough. Some parts of the code want to output the actual value (e.g.,
format-patch for a charset header), and would need to be adjusted. Given
that I couldn't measure any speedup, I don't think it's worth pursuing,
though.

-Peff


* Re: [PATCH v2] utf8.c: print warning about iconv errors
  2015-08-17 19:02     ` Jeff King
@ 2015-08-17 19:49       ` Junio C Hamano
  0 siblings, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2015-08-17 19:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Max Kirillov, Eric Sunshine, git

Jeff King <peff@peff.net> writes:

> On Fri, Aug 14, 2015 at 03:35:58PM -0700, Junio C Hamano wrote:
>
>> Max Kirillov <max@max630.net> writes:
>> 
>> > * do not limit number of warnings - does not worth complicating the code
>> 
>> Unless the warning leads to a quick "die()", wouldn't this make Git
>> unusable by spewing a "falling back to verbatim copy" for each and
>> every line of the message of a commit that has 'encoding' element in
>> its header in the "git log" output, no?
>
> We only do the reencode once per commit. So it would be once per commit
> rather than once per line. Which still sounds kind of annoying, if you
> are using "git log --oneline" or similar.
>
> I think I'd favor a single warning in general, along the lines of
> "some encodings could not be converted". But of course if you are trying
> to figure out _which_ encodings your system doesn't have, that's not
> very helpful. Maybe we could have an advice.encodingFailure config flag
> with a tristate:
>
>   - false: don't spew any warnings
>
>   - true: give a generic warning once per program
>
>   - all: give a specific warning for each case, like "unable to convert
>     EUC-JP to UTF-8: iconv_open: Invalid argument". (Sadly EINVAL is
>     what iconv_open seems to return when it doesn't know about a
>     particular encoding; it may be nicer to translate to something more
>     reasonable than what strerror() provides).

Sounds sensible.

>> > +char *reencode_string_len(const char *in, int insz,
>> > +			  const char *out_encoding, const char *in_encoding,
>> > +			  int *outsz)
>> > +{
>> > +	if (!same_encoding(in_encoding, out_encoding))
>> > +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
>> > +	return NULL;
>> > +}
>> 
>> Hmmm, I suspect this may be seen as regression by those who build
>> Git without ICONV for performance, knowing that there is nothing in
>> their data that requires character set conversion.
>
> I don't think it matters that much.

Yeah, I think I agree.  Thanks.
