git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Max Kirillov <max@max630.net>,
	Eric Sunshine <sunshine@sunshineco.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v2] utf8.c: print warning about iconv errors
Date: Mon, 17 Aug 2015 15:02:38 -0400
Message-ID: <20150817190238.GA3594@sigill.intra.peff.net>
In-Reply-To: <xmqqvbchfqgh.fsf@gitster.dls.corp.google.com>

On Fri, Aug 14, 2015 at 03:35:58PM -0700, Junio C Hamano wrote:

> Max Kirillov <max@max630.net> writes:
> 
> > * do not limit number of warnings - does not worth complicating the code
> 
> Unless the warning leads to a quick "die()", wouldn't this make Git
> unusable by spewing a "falling back to verbatim copy" for each and
> every line of the message of a commit that has 'encoding' element in
> its header in the "git log" output, no?

We only do the reencode once per commit, so the warning would appear
once per commit rather than once per line. That still sounds kind of
annoying if you are using "git log --oneline" or similar.

I think I'd favor a single warning in general, along the lines of
"some encodings could not be converted". But of course if you are trying
to figure out _which_ encodings your system doesn't have, that's not
very helpful. Maybe we could have an advice.encodingFailure config flag
with a tristate:

  - false: don't spew any warnings

  - true: give a generic warning once per program

  - all: give a specific warning for each case, like "unable to convert
    EUC-JP to UTF-8: iconv_open: Invalid argument". (Sadly EINVAL is
    what iconv_open seems to return when it doesn't know about a
    particular encoding; it may be nicer to translate that into
    something more reasonable than what strerror() provides).
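
To make the tristate concrete, here is a rough sketch of how it could
behave. Everything in it is hypothetical: advice.encodingFailure does not
exist, and parse_encoding_advice(), describe_iconv_open_error() and
report_encoding_failure() are names made up for illustration, not
functions in git. Treating an unset or unrecognized value like "true" is
also an assumption.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Hypothetical modes for the proposed advice.encodingFailure setting. */
enum encoding_advice {
	ENCODING_ADVICE_NONE,	/* "false": don't warn at all */
	ENCODING_ADVICE_ONCE,	/* "true": one generic warning per program */
	ENCODING_ADVICE_ALL	/* "all": one specific warning per failure */
};

static enum encoding_advice parse_encoding_advice(const char *value)
{
	if (value && !strcasecmp(value, "false"))
		return ENCODING_ADVICE_NONE;
	if (value && !strcasecmp(value, "all"))
		return ENCODING_ADVICE_ALL;
	/* Assumption: unset or unrecognized values behave like "true". */
	return ENCODING_ADVICE_ONCE;
}

/* Replace the unhelpful "Invalid argument" text for EINVAL. */
static const char *describe_iconv_open_error(int err)
{
	if (err == EINVAL)
		return "encoding not supported by iconv";
	return strerror(err);
}

/* Returns 1 if a warning was printed, 0 if it was suppressed. */
static int report_encoding_failure(enum encoding_advice mode,
				   const char *from, const char *to, int err)
{
	static int warned_once;

	switch (mode) {
	case ENCODING_ADVICE_NONE:
		return 0;
	case ENCODING_ADVICE_ALL:
		fprintf(stderr, "warning: unable to convert %s to %s: %s\n",
			from, to, describe_iconv_open_error(err));
		return 1;
	default:
		if (warned_once)
			return 0;
		warned_once = 1;
		fprintf(stderr,
			"warning: some encodings could not be converted\n");
		return 1;
	}
}
```

With mode "true", only the first call to report_encoding_failure() prints
anything; with "all", every failing pair is named.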

> > +char *reencode_string_len(const char *in, int insz,
> > +			  const char *out_encoding, const char *in_encoding,
> > +			  int *outsz)
> > +{
> > +	if (!same_encoding(in_encoding, out_encoding))
> > +		warning("Iconv support is disabled at compile time. It is likely that\nincorrect data will be printed or stored in repository.\nConsider using other build for this task.");
> > +	return NULL;
> > +}
> 
> Hmmm, I suspect this may be seen as regression by those who build
> Git without ICONV for performance, knowing that there is nothing in
> their data that requires character set conversion.

I don't think it matters that much. The obvious tight loop is
logmsg_reencode, and it already checks same_encoding (because it really
wants to avoid reallocation in the first place if it can). So anybody
who cares about the performance of reencode_string_len would do better
to optimize out any calls to it. :)
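
Roughly, the caller-side pattern looks like the sketch below. This is a
simplified stand-in, not the real logmsg_reencode: message_for_output(),
reencode_stub() and is_utf8_name() are hypothetical names, the encoding
arguments are assumed to be non-NULL, and the stub only imitates what a
build without iconv would return.

```c
#include <stddef.h>
#include <strings.h>

static int reencode_calls;	/* counts how often the slow path is reached */

static int is_utf8_name(const char *name)
{
	return !strcasecmp(name, "UTF-8") || !strcasecmp(name, "utf8");
}

static int same_encoding(const char *src, const char *dst)
{
	if (is_utf8_name(src) && is_utf8_name(dst))
		return 1;
	return !strcasecmp(src, dst);
}

/* Stand-in for a build without iconv: conversion always fails. */
static char *reencode_stub(const char *in, const char *out_encoding,
			   const char *in_encoding)
{
	(void)in;
	(void)out_encoding;
	(void)in_encoding;
	reencode_calls++;
	return NULL;
}

static const char *message_for_output(const char *msg,
				      const char *commit_encoding,
				      const char *output_encoding)
{
	char *converted;

	if (same_encoding(commit_encoding, output_encoding))
		return msg;	/* common case: no call, no allocation */
	converted = reencode_stub(msg, output_encoding, commit_encoding);
	return converted ? converted : msg;	/* verbatim fallback */
}
```

When both sides are some spelling of UTF-8, reencode_stub() is never
reached, so whatever it costs (including a warning) does not sit in the
tight loop.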

If anything, we could make same_encoding faster by memoizing on its
pointer arguments, like:

diff --git a/utf8.c b/utf8.c
index 28e6d76..50a8ac0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -409,13 +409,26 @@ int is_encoding_utf8(const char *name)
 	return 0;
 }
 
-int same_encoding(const char *src, const char *dst)
+static int same_encoding_1(const char *src, const char *dst)
 {
+	warning("actually checking same_encoding(%s, %s)", src, dst);
 	if (is_encoding_utf8(src) && is_encoding_utf8(dst))
 		return 1;
 	return !strcasecmp(src, dst);
 }
 
+int same_encoding(const char *src, const char *dst)
+{
+	static const char *cached_src, *cached_dst;
+	static int cached_ret = -1;
+
+	if (src == cached_src && dst == cached_dst && cached_ret >= 0)
+		return cached_ret;
+	cached_src = src;
+	cached_dst = dst;
+	return cached_ret = same_encoding_1(src, dst);
+}
+
 /*
  * Wrapper for fprintf and returns the total number of columns required
  * for the printed string, assuming that the string is utf8.


But I couldn't measure any real speedup on "git log --oneline" from
doing so. It's also kind of gross: the cache compares pointers, not
contents, so it will yield the wrong answer if you write a different
encoding name into the same buffer. I don't think we do that, but it's
quite a gotcha.
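
For the record, the gotcha can be reproduced outside of git with a
standalone sketch. The cached function mirrors the patch above (minus the
debugging warning); is_utf8_name() is a simplified, hypothetical stand-in
for is_encoding_utf8, and cache_goes_stale() exists only for the
demonstration.

```c
#include <string.h>
#include <strings.h>

static int is_utf8_name(const char *name)
{
	return !strcasecmp(name, "UTF-8") || !strcasecmp(name, "utf8");
}

static int same_encoding_uncached(const char *src, const char *dst)
{
	if (is_utf8_name(src) && is_utf8_name(dst))
		return 1;
	return !strcasecmp(src, dst);
}

/* Memoizes on the pointer values, not on the strings they point to. */
static int same_encoding_cached(const char *src, const char *dst)
{
	static const char *cached_src, *cached_dst;
	static int cached_ret = -1;

	if (src == cached_src && dst == cached_dst && cached_ret >= 0)
		return cached_ret;
	cached_src = src;
	cached_dst = dst;
	return cached_ret = same_encoding_uncached(src, dst);
}

/* Returns 1 if the cached answer disagrees with the real one. */
static int cache_goes_stale(void)
{
	char name[16];
	const char *out = "UTF-8";

	strcpy(name, "utf8");
	(void)same_encoding_cached(name, out);	/* caches "same" */
	strcpy(name, "ISO-8859-1");		/* same pointer, new contents */
	return same_encoding_cached(name, out) !=
	       same_encoding_uncached(name, out);
}
```

After the buffer is overwritten, the cached version still reports the two
encodings as the same, while the uncached comparison does not.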

Another approach would be to preserve NULL encodings (which we treat as
utf8) through the code base more. The common case of utf8 should be a
quick check for two NULLs, then. Unfortunately just teaching
get_commit_encoding and get_log_output_encoding to return NULL isn't
enough. Some parts of the code want to output the actual value (e.g.,
format-patch for a charset header), and would need to be adjusted. Given
that I couldn't measure any speedup, I don't think it's worth pursuing,
though.
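
As a sketch of what that would mean (hypothetical helpers, not existing
git functions, with is_utf8_name() again a simplified stand-in for
is_encoding_utf8): the comparison gets a two-NULL fast path, and callers
that need a printable name, such as a charset header, map NULL back to
"UTF-8" themselves.

```c
#include <stddef.h>
#include <strings.h>

/* NULL means "UTF-8" throughout this sketch. */
static int is_utf8_name(const char *name)
{
	return !name || !strcasecmp(name, "UTF-8") ||
	       !strcasecmp(name, "utf8");
}

static int same_encoding_nullable(const char *src, const char *dst)
{
	if (!src && !dst)
		return 1;	/* common case: two pointer checks, no strcasecmp */
	if (is_utf8_name(src) && is_utf8_name(dst))
		return 1;
	if (!src || !dst)
		return 0;	/* one side is UTF-8 (NULL), the other is not */
	return !strcasecmp(src, dst);
}

/* For callers that must print an actual name. */
static const char *encoding_name_for_header(const char *enc)
{
	return enc ? enc : "UTF-8";
}
```

Every place that currently prints the encoding would have to go through
something like encoding_name_for_header(), which is the adjustment cost
mentioned above.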

-Peff
