git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Johannes Sixt <j.sixt@viscovery.net>,
	Thomas Haller <thom311@gmail.com>, Git List <git@vger.kernel.org>
Subject: Re: [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly
Date: Mon, 25 Feb 2013 15:31:00 -0500	[thread overview]
Message-ID: <20130225203100.GA15932@sigill.intra.peff.net> (raw)
In-Reply-To: <7vmwusxlfm.fsf@alter.siamese.dyndns.org>

On Mon, Feb 25, 2013 at 11:06:37AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > ... I think the simplest thing would just be:
> >
> > diff --git a/utf8.c b/utf8.c
> > index 1087870..8d42b50 100644
> > --- a/utf8.c
> > +++ b/utf8.c
> > @@ -507,6 +507,17 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
> >  
> >  	if (!in_encoding)
> >  		return NULL;
> > +
> > +	/*
> > +	 * Some platforms do not have the variously spelled variants of
> > +	 * UTF-8, so let us feed iconv the most official spelling, which
> > +	 * should hopefully be accepted everywhere.
> > +	 */
> > +	if (is_encoding_utf8(in_encoding))
> > +		in_encoding = "UTF-8";
> > +	if (is_encoding_utf8(out_encoding))
> > +		out_encoding = "UTF-8";
> > +
> >  	conv = iconv_open(out_encoding, in_encoding);
> >  	if (conv == (iconv_t) -1)
> >  		return NULL;
> >
> > Does that fix the tests for you? It's a larger change, but I think it
> > makes git friendlier all around for people on Windows.
> 
> Yeah, if this is confirmed to work OK (from eyeballing I do not see
> a reason why not...) I agree this is the cleanest way forward.

The only reason I can think of is that you specify "utf8", your platform
understands "utf8" but not "UTF-8", and we rewrite it silently to
"UTF-8". That seems somewhat unlikely, but to be on the safe side, why
don't we do it as a fallback? That should be fine performance-wise, as
it only triggers on the error case.

Like this:

-- >8 --
Subject: [PATCH] utf8: accept alternate spellings of UTF-8

The iconv implementation on many platforms will accept
variants of UTF-8, including "UTF8", "utf-8", and "utf8",
but some do not. We make allowances in our code to treat
them all identically, but we sometimes hand the string from
the user directly to iconv. In this case, the platform iconv
may or may not work.

There are really four levels of platform iconv support for
these synonyms:

  1. All synonyms understood (e.g., glibc).

  2. Only the official "UTF-8" understood (e.g., Windows).

  3. Official "UTF-8" not understood, but some other synonym
     understood (it's not known whether such a platform exists).

  4. Neither "UTF-8" nor any synonym understood (e.g.,
     ancient systems, or ones without utf8 support
     installed).

This patch teaches git to fall back to using the official
"UTF-8" spelling when iconv_open fails (and the encoding was
one of the synonym spellings). This makes things more
convenient to users of type 2 systems, as they can now use
any of the synonyms for the log output encoding.

Type 1 systems are not affected, as iconv already works on
the first try.

Type 4 systems are not affected, as both attempts already
fail.

Type 3 systems will not benefit from the feature, but
because we only use "UTF-8" as a fallback, they will not be
regressed (i.e., you can continue to use "utf8" if your
platform supports it). We could try all the various
synonyms, but since such systems are not even known to
exist, it's not worth the effort.

Signed-off-by: Jeff King <peff@peff.net>
---
JSixt, can you double-check that this passes t4210 for you?

 utf8.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/utf8.c b/utf8.c
index 1087870..8f6e84b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -507,9 +507,25 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 
 	if (!in_encoding)
 		return NULL;
+
 	conv = iconv_open(out_encoding, in_encoding);
-	if (conv == (iconv_t) -1)
-		return NULL;
+	if (conv == (iconv_t) -1) {
+		/*
+		 * Some platforms do not have the variously spelled variants of
+		 * UTF-8, so let's fall back to trying the most official
+		 * spelling. We do so only as a fallback in case the platform
+		 * does understand the user's spelling, but not our official
+		 * one.
+		 */
+		if (is_encoding_utf8(in_encoding))
+			in_encoding = "UTF-8";
+		if (is_encoding_utf8(out_encoding))
+			out_encoding = "UTF-8";
+		conv = iconv_open(out_encoding, in_encoding);
+		if (conv == (iconv_t) -1)
+			return NULL;
+	}
+
 	out = reencode_string_iconv(in, strlen(in), conv);
 	iconv_close(conv);
 	return out;
-- 
1.8.1.4.4.g265d2fa

  reply	other threads:[~2013-02-25 20:31 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-08 23:52 segfault for git log --graph --no-walk --grep a Thomas Haller
2013-02-09  0:05 ` Junio C Hamano
2013-02-09  0:22   ` Junio C Hamano
2013-02-09  0:27     ` Jeff King
2013-02-09  0:39       ` Junio C Hamano
2013-02-09  0:47         ` Junio C Hamano
2013-02-09  1:05           ` Jeff King
2013-02-09  1:08             ` Jeff King
2013-02-11 19:16           ` Jeff King
2013-02-11 20:01             ` Junio C Hamano
2013-02-11 20:36               ` Junio C Hamano
2013-02-11 20:41                 ` Jeff King
2013-02-11 20:55                   ` Junio C Hamano
2013-02-11 20:59               ` [PATCH] log: re-encode commit messages before grepping Jeff King
2013-02-11 21:11                 ` Junio C Hamano
2013-02-11 21:14                   ` Jeff King
2013-02-25  8:37                 ` [PATCH ] t4210-log-i18n: spell encoding name "UTF-8" correctly Johannes Sixt
2013-02-25 15:19                   ` Jeff King
2013-02-25 19:06                     ` Junio C Hamano
2013-02-25 20:31                       ` Jeff King [this message]
2013-02-26  6:47                         ` Johannes Sixt
2013-02-25 21:00                     ` Torsten Bögershausen
2013-02-25 18:54                   ` Torsten Bögershausen
2013-02-25 20:36                     ` Jeff King
2013-02-09  0:29     ` segfault for git log --graph --no-walk --grep a Junio C Hamano
2013-02-09  0:39       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130225203100.GA15932@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=j.sixt@viscovery.net \
    --cc=thom311@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).