[PATCH v1] convert: add support for 'encoding' attribute

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v1] convert: add support for 'encoding' attribute
@ 2017-12-11 15:50 lars.schneider
  2017-12-11 18:39 ` Eric Sunshine
                   ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: lars.schneider @ 2017-12-11 15:50 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, peff, patrick, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Reviewed-by: Patrick Lühne <patrick@luehne.de>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---

Hi,

here is a WIP patch to add text encoding support for files encoded with
something other than UTF-8 [RFC].

The 'encoding' attribute is already used to view blobs in gitk. That
could be a problem as the content is stored in Git with the defined
encoding. This patch would interpret the content as UTF-8 encoded and
it would try to reencode it to the defined encoding on checkout.
Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251").
These folks would see errors like these with my patch:
    error: failed to encode 'foo.bar' from utf-8 to cp1251

A quick search on GitHub reveals 2,722 repositories that use the
'encoding' attribute [1]. Using the GitHub API [2], I identified the
following encodings in all publicly accessible repositories:

    ANSI        <-- garbage (ignore by my implementation)
    cp1251
    cp866
    cp950
    iso8859-1
    koi8-r
    shiftjis    <-- garbage (ignore by my implementation)
    UTF-8       <-- unnecessary (ignore by my implementation)
    utf8        <-- garbage (ignore by my implementation)

TODOs:
    - The iconv docs mention that "errno == EINVAL" is harmless but
      we don't handle that case in utf8.c:reencode_string_iconv()
    - Git does not compile with NO_ICONV=1 right now because of
      compat/precompose_utf8.c. I will send a patch to fix that.

Questions:
    - Should I use warning() or error() in the patch?
      (currently I use the warning)
    - Do you agree with the approach in general?

Thanks,
Lars


[RFC] http://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com
  [1] https://github.com/search?p=1&q=encoding+filename%3Agitattributes&type=Code&utf8=%E2%9C%93
  [2] curl --user larsxschneider:secret -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | uniq
  [3] https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples




Notes:
    Base Ref: master
    Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d
    Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && git checkout afc9e88a4d

 Documentation/gitattributes.txt |  27 ++++++
 convert.c                       | 192 +++++++++++++++++++++++++++++++++++++++-
 t/t0028-encoding.sh             |  60 +++++++++++++
 3 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 t/t0028-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..84de2fa49c 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -146,6 +146,33 @@ Unspecified::
 Any other value causes Git to act as if `text` has been left
 unspecified.

+`encoding`
+^^^^^^^^^^
+
+By default Git assumes UTF-8 encoding for text files.  The `encoding`
+attribute sets the encoding to be used in the working directory.
+If the path is added to the index, then Git encodes the content to
+UTF-8.  On checkout the content is encoded back to the original
+encoding.  Consequently, you can use all built-in Git text processing
+tools (e.g. git diff, line ending conversions, etc.) with your
+non-UTF-8 encoded file.
+
+Please note that re-encoding content can cause errors and requires
+resources. Use the `encoding` attribute only if you cannot store
+a file in UTF-8 encoding and if you want Git to be able to process
+the text.
+
+------------------------
+*.txt		text encoding=UTF-16
+------------------------
+
+All `iconv` encodings with a stable round-trip conversion to and from
+UTF-8 are supported.  You can see a full list with the following command:
+
+------------------------
+iconv --list
+------------------------
+
 `eol`
 ^^^^^

diff --git a/convert.c b/convert.c
index 20d7ab67bd..ee19c17104 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"

 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,

 }

+#ifdef NO_ICONV
+#ifndef _ICONV_T
+/* The type is just a placeholder and not actually used. */
+typedef void* iconv_t;
+#endif
+#endif
+
+static struct encoding {
+	const char *name;
+	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
+	iconv_t to_worktree;  /* conversion to user-defined encoding */
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "utf-8";
+static iconv_t invalid_conversion = (iconv_t)-1;
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc)
+{
+#ifndef NO_ICONV
+	char *dst, *re_src;
+	int dst_len, re_src_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (enc->to_git == invalid_conversion) {
+		enc->to_git = iconv_open(default_encoding, encoding->name);
+		if (enc->to_git == invalid_conversion)
+			warning(_("unsupported encoding %s"), encoding->name);
+	}
+
+	if (enc->to_worktree == invalid_conversion)
+		enc->to_worktree = iconv_open(encoding->name, default_encoding);
+
+	/*
+	 * Do nothing if the encoding is not supported. This could happen in
+	 * two cases:
+	 *   (1) The encoding is garbage. That is no problem as we would not
+	 *       encode the content to UTF-8 on "add" and we would not encode
+	 *       it back on "checkout".
+	 *   (2) Git users use different iconv versions that support different
+	 *       encodings. This could lead to problems, as the content might
+	 *       not be encoded on "add" but encoded back on "checkout" (or
+	 *       the other way around).
+	 * We print a one-time warning to the user in both cases above.
+	 */
+	if (enc->to_git == invalid_conversion || enc->to_worktree == invalid_conversion)
+		return 0;
+
+	dst = reencode_string_iconv(src, src_len, enc->to_git, &dst_len);
+	if (!dst)
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		die(_("failed to encode '%s' from %s to %s"),
+			path, enc->name, default_encoding);
+
+	/*
+	 * Encode dst back to ensure no information is lost. This wastes
+	 * a few cycles as most conversions are round trip conversion
+	 * safe. However, content that has an invalid encoding might not
+	 * match its original byte sequence after the UTF-8 conversion
+	 * round trip. Let's play safe here and check the round trip
+	 * conversion.
+	 */
+	re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
+	if (!re_src || strcmp(src, re_src)) {
+		die(_("encoding '%s' from %s to %s and back is not the same"),
+			path, enc->name, default_encoding);
+	}
+	free(re_src);
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+#else
+	warning(_("cannot encode '%s' from %s to %s because "
+		"your Git was not compiled with encoding support"),
+		path, enc->name, default_encoding);
+	return 0;
+#endif
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+#ifndef NO_ICONV
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	if (enc->to_worktree == invalid_conversion) {
+		enc->to_worktree = iconv_open(encoding->name, default_encoding);
+		if (enc->to_worktree == invalid_conversion)
+			warning("unsupported encoding %s", encoding->name);
+	}
+
+	/*
+	 * Do nothing if the encoding is not supported.
+	 * See detailed explanation in encode_to_git().
+	 */
+	if (enc->to_worktree == invalid_conversion)
+		return 0;
+
+	dst = reencode_string_iconv(src, src_len, enc->to_worktree, &dst_len);
+	if (!dst) {
+		warning("failed to encode '%s' from %s to %s",
+			path, default_encoding, encoding->name);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+#else
+	warning(_("cannot encode '%s' from %s to %s because "
+		"your Git was not compiled with encoding support"),
+		path, default_encoding, encoding->name);
+	return 0;
+#endif
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -969,6 +1113,32 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }

+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_UNSET(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup(value);
+	enc->to_git = invalid_conversion;
+	enc->to_worktree = invalid_conversion;
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1024,6 +1194,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *encoding; /* Supported encoding or default encoding if NULL */
 };

 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1032,8 +1203,9 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)

 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "encoding", NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}

@@ -1055,6 +1227,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1135,6 +1308,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.encoding);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
 		if (ret && dst) {
@@ -1158,6 +1338,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

+	encode_to_git(path, dst->buf, dst->len, dst, ca.encoding);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1189,6 +1370,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}

+	ret |= encode_to_worktree(path, src, len, dst, ca.encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1655,6 +1842,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;

+	if (ca.encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;

diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
new file mode 100755
index 0000000000..24f24f9e0d
--- /dev/null
+++ b/t/t0028-encoding.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+test_description='encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+
+	text="hallo there!\ncan you read me?" &&
+
+	echo "*.utf16 text encoding=utf-16" >.gitattributes &&
+	printf "$text" >t.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >t.utf16.raw &&
+	cp t.utf16.raw t.utf16 &&
+
+	git add .gitattributes t.utf16.raw t.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :t.utf16 >t.utf16.git &&
+	test_cmp_bin t.utf8.raw t.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm t.utf16 &&
+	git checkout t.utf16 &&
+	test_cmp_bin t.utf16.raw t.utf16
+'
+
+test_expect_success 'warn if an unsupported encoding is used' '
+	echo "*.garbage text encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	git add t.garbage 2>error.out &&
+	test_i18ngrep "warning: unsupported encoding" error.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'fail if files with invalid encoding are added' '
+	printf "\0\0h\0a" >error.utf16 &&
+	# The test string encoding would fail
+	# test_must_fail iconv -f utf-16 -t utf-8 error.utf16 &&
+	test_must_fail git add error.utf16
+'
+
+# Some sequences might trigger errno == E2BIG in reencode_string_iconv, utf.8.
+# This would cause no error on "git add" and, consequently, the Git internal
+# UTF-8 encoded blob would contain garbage. Hence, the worktree file after a
+# checkout would contain garbage, too. This garbage would not match the file
+# that was initially added.
+test_expect_success 'fail if encoding from X to UTF-8 and back to X is not the same' '
+	printf "\xc3\x28" >error.utf16 &&
+	# The test string re-encoding would fail
+	# iconv -f utf-16 -t utf-8 error.utf16 | iconv -f utf-8 -t utf-16 &&
+	test_must_fail git add error.utf16
+'
+
+test_done

base-commit: 95ec6b1b3393eb6e26da40c565520a8db9796e9f
--
2.15.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
@ 2017-12-11 18:39 ` Eric Sunshine
  2017-12-11 23:47   ` Lars Schneider
  2017-12-11 20:47 ` Johannes Sixt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Eric Sunshine @ 2017-12-11 18:39 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen, Jeff King,
	patrick, Lars Schneider

On Mon, Dec 11, 2017 at 10:50 AM,  <lars.schneider@autodesk.com> wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Reviewed-by: Patrick Lühne <patrick@luehne.de>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +                        struct strbuf *buf, struct encoding *enc)
> +{
> +#ifndef NO_ICONV
> +       char *dst, *re_src;
> +       int dst_len, re_src_len;
> +
> +       /*
> +        * No encoding is specified or there is nothing to encode.
> +        * Tell the caller that the content was not modified.
> +        */
> +       if (!enc || (src && !src_len))
> +               return 0;
> +
> +       /*
> +        * Looks like we got called from "would_convert_to_git()".
> +        * This means Git wants to know if it would encode (= modify!)
> +        * the content. Let's answer with "yes", since an encoding was
> +        * specified.
> +        */
> +       if (!buf && !src)
> +               return 1;
> +
> +       if (enc->to_git == invalid_conversion) {
> +               enc->to_git = iconv_open(default_encoding, encoding->name);
> +               if (enc->to_git == invalid_conversion)
> +                       warning(_("unsupported encoding %s"), encoding->name);
> +       }
> +
> +       if (enc->to_worktree == invalid_conversion)
> +               enc->to_worktree = iconv_open(encoding->name, default_encoding);

Do you need to be calling iconv_close() somewhere on the result of the
iconv_open() calls? [Answering myself after reading the rest of the
patch: You're caching these opened 'iconv' descriptors, so you don't
plan on closing them.]

> + [...]
> +       /*
> +        * Encode dst back to ensure no information is lost. This wastes
> +        * a few cycles as most conversions are round trip conversion
> +        * safe. However, content that has an invalid encoding might not
> +        * match its original byte sequence after the UTF-8 conversion
> +        * round trip. Let's play safe here and check the round trip
> +        * conversion.
> +        */
> +       re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
> +       if (!re_src || strcmp(src, re_src)) {

You're using strcmp() as opposed to memcmp() because you expect
're_src' will unconditionally be UTF-8-encoded, right?

> +               die(_("encoding '%s' from %s to %s and back is not the same"),
> +                       path, enc->name, default_encoding);
> +       }
> +       free(re_src);
> +
> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +       return 1;
> +#else
> +       warning(_("cannot encode '%s' from %s to %s because "
> +               "your Git was not compiled with encoding support"),
> +               path, enc->name, default_encoding);
> +       return 0;
> +#endif
> +}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
  2017-12-11 18:39 ` Eric Sunshine
@ 2017-12-11 20:47 ` Johannes Sixt
  2017-12-11 23:42   ` Lars Schneider
  2017-12-18 10:13   ` Torsten Bögershausen
  2017-12-15  9:58 ` Jeff King
  2017-12-17 17:14 ` Torsten Bögershausen
  3 siblings, 2 replies; 36+ messages in thread
From: Johannes Sixt @ 2017-12-11 20:47 UTC (permalink / raw)
  To: lars.schneider, git; +Cc: gitster, tboegi, peff, patrick, Lars Schneider

Am 11.12.2017 um 16:50 schrieb lars.schneider@autodesk.com:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Reviewed-by: Patrick Lühne <patrick@luehne.de>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> 
> Hi,
> 
> here is a WIP patch to add text encoding support for files encoded with
> something other than UTF-8 [RFC].
> 
> The 'encoding' attribute is already used to view blobs in gitk. That
> could be a problem as the content is stored in Git with the defined
> encoding. This patch would interpret the content as UTF-8 encoded and

This will be a major drawback for me because my code base stores text 
files that are not UTF-8 encoded. And I do use the existing 'encoding' 
attribute to view the text in git-gui and gitk. Repurposing this 
attribute name is not an option, IMO.

> it would try to reencode it to the defined encoding on checkout > Plus, many repos define the attribute very broad (e.g. "* 
encoding=cp1251").
> These folks would see errors like these with my patch:
>      error: failed to encode 'foo.bar' from utf-8 to cp1251

-- Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 20:47 ` Johannes Sixt
@ 2017-12-11 23:42   ` Lars Schneider
  2017-12-12  0:59     ` Junio C Hamano
  2017-12-12  7:09     ` Johannes Sixt
  2017-12-18 10:13   ` Torsten Bögershausen
  1 sibling, 2 replies; 36+ messages in thread
From: Lars Schneider @ 2017-12-11 23:42 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: lars.schneider, git, gitster, tboegi, peff, patrick


On 11 Dec 2017, at 21:47, Johannes Sixt <j6t@kdbg.org> wrote:

> Am 11.12.2017 um 16:50 schrieb lars.schneider@autodesk.com:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> Reviewed-by: Patrick Lühne <patrick@luehne.de>
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> Hi,
>> here is a WIP patch to add text encoding support for files encoded with
>> something other than UTF-8 [RFC].
>> The 'encoding' attribute is already used to view blobs in gitk. That
>> could be a problem as the content is stored in Git with the defined
>> encoding. This patch would interpret the content as UTF-8 encoded and
> 
> This will be a major drawback for me because my code base stores text files that are not UTF-8 encoded. And I do use the existing 'encoding' attribute to view the text in git-gui and gitk. Repurposing this attribute name is not an option, IMO.

I understand your point of view and I kind of expected that that reply.
Thanks for the feedback!

Question is: Given that "encoding" is not available, how could I name
             the attribute without confusing the user?

I contemplated:
  - "enc" or "encode" because "eol" and "ident" use abbreviations, too
    (enc could be confused with encryption. plus, a user might ask
     what is the difference between "enc" and "encoding" attribute :-)
  - "wte", "wtenc", or "worktree-encoding" to emphasize that this is 
    the encoding used in the worktree 
    (I fear that users think that is git-worktree, the command, related)

I think my favorite is "worktree-encoding".
What do you think?

Thanks,
Lars 


BTW: I am curios, can you share what encoding you use?
My main use case is UTF-16 and I was surprised that I haven't
found a single public repo on github.com with "encoding=utf-16"


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 18:39 ` Eric Sunshine
@ 2017-12-11 23:47   ` Lars Schneider
  2017-12-11 23:58     ` Eric Sunshine
  0 siblings, 1 reply; 36+ messages in thread
From: Lars Schneider @ 2017-12-11 23:47 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Jeff King, patrick


On 11 Dec 2017, at 19:39, Eric Sunshine <sunshine@sunshineco.com> wrote:

> On Mon, Dec 11, 2017 at 10:50 AM,  <lars.schneider@autodesk.com> wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Reviewed-by: Patrick Lühne <patrick@luehne.de>
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +                        struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> +       char *dst, *re_src;
>> +       int dst_len, re_src_len;
>> +
>> +       /*
>> +        * No encoding is specified or there is nothing to encode.
>> +        * Tell the caller that the content was not modified.
>> +        */
>> +       if (!enc || (src && !src_len))
>> +               return 0;
>> +
>> +       /*
>> +        * Looks like we got called from "would_convert_to_git()".
>> +        * This means Git wants to know if it would encode (= modify!)
>> +        * the content. Let's answer with "yes", since an encoding was
>> +        * specified.
>> +        */
>> +       if (!buf && !src)
>> +               return 1;
>> +
>> +       if (enc->to_git == invalid_conversion) {
>> +               enc->to_git = iconv_open(default_encoding, encoding->name);
>> +               if (enc->to_git == invalid_conversion)
>> +                       warning(_("unsupported encoding %s"), encoding->name);
>> +       }
>> +
>> +       if (enc->to_worktree == invalid_conversion)
>> +               enc->to_worktree = iconv_open(encoding->name, default_encoding);
> 
> Do you need to be calling iconv_close() somewhere on the result of the
> iconv_open() calls? [Answering myself after reading the rest of the
> patch: You're caching these opened 'iconv' descriptors, so you don't
> plan on closing them.]

Should this information go into the commit message to avoid confusing
future readers? I think, yes.


>> + [...]
>> +       /*
>> +        * Encode dst back to ensure no information is lost. This wastes
>> +        * a few cycles as most conversions are round trip conversion
>> +        * safe. However, content that has an invalid encoding might not
>> +        * match its original byte sequence after the UTF-8 conversion
>> +        * round trip. Let's play safe here and check the round trip
>> +        * conversion.
>> +        */
>> +       re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
>> +       if (!re_src || strcmp(src, re_src)) {
> 
> You're using strcmp() as opposed to memcmp() because you expect
> 're_src' will unconditionally be UTF-8-encoded, right?

Yes. But since you mention it, I think it would be better to use
memcmp() here! Plus, it might be a tiny bit faster ;-)

Thanks,
Lars

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 23:47   ` Lars Schneider
@ 2017-12-11 23:58     ` Eric Sunshine
  2017-12-12 10:58       ` Lars Schneider
  0 siblings, 1 reply; 36+ messages in thread
From: Eric Sunshine @ 2017-12-11 23:58 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Jeff King, patrick

On Mon, Dec 11, 2017 at 6:47 PM, Lars Schneider
<larsxschneider@gmail.com> wrote:
> On 11 Dec 2017, at 19:39, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> On Mon, Dec 11, 2017 at 10:50 AM,  <lars.schneider@autodesk.com> wrote:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>
>>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>>> encoding. Git will happily accept content in all other encodings, too,
>>> but it might not be able to process the text (e.g. viewing diffs or
>>> changing line endings).
>>>
>>> Add an attribute to tell Git what encoding the user has defined for a
>>> given file. If the content is added to the index, then Git converts the
>>> content to a canonical UTF-8 representation. On checkout Git will
>>> reverse the conversion.
>>>
>>> Reviewed-by: Patrick Lühne <patrick@luehne.de>
>>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>>> ---
>>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>>> +                        struct strbuf *buf, struct encoding *enc)
>>> +{
>>> +       if (enc->to_git == invalid_conversion) {
>>> +               enc->to_git = iconv_open(default_encoding, encoding->name);
>>> +               if (enc->to_git == invalid_conversion)
>>> +                       warning(_("unsupported encoding %s"), encoding->name);
>>> +       }
>>> +
>>> +       if (enc->to_worktree == invalid_conversion)
>>> +               enc->to_worktree = iconv_open(encoding->name, default_encoding);
>>
>> Do you need to be calling iconv_close() somewhere on the result of the
>> iconv_open() calls? [Answering myself after reading the rest of the
>> patch: You're caching these opened 'iconv' descriptors, so you don't
>> plan on closing them.]
>
> Should this information go into the commit message to avoid confusing
> future readers? I think, yes.

Maybe. However, the code which does the actual caching is so distant
from these iconv_open() invocations that it might be more helpful to
have an in-code comment here saying that the "missing" iconv_close()
invocations is intentional.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 23:42   ` Lars Schneider
@ 2017-12-12  0:59     ` Junio C Hamano
  2017-12-12  7:15       ` Johannes Sixt
  2017-12-12  7:09     ` Johannes Sixt
  1 sibling, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2017-12-12  0:59 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Johannes Sixt, lars.schneider, git, tboegi, peff, patrick

Lars Schneider <larsxschneider@gmail.com> writes:

> I contemplated:
>   - "enc" or "encode" because "eol" and "ident" use abbreviations, too
>     (enc could be confused with encryption. plus, a user might ask
>      what is the difference between "enc" and "encoding" attribute :-)
>   - "wte", "wtenc", or "worktree-encoding" to emphasize that this is 
>     the encoding used in the worktree 
>     (I fear that users think that is git-worktree, the command, related)

In the context of Git, the word "worktree" does have a specific
meaning that is different from working tree.  

Stepping back a bit, what does this thing do you are introducing?
And what does the other thing do that J6t is using, that would get
confused with this new one?

What does the other one do?  "Declare that the contents of this path
is in this encoding"?  As opposed to the new one, which tells Git to
"run iconv from and to this encoding when checking out and checking
in"?

If so, any phrase that depends heavily on the word "encode" would
not help differenciating the two uses.  The phrase needs to be
something that contrasts the new one, which actively modifies things
(what is on the filesystem is not what is stored in the object
store), with the old one, which does not (passed as a declaration to
a viewer what encoding the contents already use and does not change
anything).

Do people who will use this feature familiar with the concept of
smudge/clean?  If you want to avoid "working-tree" (or "worktree",
which definitely you would want to avoid) because you fear confused
users, perhaps "smudge-encoding" would work (we declare that the
result of smudge operations are left in this encoding, so the
opposite operation "clean" will do the reverse---and we say this
without explicitly saying that the other end of the conversion is
always UTF-8)?  Or "checkout-encoding" (the same explanation; we do
not say the opposite operation "checkin/add" will do the reverse).

I personally do not think "working-tree-encoding" is too horrible,
but I do agree that some users may be confused.  So I dunno.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 23:42   ` Lars Schneider
  2017-12-12  0:59     ` Junio C Hamano
@ 2017-12-12  7:09     ` Johannes Sixt
  1 sibling, 0 replies; 36+ messages in thread
From: Johannes Sixt @ 2017-12-12  7:09 UTC (permalink / raw)
  To: Lars Schneider; +Cc: lars.schneider, git, gitster, tboegi, peff, patrick

Am 12.12.2017 um 00:42 schrieb Lars Schneider:
> BTW: I am curios, can you share what encoding you use?
> My main use case is UTF-16 and I was surprised that I haven't
> found a single public repo on github.com with "encoding=utf-16"

Shift-JIS and CP1252. These are used for Windows resource files (*.rc).

-- Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-12  0:59     ` Junio C Hamano
@ 2017-12-12  7:15       ` Johannes Sixt
  2017-12-12 10:55         ` Lars Schneider
  0 siblings, 1 reply; 36+ messages in thread
From: Johannes Sixt @ 2017-12-12  7:15 UTC (permalink / raw)
  To: Junio C Hamano, Lars Schneider; +Cc: lars.schneider, git, tboegi, peff, patrick

Am 12.12.2017 um 01:59 schrieb Junio C Hamano:
> Stepping back a bit, what does this thing do you are introducing?
> And what does the other thing do that J6t is using, that would get
> confused with this new one?
> 
> What does the other one do?  "Declare that the contents of this path
> is in this encoding"?  As opposed to the new one, which tells Git to
> "run iconv from and to this encoding when checking out and checking
> in"?
> 
> If so, any phrase that depends heavily on the word "encode" would
> not help differenciating the two uses.  The phrase needs to be
> something that contrasts the new one, which actively modifies things
> (what is on the filesystem is not what is stored in the object
> store), with the old one, which does not (passed as a declaration to
> a viewer what encoding the contents already use and does not change
> anything).

Well explained!

> ...  perhaps "smudge-encoding" would work (we declare that the
> result of smudge operations are left in this encoding, so the
> opposite operation "clean" will do the reverse---and we say this
> without explicitly saying that the other end of the conversion is
> always UTF-8)?  Or "checkout-encoding" (the same explanation; we do
> not say the opposite operation "checkin/add" will do the reverse).

I would favor "checkout-encoding" over "smudge-encoding" only because 
"checkout" is better known than "smudge", I would think. I do not have 
better suggestions.

-- Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-12  7:15       ` Johannes Sixt
@ 2017-12-12 10:55         ` Lars Schneider
  2017-12-12 19:31           ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Lars Schneider @ 2017-12-12 10:55 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Junio C Hamano, Lars Schneider, Git List,
	Torsten Bögershausen, peff, patrick


> On 12 Dec 2017, at 08:15, Johannes Sixt <j6t@kdbg.org> wrote:
> 
> Am 12.12.2017 um 01:59 schrieb Junio C Hamano:
>> Stepping back a bit, what does this thing do you are introducing?
>> And what does the other thing do that J6t is using, that would get
>> confused with this new one?
>> What does the other one do?  "Declare that the contents of this path
>> is in this encoding"?  As opposed to the new one, which tells Git to
>> "run iconv from and to this encoding when checking out and checking
>> in"?
>> If so, any phrase that depends heavily on the word "encode" would
>> not help differenciating the two uses.  The phrase needs to be
>> something that contrasts the new one, which actively modifies things
>> (what is on the filesystem is not what is stored in the object
>> store), with the old one, which does not (passed as a declaration to
>> a viewer what encoding the contents already use and does not change
>> anything).
> 
> Well explained!
> 
>> ...  perhaps "smudge-encoding" would work (we declare that the
>> result of smudge operations are left in this encoding, so the
>> opposite operation "clean" will do the reverse---and we say this
>> without explicitly saying that the other end of the conversion is
>> always UTF-8)?  Or "checkout-encoding" (the same explanation; we do
>> not say the opposite operation "checkin/add" will do the reverse).
> 
> I would favor "checkout-encoding" over "smudge-encoding" only because "checkout" is better known than "smudge", I would think. I do not have better suggestions.

Thanks for your thoughts! "checkout-encoding" would work for me.
However, it might convey that Git "does change the files of the
user in some way" (which is true from Git's perspective but not
from the user perspective).

Patrick and I brainstormed a few more possible alternatives:

apply-encoding
blob-encoding
checkout-encoding
diff-encoding
encoding-hint
external-encoding
handle-as
internal-encoding
keep-encoding
present-as
preserve-encoding
source-encoding
text-conversion
text-encoding
treat-as 
treat-encoding-as

Our favorite is "treat-encoding-as". Do you consider this better
or worse than "checkout-encoding"?

Thanks,
Lars


PS: Naming things is hard ;-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 23:58     ` Eric Sunshine
@ 2017-12-12 10:58       ` Lars Schneider
  0 siblings, 0 replies; 36+ messages in thread
From: Lars Schneider @ 2017-12-12 10:58 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Jeff King, patrick


> On 12 Dec 2017, at 00:58, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Mon, Dec 11, 2017 at 6:47 PM, Lars Schneider
> <larsxschneider@gmail.com> wrote:
>> On 11 Dec 2017, at 19:39, Eric Sunshine <sunshine@sunshineco.com> wrote:
>>> On Mon, Dec 11, 2017 at 10:50 AM,  <lars.schneider@autodesk.com> wrote:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> 
>>>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>>>> encoding. Git will happily accept content in all other encodings, too,
>>>> but it might not be able to process the text (e.g. viewing diffs or
>>>> changing line endings).
>>>> 
>>>> Add an attribute to tell Git what encoding the user has defined for a
>>>> given file. If the content is added to the index, then Git converts the
>>>> content to a canonical UTF-8 representation. On checkout Git will
>>>> reverse the conversion.
>>>> 
>>>> Reviewed-by: Patrick Lühne <patrick@luehne.de>
>>>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>>>> ---
>>>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>>>> +                        struct strbuf *buf, struct encoding *enc)
>>>> +{
>>>> +       if (enc->to_git == invalid_conversion) {
>>>> +               enc->to_git = iconv_open(default_encoding, encoding->name);
>>>> +               if (enc->to_git == invalid_conversion)
>>>> +                       warning(_("unsupported encoding %s"), encoding->name);
>>>> +       }
>>>> +
>>>> +       if (enc->to_worktree == invalid_conversion)
>>>> +               enc->to_worktree = iconv_open(encoding->name, default_encoding);
>>> 
>>> Do you need to be calling iconv_close() somewhere on the result of the
>>> iconv_open() calls? [Answering myself after reading the rest of the
>>> patch: You're caching these opened 'iconv' descriptors, so you don't
>>> plan on closing them.]
>> 
>> Should this information go into the commit message to avoid confusing
>> future readers? I think, yes.
> 
> Maybe. However, the code which does the actual caching is so distant
> from these iconv_open() invocations that it might be more helpful to
> have an in-code comment here saying that the "missing" iconv_close()
> invocations is intentional.

Agreed. I will add that in v2.

Thanks,
Lars


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-12 10:55         ` Lars Schneider
@ 2017-12-12 19:31           ` Junio C Hamano
  2017-12-13 17:57             ` Lars Schneider
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2017-12-12 19:31 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Johannes Sixt, Lars Schneider, Git List,
	Torsten Bögershausen, peff, patrick

Lars Schneider <larsxschneider@gmail.com> writes:

> Our favorite is "treat-encoding-as". Do you consider this better
> or worse than "checkout-encoding"?

I am afraid that "treat as" is not sufficiently specific and would
invite a misinterpretation, e.g. "You record the bytes I throw at
you as-is in the object store, but treat them appropriately as
contents that are encoded in cp1252 when presenting".

what is missing is at what stage in the overall user experience does
that "treating" happens.  That causes such a misinterpretation.

So from that point of view, "checkout-" or" working-tree-" would be
a better phrase to accompany "encoding" to clarify what this attr is
for than "treat-as".

Having said all that, this "feature" would need a moderate amount of
clear description in the documentation, and between the perfect and
a suboptimal name, the amount of explanation required would not be
all that different, I suspect.  We need to say that those who wish
to use this attribute are buying into recording their contents in
UTF-8 and when the contents are externalized to the working tree,
they are converted to the encoding the value of this attribute
specifies and vice versa.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-12 19:31           ` Junio C Hamano
@ 2017-12-13 17:57             ` Lars Schneider
  2017-12-13 18:11               ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Lars Schneider @ 2017-12-13 17:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Sixt, Lars Schneider, Git List,
	Torsten Bögershausen, peff, patrick


> On 12 Dec 2017, at 20:31, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> Our favorite is "treat-encoding-as". Do you consider this better
>> or worse than "checkout-encoding"?
> 
> I am afraid that "treat as" is not sufficiently specific and would
> invite a misinterpretation, e.g. "You record the bytes I throw at
> you as-is in the object store, but treat them appropriately as
> contents that are encoded in cp1252 when presenting".
> 
> what is missing is at what stage in the overall user experience does
> that "treating" happens.  That causes such a misinterpretation.
> 
> So from that point of view, "checkout-" or" working-tree-" would be
> a better phrase to accompany "encoding" to clarify what this attr is
> for than "treat-as".

Alright. I'll go with "checkout-encoding" then.


> Having said all that, this "feature" would need a moderate amount of
> clear description in the documentation, and between the perfect and
> a suboptimal name, the amount of explanation required would not be
> all that different, I suspect.  We need to say that those who wish
> to use this attribute are buying into recording their contents in
> UTF-8 and when the contents are externalized to the working tree,
> they are converted to the encoding the value of this attribute
> specifies and vice versa.

Absolutely! The docs will come in v2. I just wanted to send out a v1
to make people aware of what I am working on.

I think storing the encoding in gitattributes might not be the perfect
solution but probably a pragmatic one (because it works and the changes
to Git are rather small). In a perfect world I think I would store
the encoding of a file in the tree object. I didn't pursue that solution
as this would change the Git data model which would open a can of worms
for a problem that not that many people have (almost everyone is on
UTF-8 anyways).

- Lars

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-13 17:57             ` Lars Schneider
@ 2017-12-13 18:11               ` Junio C Hamano
  2017-12-13 23:02                 ` Lars Schneider
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2017-12-13 18:11 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Johannes Sixt, Lars Schneider, Git List,
	Torsten Bögershausen, peff, patrick

Lars Schneider <larsxschneider@gmail.com> writes:

> ... In a perfect world I think I would store
> the encoding of a file in the tree object. I didn't pursue that solution
> as this would change the Git data model which would open a can of worms
> for a problem that not that many people have (almost everyone is on
> UTF-8 anyways).

Having that "encoding" trailt recorded in the tree that contains the
blob would mean that the same blob can be recorded with one
"encoding" trait in a tree, and in a different tree it can be
recorded with a different "encoding" trait.  I doubt it really makes
sense.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-13 18:11               ` Junio C Hamano
@ 2017-12-13 23:02                 ` Lars Schneider
  2017-12-14 23:01                   ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Lars Schneider @ 2017-12-13 23:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Sixt, Lars Schneider, Git List,
	Torsten Bögershausen, Jeff King, patrick

> On 13 Dec 2017, at 19:11, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> ... In a perfect world I think I would store
>> the encoding of a file in the tree object. I didn't pursue that solution
>> as this would change the Git data model which would open a can of worms
>> for a problem that not that many people have (almost everyone is on
>> UTF-8 anyways).
> 
> Having that "encoding" trailt recorded in the tree that contains the
> blob would mean that the same blob can be recorded with one
> "encoding" trait in a tree, and in a different tree it can be
> recorded with a different "encoding" trait.  I doubt it really makes
> sense.

The "blob object" would store the text data encoded in a canonical 
UTF-8 form. The "tree object" would store the encoding. On checkin 
Git would convert the text from the stored encoding to UTF-8 and on 
checkout it would do the reverse.

That way you could control the encoding for a text file specific
for each path similar to the "mode bits". That also means you could
change the encoding of a file while the blob content stays the same.

Changing the encoding of a file with the .gitattributes approach
can be difficult if you have configured the attribute with a very 
broad pattern (e.g. *.foo). You would either need to rename the
file or limit the scope of your pattern.

- Lars

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-13 23:02                 ` Lars Schneider
@ 2017-12-14 23:01                   ` Junio C Hamano
  0 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2017-12-14 23:01 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Johannes Sixt, Lars Schneider, Git List,
	Torsten Bögershausen, Jeff King, patrick

Lars Schneider <larsxschneider@gmail.com> writes:

> That way you could control the encoding for a text file specific
> for each path similar to the "mode bits". That also means you could
> change the encoding of a file while the blob content stays the same.

That is exactly why I said that I doubt it makes sense.

When you have the same blob object (that is in UTF-8) appear at two
places in a tree (or two different subtrees inside a single tree)
and the two tree entries that point at the blob tells contradicting
story, you would have a checkout of the same contents in two
different encodings.  If you have the same blob object appear in two
adjacent commits at the same path, with one commit's tree recording
its encoding differently from the other's, you would switch
encodings when you switch branches.  I doubt these are "features",
but sources of confusion.

Be it a feature, or a misfeature, it is shared with the existing
solution which is .gitattributes embedded in the tree, so you are
not making anything better or worse by moving the information to
tree entry.

What I actually expected to hear when somebody talks about "ideal"
(which may not even be achievable given existing user base) was to
make blob object declare its desired external representation.  That
would remove the two possible confusion cited above, because once
you have a blob, you have everything needed to externalize it.

In any case, I do not think we'd go there anyway, so ...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
  2017-12-11 18:39 ` Eric Sunshine
  2017-12-11 20:47 ` Johannes Sixt
@ 2017-12-15  9:58 ` Jeff King
  2017-12-18 10:54   ` Lars Schneider
  2017-12-17 17:14 ` Torsten Bögershausen
  3 siblings, 1 reply; 36+ messages in thread
From: Jeff King @ 2017-12-15  9:58 UTC (permalink / raw)
  To: lars.schneider; +Cc: git, gitster, tboegi, patrick, Lars Schneider

On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:

> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.

This made me wonder what happens when you have a file that _was_ utf16,
and then you later convert it to utf8 and add a .gitattributes entry.

I tried this with your patch:

  git init repo
  cd repo

  echo foo | iconv -t utf16 >file
  git add file
  git commit -m utf16

  echo 'file encoding=utf16' >.gitattributes
  touch file ;# make it stat-dirty
  git commit -m convert

  git checkout HEAD^

That works OK, because we try to read the attributes from the
destination tree for a checkout. If you do this:

  echo 'file encoding=utf16' >.git/info/attributes
  git checkout HEAD^

then we get:

  warning: failed to encode 'file' from utf-8 to utf16

At least it figured out that it couldn't convert the content. It's
slightly troubling that it would try in the first place, though; are
there encoding pairs where we might accidentally generate nonsense?

It _should_ be uncommon, I think, to have a repo-level attribute set
like that. It's very common for us to use whatever happens to be in the
checked-out .gitattributes for some attributes (e.g., when doing a diff
of an older commit), but I think for checkout-related ones it's not.  So
I think it may generally work in practice. And certainly the line-ending
code would share any similar problems, but at least there the result is
less confusing than mojibake.

Playing around, I also managed to do this:

  echo 'file encoding=utf16' >.gitattributes
  echo foo >file

  # I did these with an old version of git that didn't
  # support the new attribute, so they blindly added the utf8 content.
  git add .
  git commit -m convert

  git.compile checkout HEAD^

which yielded:

  fatal: encoding 'file' from utf16 to utf-8 and back is not the same

Now obviously my situation is somewhat nonsense. I was trying to force
the in-repo representation to utf8, but ended up with a mismatched
working tree file. But what's somewhat troubling is that I couldn't
checkout _away_ from that state due to the die() in convert_to_git().
Which is in turn just there as part of refresh_index().

And indeed, other commands hit the same problem:

  $ git.compile diff
  fatal: encoding 'file' from utf16 to utf-8 and back is not the same

  $ git.compile checkout -f
  fatal: encoding 'file' from utf16 to utf-8 and back is not the same

It may make sense to die() during "git add ." (since we're actually
changing the index entry, and we don't want to put nonsense into a
tree). But I'm not sure it's the best thing for operations which just
want to read the content. For them, perhaps it would be more appropriate
to issue a warning and return the untouched content.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
                   ` (2 preceding siblings ...)
  2017-12-15  9:58 ` Jeff King
@ 2017-12-17 17:14 ` Torsten Bögershausen
  2017-12-28 16:14   ` Lars Schneider
  3 siblings, 1 reply; 36+ messages in thread
From: Torsten Bögershausen @ 2017-12-17 17:14 UTC (permalink / raw)
  To: lars.schneider; +Cc: git, gitster, peff, patrick, Lars Schneider

On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Reviewed-by: Patrick Lühne <patrick@luehne.de>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> 
> Hi,
> 
> here is a WIP patch to add text encoding support for files encoded with
> something other than UTF-8 [RFC].
> 
> The 'encoding' attribute is already used to view blobs in gitk. That
> could be a problem as the content is stored in Git with the defined
> encoding. This patch would interpret the content as UTF-8 encoded and
> it would try to reencode it to the defined encoding on checkout.
> Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251").
> These folks would see errors like these with my patch:
>     error: failed to encode 'foo.bar' from utf-8 to cp1251
> 
> A quick search on GitHub reveals 2,722 repositories that use the
> 'encoding' attribute [1]. Using the GitHub API [2], I identified the
> following encodings in all publicly accessible repositories:
> 
>     ANSI        <-- garbage (ignore by my implementation)
>     cp1251
>     cp866
>     cp950
>     iso8859-1
>     koi8-r
>     shiftjis    <-- garbage (ignore by my implementation)
>     UTF-8       <-- unnecessary (ignore by my implementation)
>     utf8        <-- garbage (ignore by my implementation)
> 
> TODOs:
>     - The iconv docs mention that "errno == EINVAL" is harmless but
>       we don't handle that case in utf8.c:reencode_string_iconv()
>     - Git does not compile with NO_ICONV=1 right now because of
>       compat/precompose_utf8.c. I will send a patch to fix that.

Minor: Does Git not compile under MacOS when setting NO_ICONV=1 ?
This is possible not introduced by the patch. 

> 
> Questions:
>     - Should I use warning() or error() in the patch?
>       (currently I use the warning)

Errors, see below.

>     - Do you agree with the approach in general?

Yes, some comments inline.

> 
> Thanks,
> Lars
> 
> 
> [RFC] http://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com
>   [1] https://github.com/search?p=1&q=encoding+filename%3Agitattributes&type=Code&utf8=%E2%9C%93
>   [2] curl --user larsxschneider:secret -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | uniq
>   [3] https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples
> 
> 
> 
> 
> Notes:
>     Base Ref: master
>     Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d
>     Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && git checkout afc9e88a4d
> 
>  Documentation/gitattributes.txt |  27 ++++++
>  convert.c                       | 192 +++++++++++++++++++++++++++++++++++++++-
>  t/t0028-encoding.sh             |  60 +++++++++++++
>  3 files changed, 278 insertions(+), 1 deletion(-)
>  create mode 100755 t/t0028-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..84de2fa49c 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -146,6 +146,33 @@ Unspecified::
>  Any other value causes Git to act as if `text` has been left
>  unspecified.
> 
> +`encoding`
> +^^^^^^^^^^
> +
> +By default Git assumes UTF-8 encoding for text files.  The `encoding`

This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.

> +attribute sets the encoding to be used in the working directory.
> +If the path is added to the index, then Git encodes the content to
> +UTF-8.  On checkout the content is encoded back to the original
> +encoding.  Consequently, you can use all built-in Git text processing
> +tools (e.g. git diff, line ending conversions, etc.) with your
> +non-UTF-8 encoded file.
> +
> +Please note that re-encoding content can cause errors and requires
> +resources. Use the `encoding` attribute only if you cannot store
> +a file in UTF-8 encoding and if you want Git to be able to process
> +the text.
> +
> +------------------------
> +*.txt		text encoding=UTF-16
> +------------------------

I think that encoding (or worktree-encoding as said elsewhere) must imply text.
(That is not done in convert.c, but assuming binary or even auto
does not make much sense to me)


> +
> +All `iconv` encodings with a stable round-trip conversion to and from
> +UTF-8 are supported.  You can see a full list with the following command:
> +
> +------------------------
> +iconv --list
> +------------------------
> +
>  `eol`
>  ^^^^^
> 
> diff --git a/convert.c b/convert.c
> index 20d7ab67bd..ee19c17104 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
> 
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> 
>  }

I would avoid to use these #ifdefs here.
All functions can be in strbuf.c (and may have #ifdefs there), but not here.

> 
> +#ifdef NO_ICONV
> +#ifndef _ICONV_T
> +/* The type is just a placeholder and not actually used. */
> +typedef void* iconv_t;
> +#endif
> +#endif
> +
> +static struct encoding {
> +	const char *name;
> +	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
> +	iconv_t to_worktree;  /* conversion to user-defined encoding */
> +	struct encoding *next;
> +} *encoding, **encoding_tail;

This seems like an optimazation, assuning that iconv_open() eats a lot of ressources.
I don't think this is the case. (Undere MacOS we run iconv_open() together with
every opendir(), and optimizing this out did not show any measurable improvements)

> +static const char *default_encoding = "utf-8";

The most portable form is "UTF-8" (correct me if that is wrong)

> +static iconv_t invalid_conversion = (iconv_t)-1;
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +			 struct strbuf *buf, struct encoding *enc)
> +{
> +#ifndef NO_ICONV
> +	char *dst, *re_src;
> +	int dst_len, re_src_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	/*
> +	 * Looks like we got called from "would_convert_to_git()".
> +	 * This means Git wants to know if it would encode (= modify!)
> +	 * the content. Let's answer with "yes", since an encoding was
> +	 * specified.
> +	 */
> +	if (!buf && !src)
> +		return 1;
> +
> +	if (enc->to_git == invalid_conversion) {
> +		enc->to_git = iconv_open(default_encoding, encoding->name);
> +		if (enc->to_git == invalid_conversion)
> +			warning(_("unsupported encoding %s"), encoding->name);
> +	}

 	/* There are 2 different types of reaction:
	      Either users  know what that a warning means: You asked for problems,
	      	        and do the right thing. Other may may ignore the warning
			       - in both cases an error is useful */

> +
> +	if (enc->to_worktree == invalid_conversion)
> +		enc->to_worktree = iconv_open(encoding->name, default_encoding);
> +
> +	/*
> +	 * Do nothing if the encoding is not supported. This could happen in
> +	 * two cases:
> +	 *   (1) The encoding is garbage. That is no problem as we would not
> +	 *       encode the content to UTF-8 on "add" and we would not encode
> +	 *       it back on "checkout".
> +	 *   (2) Git users use different iconv versions that support different
> +	 *       encodings. This could lead to problems, as the content might
> +	 *       not be encoded on "add" but encoded back on "checkout" (or
> +	 *       the other way around).
> +	 * We print a one-time warning to the user in both cases above.
> +	 */

Isn't an error better than "garbage in -> garbage out" ?

> +	if (enc->to_git == invalid_conversion || enc->to_worktree == invalid_conversion)
> +		return 0;
> +
> +	dst = reencode_string_iconv(src, src_len, enc->to_git, &dst_len);

I would suggest to write a function in strbuf and use it here:

int strbuf_add_reencode((struct strbuf *sb, const void *data, size_t len,
                         const char *from, const char *to)

> +	if (!dst)
> +		/*
> +		 * We could add the blob "as-is" to Git. However, on checkout
> +		 * we would try to reencode to the original encoding. This
> +		 * would fail and we would leave the user with a messed-up
> +		 * working tree. Let's try to avoid this by screaming loud.
> +		 */
> +		die(_("failed to encode '%s' from %s to %s"),
> +			path, enc->name, default_encoding);
> +
> +	/*
> +	 * Encode dst back to ensure no information is lost. This wastes

"wastes" ? "uses" i would say.

> +	 * a few cycles as most conversions are round trip conversion
> +	 * safe. However, content that has an invalid encoding might not
> +	 * match its original byte sequence after the UTF-8 conversion
> +	 * round trip. Let's play safe here and check the round trip
> +	 * conversion.
> +	 */
> +	re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
> +	if (!re_src || strcmp(src, re_src)) {
> +		die(_("encoding '%s' from %s to %s and back is not the same"),
> +			path, enc->name, default_encoding);
> +	}
> +	free(re_src);
> +
> +	strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +	return 1;
> +#else
> +	warning(_("cannot encode '%s' from %s to %s because "
> +		"your Git was not compiled with encoding support"),
> +		path, enc->name, default_encoding);
> +	return 0;
> +#endif
> +}
> +
> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
> +			      struct strbuf *buf, struct encoding *enc)


Similar here: Better to use a (new) function from strbuf.

> +{
> +#ifndef NO_ICONV
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	if (enc->to_worktree == invalid_conversion) {
> +		enc->to_worktree = iconv_open(encoding->name, default_encoding);
> +		if (enc->to_worktree == invalid_conversion)
> +			warning("unsupported encoding %s", encoding->name);
> +	}
> +
> +	/*
> +	 * Do nothing if the encoding is not supported.
> +	 * See detailed explanation in encode_to_git().
> +	 */
> +	if (enc->to_worktree == invalid_conversion)
> +		return 0;
> +
> +	dst = reencode_string_iconv(src, src_len, enc->to_worktree, &dst_len);
> +	if (!dst) {
> +		warning("failed to encode '%s' from %s to %s",
> +			path, default_encoding, encoding->name);
> +		return 0;
> +	}
> +
> +	strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +	return 1;
> +#else
> +	warning(_("cannot encode '%s' from %s to %s because "
> +		"your Git was not compiled with encoding support"),
> +		path, default_encoding, encoding->name);
> +	return 0;
> +#endif
> +}
> +
>  static int crlf_to_git(const struct index_state *istate,
>  		       const char *path, const char *src, size_t len,
>  		       struct strbuf *buf,
> @@ -969,6 +1113,32 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>  	return 1;
>  }
> 
> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> +{
> +	const char *value = check->value;
> +	struct encoding *enc;
> +
> +	if (ATTR_UNSET(value))
> +		return NULL;
> +
> +	for (enc = encoding; enc; enc = enc->next)
> +		if (!strcmp(value, enc->name))
> +			return enc;
> +
> +	/* Don't encode to the default encoding */
> +	if (!strcasecmp(value, default_encoding))
> +		return NULL;
> +
> +	enc = xcalloc(1, sizeof(struct convert_driver));
> +	enc->name = xstrdup(value);
> +	enc->to_git = invalid_conversion;
> +	enc->to_worktree = invalid_conversion;
> +	*encoding_tail = enc;
> +	encoding_tail = &(enc->next);
> +
> +	return enc;
> +}
> +
>  static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
>  {
>  	const char *value = check->value;
> @@ -1024,6 +1194,7 @@ struct conv_attrs {
>  	enum crlf_action attr_action; /* What attr says */
>  	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
>  	int ident;
> +	struct encoding *encoding; /* Supported encoding or default encoding if NULL */
>  };
> 
>  static void convert_attrs(struct conv_attrs *ca, const char *path)
> @@ -1032,8 +1203,9 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
> 
>  	if (!check) {
>  		check = attr_check_initl("crlf", "ident", "filter",
> -					 "eol", "text", NULL);
> +					 "eol", "text", "encoding", NULL);
>  		user_convert_tail = &user_convert;
> +		encoding_tail = &encoding;
>  		git_config(read_convert_config, NULL);
>  	}

As written before, "worktree-encoding" should imply text.

> 
> @@ -1055,6 +1227,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  			else if (eol_attr == EOL_CRLF)
>  				ca->crlf_action = CRLF_TEXT_CRLF;
>  		}
> +		ca->encoding = git_path_check_encoding(ccheck + 5);
>  	} else {
>  		ca->drv = NULL;
>  		ca->crlf_action = CRLF_UNDEFINED;
> @@ -1135,6 +1308,13 @@ int convert_to_git(const struct index_state *istate,
>  		src = dst->buf;
>  		len = dst->len;
>  	}
> +
> +	ret |= encode_to_git(path, src, len, dst, ca.encoding);
> +	if (ret && dst) {
> +		src = dst->buf;
> +		len = dst->len;
> +	}
> +
>  	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
>  		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
>  		if (ret && dst) {
> @@ -1158,6 +1338,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
>  	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
> 
> +	encode_to_git(path, dst->buf, dst->len, dst, ca.encoding);
>  	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
>  	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
>  }
> @@ -1189,6 +1370,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  		}
>  	}
> 
> +	ret |= encode_to_worktree(path, src, len, dst, ca.encoding);
> +	if (ret) {
> +		src = dst->buf;
> +		len = dst->len;
> +	}
> +
>  	ret_filter = apply_filter(
>  		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
>  	if (!ret_filter && ca.drv && ca.drv->required)
> @@ -1655,6 +1842,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
>  	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
>  		return NULL;
> 
> +	if (ca.encoding)
> +		return NULL;
> +
>  	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
>  		return NULL;
> 

> diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh

(I didn't review the test yet)

Another comment for a possible improvement:
"git diff"  could be told to write the worktree-encoding into the diff,
and "git apply" can pick that up. 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-11 20:47 ` Johannes Sixt
  2017-12-11 23:42   ` Lars Schneider
@ 2017-12-18 10:13   ` Torsten Bögershausen
  2017-12-18 13:12     ` Jeff King
                       ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Torsten Bögershausen @ 2017-12-18 10:13 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: lars.schneider, git, gitster, peff, patrick, Lars Schneider

On Mon, Dec 11, 2017 at 09:47:24PM +0100, Johannes Sixt wrote:
> Am 11.12.2017 um 16:50 schrieb lars.schneider@autodesk.com:
> >From: Lars Schneider <larsxschneider@gmail.com>
> >
> >Git and its tools (e.g. git diff) expect all text files in UTF-8
> >encoding. Git will happily accept content in all other encodings, too,
> >but it might not be able to process the text (e.g. viewing diffs or
> >changing line endings).
> >
> >Add an attribute to tell Git what encoding the user has defined for a
> >given file. If the content is added to the index, then Git converts the
> >content to a canonical UTF-8 representation. On checkout Git will
> >reverse the conversion.
> >
> >Reviewed-by: Patrick Lühne <patrick@luehne.de>
> >Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> >---
> >
> >Hi,
> >
> >here is a WIP patch to add text encoding support for files encoded with
> >something other than UTF-8 [RFC].
> >
> >The 'encoding' attribute is already used to view blobs in gitk. That
> >could be a problem as the content is stored in Git with the defined
> >encoding. This patch would interpret the content as UTF-8 encoded and
> 
> This will be a major drawback for me because my code base stores text files
> that are not UTF-8 encoded. And I do use the existing 'encoding' attribute
> to view the text in git-gui and gitk. Repurposing this attribute name is not
> an option, IMO.

Just to confirm my missing knowledge here:
Does this mean, that git-gui and gitk can decode/reencode
the content of a file/blob, when the .gitattributes say so ?

If yes, would it make sense to enhance the "git diff" instead ?
"git diff --encoding" will pick up the commited encoding from
.attributes, convert it into UTF-8, and run the diff ?
We actually could enhance the "git diff" output with a single
line saying
"Git index-encoding=cp1251"
or so, which can be picked up by "git apply".

The advantage would be that we could continue to commit in UTF-16
as before, and avoid the glitches with .gitattributes, that Peff
pointed out.

Does this make sense ?

> 
> >it would try to reencode it to the defined encoding on checkout > Plus,
> >many repos define the attribute very broad (e.g. "*
> encoding=cp1251").

Is this a user mistake ?

> >These folks would see errors like these with my patch:
> >     error: failed to encode 'foo.bar' from utf-8 to cp1251
> 
> -- Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-15  9:58 ` Jeff King
@ 2017-12-18 10:54   ` Lars Schneider
  2017-12-18 12:59     ` Jeff King
  0 siblings, 1 reply; 36+ messages in thread
From: Lars Schneider @ 2017-12-18 10:54 UTC (permalink / raw)
  To: Jeff King; +Cc: lars.schneider, git, gitster, tboegi, patrick


> On 15 Dec 2017, at 10:58, Jeff King <peff@peff.net> wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
> 
> This made me wonder what happens when you have a file that _was_ utf16,
> and then you later convert it to utf8 and add a .gitattributes entry.
> 
> I tried this with your patch:
> 
>  git init repo
>  cd repo
> 
>  echo foo | iconv -t utf16 >file
>  git add file
>  git commit -m utf16
> 
>  echo 'file encoding=utf16' >.gitattributes
>  touch file ;# make it stat-dirty
>  git commit -m convert
> 
>  git checkout HEAD^
> 
> That works OK, because we try to read the attributes from the
> destination tree for a checkout. If you do this:
> 
>  echo 'file encoding=utf16' >.git/info/attributes
>  git checkout HEAD^
> 
> then we get:
> 
>  warning: failed to encode 'file' from utf-8 to utf16
> 
> At least it figured out that it couldn't convert the content. It's
> slightly troubling that it would try in the first place, though; are
> there encoding pairs where we might accidentally generate nonsense?

At this point we interpret utf-16 content as utf-8 and try to convert
it to utf-16. That of course fails because utf-16 content is no valid
utf-8. How could we stop trying that? How could Git possibly know what
kind of encoding is used (apart from our new hint in gitattributes)?

I asked myself the question about nonsense encoding pairs, too. That
was the reason I added the "X -> UTF-8 -> Y && X == Y" check on Git
add. However, with ".git/info/attributes" you could circumvent that
check and probably generate nonsense.


> It _should_ be uncommon, I think, to have a repo-level attribute set
> like that. It's very common for us to use whatever happens to be in the
> checked-out .gitattributes for some attributes (e.g., when doing a diff
> of an older commit), but I think for checkout-related ones it's not.  So
> I think it may generally work in practice. And certainly the line-ending
> code would share any similar problems, but at least there the result is
> less confusing than mojibake.
> 
> Playing around, I also managed to do this:
> 
>  echo 'file encoding=utf16' >.gitattributes
>  echo foo >file
> 
>  # I did these with an old version of git that didn't
>  # support the new attribute, so they blindly added the utf8 content.
>  git add .
>  git commit -m convert
> 
>  git.compile checkout HEAD^
> 
> which yielded:
> 
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> Now obviously my situation is somewhat nonsense. I was trying to force
> the in-repo representation to utf8, but ended up with a mismatched
> working tree file. But what's somewhat troubling is that I couldn't
> checkout _away_ from that state due to the die() in convert_to_git().
> Which is in turn just there as part of refresh_index().
> 
> And indeed, other commands hit the same problem:
> 
>  $ git.compile diff
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
>  $ git.compile checkout -f
>  fatal: encoding 'file' from utf16 to utf-8 and back is not the same
> 
> It may make sense to die() during "git add ." (since we're actually
> changing the index entry, and we don't want to put nonsense into a
> tree). But I'm not sure it's the best thing for operations which just
> want to read the content. For them, perhaps it would be more appropriate
> to issue a warning and return the untouched content.

Absolutely! Thanks for spotting this. I will try to run die() only on
"git add" in v2.

- Lars

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-18 10:54   ` Lars Schneider
@ 2017-12-18 12:59     ` Jeff King
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff King @ 2017-12-18 12:59 UTC (permalink / raw)
  To: Lars Schneider; +Cc: lars.schneider, git, gitster, tboegi, patrick

On Mon, Dec 18, 2017 at 11:54:32AM +0100, Lars Schneider wrote:

> >  warning: failed to encode 'file' from utf-8 to utf16
> > 
> > At least it figured out that it couldn't convert the content. It's
> > slightly troubling that it would try in the first place, though; are
> > there encoding pairs where we might accidentally generate nonsense?
> 
> At this point we interpret utf-16 content as utf-8 and try to convert
> it to utf-16. That of course fails because utf-16 content is no valid
> utf-8. How could we stop trying that? How could Git possibly know what
> kind of encoding is used (apart from our new hint in gitattributes)?

Yeah, sorry if I wasn't clear: I don't really have an answer to those
questions either. So this is probably the best we can do. I was mostly
just trying to think through the worst case, and what could go wrong.

> > It may make sense to die() during "git add ." (since we're actually
> > changing the index entry, and we don't want to put nonsense into a
> > tree). But I'm not sure it's the best thing for operations which just
> > want to read the content. For them, perhaps it would be more appropriate
> > to issue a warning and return the untouched content.
> 
> Absolutely! Thanks for spotting this. I will try to run die() only on
> "git add" in v2.

Great, thanks!

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-18 10:13   ` Torsten Bögershausen
@ 2017-12-18 13:12     ` Jeff King
  2017-12-23  8:08       ` Torsten Bögershausen
                         ` (4 more replies)
  2017-12-18 18:02     ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
  2017-12-18 21:55     ` Johannes Sixt
  2 siblings, 5 replies; 36+ messages in thread
From: Jeff King @ 2017-12-18 13:12 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Johannes Sixt, lars.schneider, git, gitster, patrick,
	Lars Schneider

On Mon, Dec 18, 2017 at 11:13:34AM +0100, Torsten Bögershausen wrote:

> Just to confirm my missing knowledge here:
> Does this mean, that git-gui and gitk can decode/reencode
> the content of a file/blob, when the .gitattributes say so ?

That's my impression, yes.

> If yes, would it make sense to enhance the "git diff" instead ?
> "git diff --encoding" will pick up the commited encoding from
> .attributes, convert it into UTF-8, and run the diff ?

You can do something like this already:

  git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
  echo file diff=utf16 >.gitattributes

I have no objection to teaching it that "encoding" means roughly the
same thing, which would have two advantages:

 - we could do it in-process, which would be much faster

 - we could skip the extra config step, which is a blocker for
   server-side diffs on sites like GitHub

I don't think you'd really need "--encoding". This could just be
triggered by the normal "--textconv" rules (i.e., just treat this "as
if" the textconv above was configured when we see an encoding
attribute).

> We actually could enhance the "git diff" output with a single
> line saying
> "Git index-encoding=cp1251"
> or so, which can be picked up by "git apply".

That extends it beyond what textconv can do (we assume that textconv
patches are _just_ for human consumption, and can't be applied). And it
would mean you'd potentially want to trigger it in more places. On the
other hand, I don't know that we're guaranteed that a round-trip
between encodings will produce a byte-wise identical result. The nice
thing about piggy-backing on textconv is that it's already dealt with
that problem.

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-18 10:13   ` Torsten Bögershausen
  2017-12-18 13:12     ` Jeff King
@ 2017-12-18 18:02     ` Junio C Hamano
  2017-12-18 21:55     ` Johannes Sixt
  2 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2017-12-18 18:02 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Johannes Sixt, lars.schneider, git, peff, patrick, Lars Schneider

Torsten Bögershausen <tboegi@web.de> writes:

> Just to confirm my missing knowledge here:
> Does this mean, that git-gui and gitk can decode/reencode
> the content of a file/blob, when the .gitattributes say so ?

These programs, when told that a file is in an encoding, read bytes
from that file and interpret that byte sequence in the given
encoding, and blits the corresponding glyphs on the screen---giving
UTF-8 out is not the primary feature of them, so I am not sure
decode/reencode is a good term to use here.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-18 10:13   ` Torsten Bögershausen
  2017-12-18 13:12     ` Jeff King
  2017-12-18 18:02     ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
@ 2017-12-18 21:55     ` Johannes Sixt
  2 siblings, 0 replies; 36+ messages in thread
From: Johannes Sixt @ 2017-12-18 21:55 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: lars.schneider, git, gitster, peff, patrick, Lars Schneider

Am 18.12.2017 um 11:13 schrieb Torsten Bögershausen:
> Just to confirm my missing knowledge here:
> Does this mean, that git-gui and gitk can decode/reencode
> the content of a file/blob, when the .gitattributes say so ?

No. I think they parse the output of git-diff et.al., split it per file, 
and then consult .gitattributes to interpret the byte sequence according 
to the encoding.

-- Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-18 13:12     ` Jeff King
@ 2017-12-23  8:08       ` Torsten Bögershausen
  2017-12-29 13:28       ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Torsten Bögershausen @ 2017-12-23  8:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Sixt, lars.schneider, git, gitster, patrick,
	Lars Schneider

On Mon, Dec 18, 2017 at 08:12:49AM -0500, Jeff King wrote:
> On Mon, Dec 18, 2017 at 11:13:34AM +0100, Torsten Bögershausen wrote:
> 
> > Just to confirm my missing knowledge here:
> > Does this mean, that git-gui and gitk can decode/reencode
> > the content of a file/blob, when the .gitattributes say so ?
> 
> That's my impression, yes.
> 
> > If yes, would it make sense to enhance the "git diff" instead ?
> > "git diff --encoding" will pick up the commited encoding from
> > .attributes, convert it into UTF-8, and run the diff ?
> 
> You can do something like this already:
> 
>   git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
>   echo file diff=utf16 >.gitattributes
> 
> I have no objection to teaching it that "encoding" means roughly the
> same thing, which would have two advantages:
> 
>  - we could do it in-process, which would be much faster
> 
>  - we could skip the extra config step, which is a blocker for
>    server-side diffs on sites like GitHub
> 
> I don't think you'd really need "--encoding". This could just be
> triggered by the normal "--textconv" rules (i.e., just treat this "as
> if" the textconv above was configured when we see an encoding
> attribute).

I can probably come up with a patch the next weeks or so.

> 
> > We actually could enhance the "git diff" output with a single
> > line saying
> > "Git index-encoding=cp1251"
> > or so, which can be picked up by "git apply".
> 
> That extends it beyond what textconv can do (we assume that textconv
> patches are _just_ for human consumption, and can't be applied). And it
> would mean you'd potentially want to trigger it in more places. On the
> other hand, I don't know that we're guaranteed that a round-trip
> between encodings will produce a byte-wise identical result. The nice
> thing about piggy-backing on textconv is that it's already dealt with
> that problem.
> 
> -Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-17 17:14 ` Torsten Bögershausen
@ 2017-12-28 16:14   ` Lars Schneider
  2017-12-29 12:59     ` Torsten Bögershausen
  0 siblings, 1 reply; 36+ messages in thread
From: Lars Schneider @ 2017-12-28 16:14 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Git List, gitster, peff, patrick


> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 

>> +`encoding`
>> +^^^^^^^^^^
>> +
>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
> 
> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.

I am not sure I understand what you want to tell me here.
Do you think UTF-8 encoding is not correct in the sentence above?

> 
>> +attribute sets the encoding to be used in the working directory.
>> +If the path is added to the index, then Git encodes the content to
>> +UTF-8.  On checkout the content is encoded back to the original
>> +encoding.  Consequently, you can use all built-in Git text processing
>> +tools (e.g. git diff, line ending conversions, etc.) with your
>> +non-UTF-8 encoded file.
>> +
>> +Please note that re-encoding content can cause errors and requires
>> +resources. Use the `encoding` attribute only if you cannot store
>> +a file in UTF-8 encoding and if you want Git to be able to process
>> +the text.
>> +
>> +------------------------
>> +*.txt		text encoding=UTF-16
>> +------------------------
> 
> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
> (That is not done in convert.c, but assuming binary or even auto
> does not make much sense to me)

"text" only means "LF". "-text" means CRLF which would be perfectly fine
for UTF-16. Therefore I don't think we need to imply text.
Does this make sense?

> 
> 
>> +
>> +All `iconv` encodings with a stable round-trip conversion to and from
>> +UTF-8 are supported.  You can see a full list with the following command:
>> +
>> +------------------------
>> +iconv --list
>> +------------------------
>> +
>> `eol`
>> ^^^^^
>> 
>> diff --git a/convert.c b/convert.c
>> index 20d7ab67bd..ee19c17104 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -7,6 +7,7 @@
>> #include "sigchain.h"
>> #include "pkt-line.h"
>> #include "sub-process.h"
>> +#include "utf8.h"
>> 
>> /*
>>  * convert.c - convert a file when checking it out and checking it in.
>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> 
>> }
> 
> I would avoid to use these #ifdefs here.
> All functions can be in strbuf.c (and may have #ifdefs there), but not here.

I'll try that. But wouldn't it make more sense to move the functions to utf.c?

> 
>> 
>> +#ifdef NO_ICONV
>> +#ifndef _ICONV_T
>> +/* The type is just a placeholder and not actually used. */
>> +typedef void* iconv_t;
>> +#endif
>> +#endif
>> +
>> +static struct encoding {
>> +	const char *name;
>> +	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
>> +	iconv_t to_worktree;  /* conversion to user-defined encoding */
>> +	struct encoding *next;
>> +} *encoding, **encoding_tail;
> 
> This seems like an optimazation, assuning that iconv_open() eats a lot of ressources.
> I don't think this is the case. (Undere MacOS we run iconv_open() together with
> every opendir(), and optimizing this out did not show any measurable improvements)

True, but then I would need to free() the memory in a lot of places.
Therefore I thought it is easier to keep the object. OK for you?


>> +static const char *default_encoding = "utf-8";
> 
> The most portable form is "UTF-8" (correct me if that is wrong)

It shouldn't matter. But I've changed it to uppercase to be on the safe side.


>> +static iconv_t invalid_conversion = (iconv_t)-1;
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +			 struct strbuf *buf, struct encoding *enc)
>> +{
>> +#ifndef NO_ICONV
>> +	char *dst, *re_src;
>> +	int dst_len, re_src_len;
>> +
>> +	/*
>> +	 * No encoding is specified or there is nothing to encode.
>> +	 * Tell the caller that the content was not modified.
>> +	 */
>> +	if (!enc || (src && !src_len))
>> +		return 0;
>> +
>> +	/*
>> +	 * Looks like we got called from "would_convert_to_git()".
>> +	 * This means Git wants to know if it would encode (= modify!)
>> +	 * the content. Let's answer with "yes", since an encoding was
>> +	 * specified.
>> +	 */
>> +	if (!buf && !src)
>> +		return 1;
>> +
>> +	if (enc->to_git == invalid_conversion) {
>> +		enc->to_git = iconv_open(default_encoding, encoding->name);
>> +		if (enc->to_git == invalid_conversion)
>> +			warning(_("unsupported encoding %s"), encoding->name);
>> +	}
> 
> 	/* There are 2 different types of reaction:
> 	      Either users  know what that a warning means: You asked for problems,
> 	      	        and do the right thing. Other may may ignore the warning
> 			       - in both cases an error is useful */

Agreed!


>> +	if (enc->to_worktree == invalid_conversion)
>> +		enc->to_worktree = iconv_open(encoding->name, default_encoding);
>> +
>> +	/*
>> +	 * Do nothing if the encoding is not supported. This could happen in
>> +	 * two cases:
>> +	 *   (1) The encoding is garbage. That is no problem as we would not
>> +	 *       encode the content to UTF-8 on "add" and we would not encode
>> +	 *       it back on "checkout".
>> +	 *   (2) Git users use different iconv versions that support different
>> +	 *       encodings. This could lead to problems, as the content might
>> +	 *       not be encoded on "add" but encoded back on "checkout" (or
>> +	 *       the other way around).
>> +	 * We print a one-time warning to the user in both cases above.
>> +	 */
> 
> Isn't an error better than "garbage in -> garbage out" ?

Agreed. I changed the warning to an error.


>> diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
> 
> (I didn't review the test yet)
> 
> Another comment for a possible improvement:
> "git diff"  could be told to write the worktree-encoding into the diff,
> and "git apply" can pick that up. 

Yes, we could do that. However, I would tackle that in a separate series.


Thanks,
Lars




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-28 16:14   ` Lars Schneider
@ 2017-12-29 12:59     ` Torsten Bögershausen
  2017-12-29 13:56       ` Lars Schneider
  2018-01-03 19:15       ` Junio C Hamano
  0 siblings, 2 replies; 36+ messages in thread
From: Torsten Bögershausen @ 2017-12-29 12:59 UTC (permalink / raw)
  To: Lars Schneider; +Cc: Lars Schneider, Git List, gitster, peff, patrick

On 2017-12-28 17:14, Lars Schneider wrote:> 
>> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@web.de> wrote:
>>
>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>
> 
>>> +`encoding`
>>> +^^^^^^^^^^
>>> +
>>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
>>
>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
> 
> I am not sure I understand what you want to tell me here.
> Do you think UTF-8 encoding is not correct in the sentence above?

Git itself was designed to handle source code in ASCII.
Text files which are encoded in ISO-8859-1, UTF-8 or any
super-set of ASCII are handled as well, and what exactly to do
with bytes >0x80 is outside the responsibility of Git.
E.g. the interpretation and rendering on the screen may be
dependent on the locale or being guessed.
All in all, saying that Git expects UTF-8 may be "overdriven".
Any kind of file that uses an '\n' as an end of line
and has no NUL bytes '\0' works as a text file in Git.
(End of BlaBla ;-)

We can probably replace:
"By default Git assumes UTF-8 encoding for text files"

with something like
"Git handles UTF-8 encoded files as text files, but UTF-16 encoded
 files are handled as binary files"

>>
>>> +attribute sets the encoding to be used in the working directory.
>>> +If the path is added to the index, then Git encodes the content to
>>> +UTF-8.  On checkout the content is encoded back to the original
>>> +encoding.  Consequently, you can use all built-in Git text processing
>>> +tools (e.g. git diff, line ending conversions, etc.) with your
>>> +non-UTF-8 encoded file.
>>> +
>>> +Please note that re-encoding content can cause errors and requires
>>> +resources. Use the `encoding` attribute only if you cannot store
>>> +a file in UTF-8 encoding and if you want Git to be able to process
>>> +the text.
>>> +
>>> +------------------------
>>> +*.txt		text encoding=UTF-16
>>> +------------------------
>>
>> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
>> (That is not done in convert.c, but assuming binary or even auto
>> does not make much sense to me)
> 
> "text" only means "LF". "-text" means CRLF which would be perfectly fine
> for UTF-16. Therefore I don't think we need to imply text.
> Does this make sense?
Yes and now.

"text" means convert CRLF at "git add" (or commit) into LF,
And potentially LF into CRLF at checkout,
depending on the EOL attribute (if set), core.autocrlf and/or core.eol.

"-text" means don't touch CRLF or LF at all. Files with CRLF are commited
with CRLF.
Running a Unix like "diff" tool will often show "^M" to indicate that there
is a CR before the LF, which may be distracting or confusing.

If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
it makes sense to convert the line endings into LF before storing them
into the index (at least this is my experience).

(Not specifying "text" or "-text" at all will trigger the auto-handling,
 which is not needed at all).
So if we want to be sure that line endings are CRLF in the worktree we
should ask the user to specify things like this:

*.ext worktree-encoding=UTF-16 text eol=CRLF

It may be enough to give this example in the documentation.
and I would rather be over-careful here, to avoid future problems.

> 
>>
>>
>>> +
>>> +All `iconv` encodings with a stable round-trip conversion to and from
>>> +UTF-8 are supported.  You can see a full list with the following command:
>>> +
>>> +------------------------
>>> +iconv --list
>>> +------------------------
>>> +
>>> `eol`
>>> ^^^^^
>>>
>>> diff --git a/convert.c b/convert.c
>>> index 20d7ab67bd..ee19c17104 100644
>>> --- a/convert.c
>>> +++ b/convert.c
>>> @@ -7,6 +7,7 @@
>>> #include "sigchain.h"
>>> #include "pkt-line.h"
>>> #include "sub-process.h"
>>> +#include "utf8.h"
>>>
>>> /*
>>>  * convert.c - convert a file when checking it out and checking it in.
>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>>>
>>> }
>>
>> I would avoid to use these #ifdefs here.
>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
> 
> I'll try that. But wouldn't it make more sense to move the functions to utf.c?

May be.
Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
Especially it didn't know anything about strbuf.
I don't know why strbuf.h and other functions had been added here,

I once moved them into strbuf.c without any problems, but never send out
a patch, because of possible merge conflicts in ongoing patches.

In any case, if it is about strbuf, I would try to put it into strbuf.c

>>
>>>
>>> +#ifdef NO_ICONV
>>> +#ifndef _ICONV_T
>>> +/* The type is just a placeholder and not actually used. */
>>> +typedef void* iconv_t;
>>> +#endif
>>> +#endif
>>> +
>>> +static struct encoding {
>>> +	const char *name;
>>> +	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
>>> +	iconv_t to_worktree;  /* conversion to user-defined encoding */
>>> +	struct encoding *next;
>>> +} *encoding, **encoding_tail;
>>
>> This seems like an optimazation, assuning that iconv_open() eats a lot of ressources.
>> I don't think this is the case. (Undere MacOS we run iconv_open() together with
>> every opendir(), and optimizing this out did not show any measurable improvements)
> 
> True, but then I would need to free() the memory in a lot of places.How many ?

> Therefore I thought it is easier to keep the object. OK for you?

My first impression was that this optimization made the review hard to do.
I would rather see a clean alloc/free open/close scheme in one patch.
This allows to follow the logic.
And, if needed, an optimizing patch on top of that, saying now
we saved 30 lines of code.

> 
> 
>>> +static const char *default_encoding = "utf-8";
>>
>> The most portable form is "UTF-8" (correct me if that is wrong)
> 
> It shouldn't matter. But I've changed it to uppercase to be on the safe side.Good.

For the interested reader, see utf8.c:
static const char *fallback_encoding(const char *name)
{
	/*
	 * Some platforms do not have the variously spelled variants of
	 * UTF-8, so let's fall back to trying the most official
	 * spelling. We do so only as a fallback in case the platform
	 * does understand the user's spelling, but not our official
	 * one.
	 */

 
> 
>>> +static iconv_t invalid_conversion = (iconv_t)-1;
>>> +
>>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>>> +			 struct strbuf *buf, struct encoding *enc)
>>> +{
>>> +#ifndef NO_ICONV
>>> +	char *dst, *re_src;
>>> +	int dst_len, re_src_len;
>>> +
>>> +	/*
>>> +	 * No encoding is specified or there is nothing to encode.
>>> +	 * Tell the caller that the content was not modified.
>>> +	 */
>>> +	if (!enc || (src && !src_len))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * Looks like we got called from "would_convert_to_git()".
>>> +	 * This means Git wants to know if it would encode (= modify!)
>>> +	 * the content. Let's answer with "yes", since an encoding was
>>> +	 * specified.
>>> +	 */
>>> +	if (!buf && !src)
>>> +		return 1;
>>> +
>>> +	if (enc->to_git == invalid_conversion) {
>>> +		enc->to_git = iconv_open(default_encoding, encoding->name);
>>> +		if (enc->to_git == invalid_conversion)
>>> +			warning(_("unsupported encoding %s"), encoding->name);
>>> +	}
>>
>> 	/* There are 2 different types of reaction:
>> 	      Either users  know what that a warning means: You asked for problems,
>> 	      	        and do the right thing. Other may may ignore the warning
>> 			       - in both cases an error is useful */
> 
> Agreed!
> 
> 
>>> +	if (enc->to_worktree == invalid_conversion)
>>> +		enc->to_worktree = iconv_open(encoding->name, default_encoding);
>>> +
>>> +	/*
>>> +	 * Do nothing if the encoding is not supported. This could happen in
>>> +	 * two cases:
>>> +	 *   (1) The encoding is garbage. That is no problem as we would not
>>> +	 *       encode the content to UTF-8 on "add" and we would not encode
>>> +	 *       it back on "checkout".
>>> +	 *   (2) Git users use different iconv versions that support different
>>> +	 *       encodings. This could lead to problems, as the content might
>>> +	 *       not be encoded on "add" but encoded back on "checkout" (or
>>> +	 *       the other way around).
>>> +	 * We print a one-time warning to the user in both cases above.
>>> +	 */
>>
>> Isn't an error better than "garbage in -> garbage out" ?
> 
> Agreed. I changed the warning to an error.
> 
> 
>>> diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
>>
>> (I didn't review the test yet)
>>
>> Another comment for a possible improvement:
>> "git diff"  could be told to write the worktree-encoding into the diff,
>> and "git apply" can pick that up. 
> 
> Yes, we could do that. However, I would tackle that in a separate series.
> 
> 
> Thanks,
> Lars
> 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH/RFC 0/2] git diff --UTF-8
  2017-12-18 13:12     ` Jeff King
  2017-12-23  8:08       ` Torsten Bögershausen
@ 2017-12-29 13:28       ` tboegi
  2017-12-29 13:28       ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: tboegi @ 2017-12-29 13:28 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

RFC patch: convert files from e.g. UTF-16 into UTF-8 while running
"git diff".
The diff must be called with "git diff --UTF-8" and the "encoding"
attribute must be set for the file(s).

The commit messages may need some improvements, and a closer look
at diff.c, how command line options are forwared, is appreciated.

It may even be possible to integrate t4066 somewhere...

Torsten Bögershausen (2):
  convert_to_git(): checksafe becomes an integer
  git diff: Allow to reencode into UTF-8

 Documentation/diff-options.txt  |  4 ++
 Documentation/gitattributes.txt |  9 +++++
 apply.c                         |  4 +-
 convert.c                       | 60 +++++++++++++++++++++++-----
 convert.h                       | 20 +++++-----
 diff.c                          | 40 +++++++++++++++++--
 diff.h                          |  1 +
 diffcore.h                      |  3 ++
 environment.c                   |  2 +-
 sha1_file.c                     |  6 +--
 t/t4066-diff-encoding.sh        | 86 +++++++++++++++++++++++++++++++++++++++++
 11 files changed, 205 insertions(+), 30 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

-- 
2.15.1.271.g1a4e40aa5d


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer
  2017-12-18 13:12     ` Jeff King
  2017-12-23  8:08       ` Torsten Bögershausen
  2017-12-29 13:28       ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
@ 2017-12-29 13:28       ` tboegi
  2017-12-29 13:28       ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
  2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
  4 siblings, 0 replies; 36+ messages in thread
From: tboegi @ 2017-12-29 13:28 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When calling convert_to_git(), the checksafe parameter has been used to
check if commit would give a non-roundtrip conversion of EOL.

When checksafe was introduced, 3 values had been in use:
SAFE_CRLF_FALSE: no warning
SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

Today a small flaw is found in the code base:
An integer with the value 0 is passed as the parameter checksafe
instead of the correct enum value SAFE_CRLF_FALSE.

In the next commit there is a need to turn checksafe into a bitmap, which
allows to tell convert_to_git() to obey the encoding attribute or not.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 apply.c       |  4 ++--
 convert.c     | 20 ++++++++++----------
 convert.h     | 18 ++++++++----------
 diff.c        |  4 ++--
 environment.c |  2 +-
 sha1_file.c   |  6 +++---
 6 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..a422516062 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,7 +2263,7 @@ static void show_stats(struct apply_state *state, struct patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 			 const char *path, struct strbuf *buf)
 {
-	enum safe_crlf safe_crlf = patch->crlf_in_old ?
+	int checksafe = patch->crlf_in_old ?
 		SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
 	switch (st->st_mode & S_IFMT) {
 	case S_IFLNK:
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, checksafe);
 		return 0;
 	default:
 		return -1;
diff --git a/convert.c b/convert.c
index 1a41a48e15..5efcc3b73b 100644
--- a/convert.c
+++ b/convert.c
@@ -195,13 +195,13 @@ static enum eol output_eol(enum crlf_action crlf_action)
 
 static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
 			    struct text_stat *old_stats, struct text_stat *new_stats,
-			    enum safe_crlf checksafe)
+			    int checksafe)
 {
 	if (old_stats->crlf && !new_stats->crlf ) {
 		/*
 		 * CRLFs would not be restored by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (checksafe & SAFE_CRLF_WARN)
 			warning(_("CRLF will be replaced by LF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
@@ -211,7 +211,7 @@ static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
 		/*
 		 * CRLFs would be added by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (checksafe & SAFE_CRLF_WARN)
 			warning(_("LF will be replaced by CRLF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
@@ -268,7 +268,7 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
-		       enum crlf_action crlf_action, enum safe_crlf checksafe)
+		       enum crlf_action crlf_action, int checksafe)
 {
 	struct text_stat stats;
 	char *dst;
@@ -298,12 +298,12 @@ static int crlf_to_git(const struct index_state *istate,
 		 * unless we want to renormalize in a merge or
 		 * cherry-pick.
 		 */
-		if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+		if ((!(checksafe & SAFE_CRLF_RENORMALIZE)) &&
 		    has_crlf_in_index(istate, path))
 			convert_crlf_into_lf = 0;
 	}
-	if ((checksafe == SAFE_CRLF_WARN ||
-	    (checksafe == SAFE_CRLF_FAIL)) && len) {
+	if (((checksafe & SAFE_CRLF_WARN) ||
+	     ((checksafe & SAFE_CRLF_FAIL) && len))) {
 		struct text_stat new_stats;
 		memcpy(&new_stats, &stats, sizeof(new_stats));
 		/* simulate "git add" */
@@ -1129,7 +1129,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-                   struct strbuf *dst, enum safe_crlf checksafe)
+		   struct strbuf *dst, int checksafe)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1144,7 +1144,7 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
-	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+	if (!(checksafe & SAFE_CRLF_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
 		if (ret && dst) {
 			src = dst->buf;
@@ -1156,7 +1156,7 @@ int convert_to_git(const struct index_state *istate,
 
 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe)
+			      int checksafe)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
diff --git a/convert.h b/convert.h
index 4f2da225a8..532af00423 100644
--- a/convert.h
+++ b/convert.h
@@ -8,15 +8,13 @@
 
 struct index_state;
 
-enum safe_crlf {
-	SAFE_CRLF_FALSE = 0,
-	SAFE_CRLF_FAIL = 1,
-	SAFE_CRLF_WARN = 2,
-	SAFE_CRLF_RENORMALIZE = 3,
-	SAFE_CRLF_KEEP_CRLF = 4
-};
+#define SAFE_CRLF_FALSE       0
+#define SAFE_CRLF_FAIL        (1<<0)
+#define SAFE_CRLF_WARN        (1<<1)
+#define SAFE_CRLF_RENORMALIZE (1<<2)
+#define SAFE_CRLF_KEEP_CRLF   (1<<3)
 
-extern enum safe_crlf safe_crlf;
+extern int safe_crlf;
 
 enum auto_crlf {
 	AUTO_CRLF_FALSE = 0,
@@ -66,7 +64,7 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe);
+			  struct strbuf *dst, int checksafe);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -85,7 +83,7 @@ static inline int would_convert_to_git(const struct index_state *istate,
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe);
+				     int checksafe);
 extern int would_convert_to_git_filter_fd(const char *path);
 
 /*****************************************************************
diff --git a/diff.c b/diff.c
index fb22b19f09..5e3aaea6e0 100644
--- a/diff.c
+++ b/diff.c
@@ -3524,7 +3524,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 	 * demote FAIL to WARN to allow inspecting the situation
 	 * instead of refusing.
 	 */
-	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
+	int checksafe = (safe_crlf == SAFE_CRLF_FAIL
 				    ? SAFE_CRLF_WARN
 				    : safe_crlf);
 
@@ -3603,7 +3603,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, checksafe)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..10ee89a28d 100644
--- a/environment.c
+++ b/environment.c
@@ -49,7 +49,7 @@ enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
 int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
-enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
+int safe_crlf = SAFE_CRLF_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/sha1_file.c b/sha1_file.c
index 3da70ac650..78e002392e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -133,7 +133,7 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
 }
 
 
-static enum safe_crlf get_safe_crlf(unsigned flags)
+static int get_checksafe(unsigned flags)
 {
 	if (flags & HASH_RENORMALIZE)
 		return SAFE_CRLF_RENORMALIZE;
@@ -1752,7 +1752,7 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags))) {
+				   get_checksafe(flags))) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1786,7 +1786,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));
 
 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags));
+				 get_checksafe(flags));
 
 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
-- 
2.15.1.271.g1a4e40aa5d


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8
  2017-12-18 13:12     ` Jeff King
                         ` (2 preceding siblings ...)
  2017-12-29 13:28       ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
@ 2017-12-29 13:28       ` tboegi
  2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
  4 siblings, 0 replies; 36+ messages in thread
From: tboegi @ 2017-12-29 13:28 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When blobs are encoded in UTF-16, `git diff` will treat them as binary.
Make it possible to show a user readable diff encoded in UTF-8.
This allows to run git diff and feed the into a web sever.

Improve Git to look at the "encodig" attribute and to reencode the
content into UTF-8 before running the diff itself.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 Documentation/diff-options.txt  |  4 ++
 Documentation/gitattributes.txt |  9 +++++
 convert.c                       | 40 +++++++++++++++++++
 convert.h                       |  2 +
 diff.c                          | 38 ++++++++++++++++--
 diff.h                          |  1 +
 diffcore.h                      |  3 ++
 t/t4066-diff-encoding.sh        | 86 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 180 insertions(+), 3 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 9d1586b956..bf2f115f11 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -629,6 +629,10 @@ endif::git-format-patch[]
 	linkgit:git-log[1], but not for linkgit:git-format-patch[1] or
 	diff plumbing commands.
 
+--UTF-8::
+	Git converts the content into UTF-8 before running the diff when the
+	"encoding" attribute is defined. See linkgit:gitattributes[5]
+
 --ignore-submodules[=<when>]::
 	Ignore changes to submodules in the diff generation. <when> can be
 	either "none", "untracked", "dirty" or "all", which is the default.
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..753a7c39b7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -881,6 +881,15 @@ advantages to choosing this method:
 3. Caching. Textconv caching can speed up repeated diffs, such as those
    you might trigger by running `git log -p`.
 
+Running diff on UTF-16 encoded files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Git can convert UTF-16 encoded into UTF-8 before they are feed
+into the diff machinery: `diff --UTF-8 file.xxx`.
+
+------------------------
+file.xxx encoding=UTF-16
+------------------------
 
 Marking files as binary
 ^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/convert.c b/convert.c
index 5efcc3b73b..45577ce504 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -734,6 +735,34 @@ static struct convert_driver {
 	int required;
 } *user_convert, **user_convert_tail;
 
+const char *get_encoding_attr(const char *path)
+{
+	static struct attr_check *check;
+	if (!check)
+		check = attr_check_initl("encoding", NULL);
+	if (!git_check_attr(path, check)) {
+		struct attr_check_item *ccheck = check->items;
+		const char *value;
+		value = ccheck->value;
+		if (ATTR_UNSET(value))
+			return NULL;
+		return value;
+	}
+	return NULL;
+}
+
+static int reencode_into_strbuf(const char *path, const char *src, size_t len,
+				struct strbuf *dst, const char *encoding)
+{
+	int outsz = 0;
+	char *buf;
+	buf = reencode_string_len(src, (int)len, "UTF-8", encoding, &outsz);
+	if (!buf)
+		return 0;
+	strbuf_attach(dst, buf, outsz, outsz);
+	return SAFE_CRLF_REENCODE;
+}
+
 static int apply_filter(const char *path, const char *src, size_t len,
 			int fd, struct strbuf *dst, struct convert_driver *drv,
 			const unsigned int wanted_capability,
@@ -1136,6 +1165,17 @@ int convert_to_git(const struct index_state *istate,
 
 	convert_attrs(&ca, path);
 
+	if (checksafe & SAFE_CRLF_REENCODE) {
+		const char *encoding = get_encoding_attr(path);
+		if (encoding) {
+			ret |= reencode_into_strbuf(path, src, len, dst,
+						    encoding);
+			if (ret && dst) {
+				src = dst->buf;
+				len = dst->len;
+			}
+		}
+	}
 	ret |= apply_filter(path, src, len, -1, dst, ca.drv, CAP_CLEAN, NULL);
 	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
diff --git a/convert.h b/convert.h
index 532af00423..0b093715c9 100644
--- a/convert.h
+++ b/convert.h
@@ -13,6 +13,7 @@ struct index_state;
 #define SAFE_CRLF_WARN        (1<<1)
 #define SAFE_CRLF_RENORMALIZE (1<<2)
 #define SAFE_CRLF_KEEP_CRLF   (1<<3)
+#define SAFE_CRLF_REENCODE    (1<<4)
 
 extern int safe_crlf;
 
@@ -60,6 +61,7 @@ extern const char *get_cached_convert_stats_ascii(const struct index_state *ista
 						  const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
 extern const char *get_convert_attr_ascii(const char *path);
+extern const char *get_encoding_attr(const char *path);
 
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
diff --git a/diff.c b/diff.c
index 5e3aaea6e0..07480a465c 100644
--- a/diff.c
+++ b/diff.c
@@ -3191,6 +3191,12 @@ static void builtin_diff(const char *name_a,
 					 header.buf, header.len, 0);
 			strbuf_reset(&header);
 		}
+		if (one && one->reencoded_to_utf8)
+		  strbuf_addf(&header, "a is converted to UTF-8 from %s\n",
+			      get_encoding_attr(one->path));
+		if (two && two->reencoded_to_utf8)
+		  strbuf_addf(&header, "b is converted to UTF-8 from %s\n",
+			      get_encoding_attr(two->path));
 
 		mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
 		mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
@@ -3520,6 +3526,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 {
 	int size_only = flags & CHECK_SIZE_ONLY;
 	int err = 0;
+	int ret = 0;
 	/*
 	 * demote FAIL to WARN to allow inspecting the situation
 	 * instead of refusing.
@@ -3527,7 +3534,8 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 	int checksafe = (safe_crlf == SAFE_CRLF_FAIL
 				    ? SAFE_CRLF_WARN
 				    : safe_crlf);
-
+	if (s->reencode_to_utf8)
+		checksafe |= SAFE_CRLF_REENCODE;
 	if (!DIFF_FILE_VALID(s))
 		die("internal error: asking to populate invalid file.");
 	if (S_ISDIR(s->mode))
@@ -3603,17 +3611,22 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, checksafe)) {
+		ret = convert_to_git(&the_index, s->path, s->data, s->size, &buf, checksafe);
+
+		if (ret) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
 			s->data = strbuf_detach(&buf, &size);
 			s->size = size;
 			s->should_free = 1;
+			if (ret & SAFE_CRLF_REENCODE)
+				 s->reencoded_to_utf8 = 1;
 		}
 	}
 	else {
 		enum object_type type;
+		const char *encoding = NULL;
 		if (size_only || (flags & CHECK_BINARY)) {
 			type = sha1_object_info(s->oid.hash, &s->size);
 			if (type < 0)
@@ -3629,6 +3642,20 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		s->data = read_sha1_file(s->oid.hash, &type, &s->size);
 		if (!s->data)
 			die("unable to read %s", oid_to_hex(&s->oid));
+		if (s->reencode_to_utf8)
+			encoding = get_encoding_attr(s->path);
+		if (encoding) {
+			int outsz = 0;
+			char *buf;
+			buf = reencode_string_len(s->data, (int)s->size,
+						  "UTF-8", encoding, &outsz);
+			if (buf) {
+				free(s->data);
+				s->data = buf;
+				s->size = outsz;
+				s->reencoded_to_utf8 = 1;
+			}
+		}
 		s->should_free = 1;
 	}
 	return 0;
@@ -4627,7 +4654,9 @@ int diff_opt_parse(struct diff_options *options,
 		enable_patch_output(&options->output_format);
 		options->flags.binary = 1;
 	}
-	else if (!strcmp(arg, "--full-index"))
+	else if (!strcmp(arg, "--UTF-8")) {
+		options->flags.reencode_to_utf8 = 1;
+	} else if (!strcmp(arg, "--full-index"))
 		options->flags.full_index = 1;
 	else if (!strcmp(arg, "-a") || !strcmp(arg, "--text"))
 		options->flags.text = 1;
@@ -5695,6 +5724,8 @@ static int diff_filespec_is_identical(struct diff_filespec *one,
 
 static int diff_filespec_check_stat_unmatch(struct diff_filepair *p)
 {
+	p->one->reencode_to_utf8 = p->reencode_to_utf8;
+	p->two->reencode_to_utf8 = p->reencode_to_utf8;
 	if (p->done_skip_stat_unmatch)
 		return p->skip_stat_unmatch_result;
 
@@ -5735,6 +5766,7 @@ static void diffcore_skip_stat_unmatch(struct diff_options *diffopt)
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
 
+		p->reencode_to_utf8 = diffopt->flags.reencode_to_utf8;
 		if (diff_filespec_check_stat_unmatch(p))
 			diff_q(&outq, p);
 		else {
diff --git a/diff.h b/diff.h
index 7cf276f077..d2137bab58 100644
--- a/diff.h
+++ b/diff.h
@@ -65,6 +65,7 @@ struct diff_flags {
 	unsigned recursive:1;
 	unsigned tree_in_recursive:1;
 	unsigned binary:1;
+	unsigned reencode_to_utf8:1;
 	unsigned text:1;
 	unsigned full_index:1;
 	unsigned silent_on_remove:1;
diff --git a/diffcore.h b/diffcore.h
index a30da161da..2e84730778 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -47,6 +47,8 @@ struct diff_filespec {
 	unsigned has_more_entries : 1; /* only appear in combined diff */
 	/* data should be considered "binary"; -1 means "don't know yet" */
 	signed int is_binary : 2;
+	unsigned reencode_to_utf8 : 1;
+	unsigned reencoded_to_utf8 : 1;
 	struct userdiff_driver *driver;
 };
 
@@ -72,6 +74,7 @@ struct diff_filepair {
 	unsigned is_unmerged : 1;
 	unsigned done_skip_stat_unmatch : 1;
 	unsigned skip_stat_unmatch_result : 1;
+	unsigned reencode_to_utf8 : 1;
 };
 #define DIFF_PAIR_UNMERGED(p) ((p)->is_unmerged)
 
diff --git a/t/t4066-diff-encoding.sh b/t/t4066-diff-encoding.sh
new file mode 100755
index 0000000000..9b89253877
--- /dev/null
+++ b/t/t4066-diff-encoding.sh
@@ -0,0 +1,86 @@
+#!/bin/sh
+
+test_description='git diff with encoding attribute'
+
+. ./test-lib.sh
+
+printf '\303\244rger\n\303\266se\n\303\274bel\n' |
+	iconv -f UTF-8 -t UTF-16 >UTF-16
+printf '\303\266se\n\303\274bel\n\303\245gren\n' |
+	iconv -f UTF-8 -t UTF-16 >file2
+
+test_expect_success 'setup' '
+	cp UTF-16 file &&
+	git add file &&
+	git commit -m "add file in UTF-16" &&
+	test_tick &&
+	echo "file encoding=UTF-16" >.gitattributes
+'
+
+test_expect_success 'diff --UTF-8 against local change' '
+	cp file2 file &&
+	test_tick &&
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..06d06e4 100644
+	a is converted to UTF-8 from UTF-16
+	b is converted to UTF-8 from UTF-16
+	--- a/file
+	+++ b/file
+	@@ -1,3 +1,3 @@
+	-ärger
+	 öse
+	 übel
+	+ågren
+EOF
+	git diff --UTF-8 file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff against local change' '
+	cp file2 file &&
+	test_tick &&
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..06d06e4 100644
+	Binary files a/file and b/file differ
+EOF
+	git diff file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'commit local change' '
+	git add file &&
+	git commit -m "add file V2 in UTF-16" &&
+	test_tick
+'
+
+test_expect_success 'diff --UTF-8  HEAD against HEAD^' '
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..06d06e4 100644
+	a is converted to UTF-8 from UTF-16
+	b is converted to UTF-8 from UTF-16
+	--- a/file
+	+++ b/file
+	@@ -1,3 +1,3 @@
+	-ärger
+	 öse
+	 übel
+	+ågren
+EOF
+	git diff --UTF-8 HEAD^ HEAD -- file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff HEAD against HEAD^' '
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..06d06e4 100644
+	Binary files a/file and b/file differ
+EOF
+	git diff HEAD^ HEAD -- file >actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.15.1.271.g1a4e40aa5d


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-29 12:59     ` Torsten Bögershausen
@ 2017-12-29 13:56       ` Lars Schneider
  2018-01-03 19:15       ` Junio C Hamano
  1 sibling, 0 replies; 36+ messages in thread
From: Lars Schneider @ 2017-12-29 13:56 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Git List, Junio C Hamano, peff, patrick


> On 29 Dec 2017, at 13:59, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On 2017-12-28 17:14, Lars Schneider wrote:> 
>>> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@web.de> wrote:
>>> 
>>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> 
>> 
>>>> +`encoding`
>>>> +^^^^^^^^^^
>>>> +
>>>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
>>> 
>>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
>> 
>> I am not sure I understand what you want to tell me here.
>> Do you think UTF-8 encoding is not correct in the sentence above?
> 
> Git itself was designed to handle source code in ASCII.
> Text files which are encoded in ISO-8859-1, UTF-8 or any
> super-set of ASCII are handled as well, and what exactly to do
> with bytes >0x80 is outside the responsibility of Git.
> E.g. the interpretation and rendering on the screen may be
> dependent on the locale or being guessed.
> All in all, saying that Git expects UTF-8 may be "overdriven".
> Any kind of file that uses an '\n' as an end of line
> and has no NUL bytes '\0' works as a text file in Git.
> (End of BlaBla ;-)
> 
> We can probably replace:
> "By default Git assumes UTF-8 encoding for text files"
> 
> with something like
> "Git handles UTF-8 encoded files as text files, but UTF-16 encoded
> files are handled as binary files"

Well, UTF-32 are handled as binary file, too :-)

How about this?

    Git recognizes files encoded with ASCII or one of its supersets
    (e.g. UTF-8 or ISO-8859-1) as text files.  All other...


> 
>>> 
>>>> +attribute sets the encoding to be used in the working directory.
>>>> +If the path is added to the index, then Git encodes the content to
>>>> +UTF-8.  On checkout the content is encoded back to the original
>>>> +encoding.  Consequently, you can use all built-in Git text processing
>>>> +tools (e.g. git diff, line ending conversions, etc.) with your
>>>> +non-UTF-8 encoded file.
>>>> +
>>>> +Please note that re-encoding content can cause errors and requires
>>>> +resources. Use the `encoding` attribute only if you cannot store
>>>> +a file in UTF-8 encoding and if you want Git to be able to process
>>>> +the text.
>>>> +
>>>> +------------------------
>>>> +*.txt		text encoding=UTF-16
>>>> +------------------------
>>> 
>>> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
>>> (That is not done in convert.c, but assuming binary or even auto
>>> does not make much sense to me)
>> 
>> "text" only means "LF". "-text" means CRLF which would be perfectly fine
>> for UTF-16. Therefore I don't think we need to imply text.
>> Does this make sense?
> Yes and now.
> 
> "text" means convert CRLF at "git add" (or commit) into LF,
> And potentially LF into CRLF at checkout,
> depending on the EOL attribute (if set), core.autocrlf and/or core.eol.
> 
> "-text" means don't touch CRLF or LF at all. Files with CRLF are commited
> with CRLF.
> Running a Unix like "diff" tool will often show "^M" to indicate that there
> is a CR before the LF, which may be distracting or confusing.
> 
> If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
> it makes sense to convert the line endings into LF before storing them
> into the index (at least this is my experience).
> 
> (Not specifying "text" or "-text" at all will trigger the auto-handling,
> which is not needed at all).
> So if we want to be sure that line endings are CRLF in the worktree we
> should ask the user to specify things like this:
> 
> *.ext worktree-encoding=UTF-16 text eol=CRLF
> 
> It may be enough to give this example in the documentation.
> and I would rather be over-careful here, to avoid future problems.

Agreed. I'll add that example!


>>>> +
>>>> +All `iconv` encodings with a stable round-trip conversion to and from
>>>> +UTF-8 are supported.  You can see a full list with the following command:
>>>> +
>>>> +------------------------
>>>> +iconv --list
>>>> +------------------------
>>>> +
>>>> `eol`
>>>> ^^^^^
>>>> 
>>>> diff --git a/convert.c b/convert.c
>>>> index 20d7ab67bd..ee19c17104 100644
>>>> --- a/convert.c
>>>> +++ b/convert.c
>>>> @@ -7,6 +7,7 @@
>>>> #include "sigchain.h"
>>>> #include "pkt-line.h"
>>>> #include "sub-process.h"
>>>> +#include "utf8.h"
>>>> 
>>>> /*
>>>> * convert.c - convert a file when checking it out and checking it in.
>>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>>>> 
>>>> }
>>> 
>>> I would avoid to use these #ifdefs here.
>>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
>> 
>> I'll try that. But wouldn't it make more sense to move the functions to utf.c?
> 
> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions had been added here,
> 
> I once moved them into strbuf.c without any problems, but never send out
> a patch, because of possible merge conflicts in ongoing patches.
> 
> In any case, if it is about strbuf, I would try to put it into strbuf.c

I think I simplified the code. I reused "reencode_string_len".
I added just two BOM check functions to utf8.c ... I'll send out a new series
in a bit.

- Lars


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2017-12-29 12:59     ` Torsten Bögershausen
  2017-12-29 13:56       ` Lars Schneider
@ 2018-01-03 19:15       ` Junio C Hamano
  2018-01-03 20:45         ` Lars Schneider
  1 sibling, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2018-01-03 19:15 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Lars Schneider, Git List, peff, patrick

Torsten Bögershausen <tboegi@web.de> writes:

> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions had been added here,
>
> I once moved them into strbuf.c without any problems, but never send out
> a patch, because of possible merge conflicts in ongoing patches.
>
> In any case, if it is about strbuf, I would try to put it into strbuf.c

Please don't.

A code that happens to use strbuf as a container and about
manipulating the contents is quite different from a code about
strbuf.  The latter is to enhance and extend how the strbuf as a
container behaves.  An operation about character encoding for a
string that happens to be stored in strbuf is more about the
encoding, and much much less about strbuf.

convert.c is about massaging contents coming from the outside world
into a shape stored in Git and the other way around, and there are
multiple ways the contents are massaged.  EOL convention may be
adjusted, characters may be reencoded, end-user defined conversion
may be applied.  Some of these operations may use helpers specific
for the task from other more library-ish files, like checking if a
string looks like encoded in UTF-8 from utf8.[ch].

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v1] convert: add support for 'encoding' attribute
  2018-01-03 19:15       ` Junio C Hamano
@ 2018-01-03 20:45         ` Lars Schneider
  0 siblings, 0 replies; 36+ messages in thread
From: Lars Schneider @ 2018-01-03 20:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, Git List, peff,
	patrick


On 03 Jan 2018, at 20:15, Junio C Hamano <gitster@pobox.com> wrote:

> Torsten Bögershausen <tboegi@web.de> writes:
> 
>> May be.
>> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
>> Especially it didn't know anything about strbuf.
>> I don't know why strbuf.h and other functions had been added here,
>> 
>> I once moved them into strbuf.c without any problems, but never send out
>> a patch, because of possible merge conflicts in ongoing patches.
>> 
>> In any case, if it is about strbuf, I would try to put it into strbuf.c
> 
> Please don't.
> 
> A code that happens to use strbuf as a container and about
> manipulating the contents is quite different from a code about
> strbuf.  The latter is to enhance and extend how the strbuf as a
> container behaves.  An operation about character encoding for a
> string that happens to be stored in strbuf is more about the
> encoding, and much much less about strbuf.
> 
> convert.c is about massaging contents coming from the outside world
> into a shape stored in Git and the other way around, and there are
> multiple ways the contents are massaged.  EOL convention may be
> adjusted, characters may be reencoded, end-user defined conversion
> may be applied.  Some of these operations may use helpers specific
> for the task from other more library-ish files, like checking if a
> string looks like encoded in UTF-8 from utf8.[ch].

Agreed. I did that in v2. See these patches:

https://public-inbox.org/git/20171229152222.39680-3-lars.schneider@autodesk.com/
https://public-inbox.org/git/20171229152222.39680-4-lars.schneider@autodesk.com/

- Lars

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8
  2017-12-18 13:12     ` Jeff King
                         ` (3 preceding siblings ...)
  2017-12-29 13:28       ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
@ 2018-02-26 17:27       ` tboegi
  2018-02-26 18:43         ` Peter Krefting
  2018-02-27 22:39         ` Jeff King
  4 siblings, 2 replies; 36+ messages in thread
From: tboegi @ 2018-02-26 17:27 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When an UTF-16 file is commited and later changed, `git diff` shows
"Binary files XX and YY differ".

When the user wants a diff in UTF-8, a textconv needs to be specified
in .gitattributes and the textconv must be configured.

A more user-friendly diff can be produced for UTF-16 if
- the user did not use `git diff --binary`
- the blob is identified as binary
- the blob has an UTF-16 BOM
- the blob can be converted into UTF-8

Enhance the diff machinery to auto-detect UTF-16 blobs and show them
as UTF-8, unless the user specifies `git diff --binary` which creates
a binary diff.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                   | 43 ++++++++++++++++++++-
 diffcore.h               |  3 ++
 t/t4066-diff-encoding.sh | 98 ++++++++++++++++++++++++++++++++++++++++++++++++
 utf8.h                   | 11 ++++++
 4 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100755 t/t4066-diff-encoding.sh

diff --git a/diff.c b/diff.c
index fb22b19f09..51831ee94d 100644
--- a/diff.c
+++ b/diff.c
@@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
 			strbuf_reset(&header);
 		}
 
+		if (one && one->reencoded_from_utf16)
+			strbuf_addf(&header, "a is converted to UTF-8 from UTF-16\n");
+		if (two && two->reencoded_from_utf16)
+			strbuf_addf(&header, "b is converted to UTF-8 from UTF-16\n");
 		mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
 		mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
 
@@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 			s->size = size;
 			s->should_free = 1;
 		}
-	}
-	else {
+		if (!s->binary && buffer_is_binary(s->data, s->size) &&
+		    buffer_has_utf16_bom(s->data, s->size)) {
+			int outsz = 0;
+			char *outbuf;
+			outbuf = reencode_string_len(s->data, (int)s->size,
+						     "UTF-8", "UTF-16", &outsz);
+			if (outbuf) {
+				if (s->should_free)
+					free(s->data);
+				if (s->should_munmap)
+					munmap(s->data, s->size);
+				s->should_munmap = 0;
+				s->data = outbuf;
+				s->size = outsz;
+				s->reencoded_from_utf16 = 1;
+				s->should_free = 1;
+			}
+		}
+	} else {
 		enum object_type type;
 		if (size_only || (flags & CHECK_BINARY)) {
 			type = sha1_object_info(s->oid.hash, &s->size);
@@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		s->data = read_sha1_file(s->oid.hash, &type, &s->size);
 		if (!s->data)
 			die("unable to read %s", oid_to_hex(&s->oid));
+		if (!s->binary && buffer_is_binary(s->data, s->size) &&
+		    buffer_has_utf16_bom(s->data, s->size)) {
+			int outsz = 0;
+			char *buf;
+			buf = reencode_string_len(s->data, (int)s->size,
+						  "UTF-8", "UTF-16", &outsz);
+			if (buf) {
+				free(s->data);
+				s->data = buf;
+				s->size = outsz;
+				s->reencoded_from_utf16 = 1;
+			}
+		}
 		s->should_free = 1;
 	}
 	return 0;
@@ -5695,6 +5729,10 @@ static int diff_filespec_is_identical(struct diff_filespec *one,
 
 static int diff_filespec_check_stat_unmatch(struct diff_filepair *p)
 {
+	if (p->binary) {
+		p->one->binary = 1;
+		p->two->binary = 1;
+	}
 	if (p->done_skip_stat_unmatch)
 		return p->skip_stat_unmatch_result;
 
@@ -5735,6 +5773,7 @@ static void diffcore_skip_stat_unmatch(struct diff_options *diffopt)
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
 
+		p->binary = diffopt->flags.binary;
 		if (diff_filespec_check_stat_unmatch(p))
 			diff_q(&outq, p);
 		else {
diff --git a/diffcore.h b/diffcore.h
index a30da161da..3cd97bb93b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -47,6 +47,8 @@ struct diff_filespec {
 	unsigned has_more_entries : 1; /* only appear in combined diff */
 	/* data should be considered "binary"; -1 means "don't know yet" */
 	signed int is_binary : 2;
+	unsigned binary : 1;
+	unsigned reencoded_from_utf16 : 1;
 	struct userdiff_driver *driver;
 };
 
@@ -72,6 +74,7 @@ struct diff_filepair {
 	unsigned is_unmerged : 1;
 	unsigned done_skip_stat_unmatch : 1;
 	unsigned skip_stat_unmatch_result : 1;
+	unsigned binary : 1;
 };
 #define DIFF_PAIR_UNMERGED(p) ((p)->is_unmerged)
 
diff --git a/t/t4066-diff-encoding.sh b/t/t4066-diff-encoding.sh
new file mode 100755
index 0000000000..9bb3c70ada
--- /dev/null
+++ b/t/t4066-diff-encoding.sh
@@ -0,0 +1,98 @@
+#!/bin/sh
+
+test_description='git diff with encoding attribute'
+
+. ./test-lib.sh
+
+printf '\303\244rger\n\303\266se\n\303\274bel\n' |
+	iconv -f UTF-8 -t UTF-16 >UTF-16
+printf '\303\266se\n\303\274bel\n\303\245st\n' |
+	iconv -f UTF-8 -t UTF-16 >file2
+
+test_expect_success 'setup' '
+	cp UTF-16 file &&
+	git add file &&
+	git commit -m "add file in UTF-16" &&
+	test_tick &&
+	echo "file encoding=UTF-16" >.gitattributes
+'
+
+test_expect_success 'diff against local change' '
+	cp file2 file &&
+	test_tick &&
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..e98d27a 100644
+	a is converted to UTF-8 from UTF-16
+	b is converted to UTF-8 from UTF-16
+	--- a/file
+	+++ b/file
+	@@ -1,3 +1,3 @@
+	-ärger
+	 öse
+	 übel
+	+åst
+EOF
+	git diff file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff --binary against local change' '
+	cp file2 file &&
+	test_tick &&
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
+	GIT binary patch
+	literal 28
+	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
+
+	literal 32
+	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
+
+EOF
+	git diff --binary file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'commit local change' '
+	git add file &&
+	git commit -m "add file V2 in UTF-16" &&
+	test_tick
+'
+
+test_expect_success 'diff HEAD against HEAD^' '
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..e98d27a 100644
+	a is converted to UTF-8 from UTF-16
+	b is converted to UTF-8 from UTF-16
+	--- a/file
+	+++ b/file
+	@@ -1,3 +1,3 @@
+	-ärger
+	 öse
+	 übel
+	+åst
+EOF
+	git diff HEAD^ HEAD -- file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff --binary HEAD against HEAD^' '
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
+	GIT binary patch
+	literal 28
+	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
+
+	literal 32
+	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
+	
+EOF
+	git diff --binary HEAD^ HEAD -- file >actual &&
+	test_cmp expect actual
+'
+
+test_done
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..a2184d0300 100644
--- a/utf8.h
+++ b/utf8.h
@@ -16,6 +16,17 @@ int utf8_fprintf(FILE *, const char *, ...);
 extern const char utf8_bom[];
 extern int skip_utf8_bom(char **, size_t);
 
+static inline int buffer_has_utf16_bom(const void *buf, size_t len) {
+  const unsigned char *text = (unsigned char *)buf;
+  if (!text ||  len < 2)
+    return 0;
+  if (text[0] == 0xff && text[1] == 0xfe)
+    return 1;
+  if (text[0] == 0xfe && text[1] == 0xff)
+    return 1;
+  return 0;
+}
+
 void strbuf_add_wrapped_text(struct strbuf *buf,
 		const char *text, int indent, int indent2, int width);
 void strbuf_add_wrapped_bytes(struct strbuf *buf, const char *data, int len,
-- 
2.16.1.194.gb2e45c695d


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8
  2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
@ 2018-02-26 18:43         ` Peter Krefting
  2018-02-27 22:39         ` Jeff King
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Krefting @ 2018-02-26 18:43 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Torsten Bögershausen, peff, j6t, lars.schneider, gitster,
	patrick, larsxschneider

tboegi@web.de:

> +static inline int buffer_has_utf16_bom(const void *buf, size_t len) {
> +  const unsigned char *text = (unsigned char *)buf;
> +  if (!text ||  len < 2)
> +    return 0;
> +  if (text[0] == 0xff && text[1] == 0xfe)
> +    return 1;
> +  if (text[0] == 0xfe && text[1] == 0xff)
> +    return 1;
> +  return 0;
> +}

This would also match a UTF-32LE BOM, not that I think anyone would 
actually use that for text files, but it might be worth adding a test 
for that, just in case?

   if (text[0] == 0xff && text[1] == 0xfe)
     return len < 4 || (text[2] != 0 && text[3] != 0);

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8
  2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
  2018-02-26 18:43         ` Peter Krefting
@ 2018-02-27 22:39         ` Jeff King
  1 sibling, 0 replies; 36+ messages in thread
From: Jeff King @ 2018-02-27 22:39 UTC (permalink / raw)
  To: tboegi; +Cc: j6t, lars.schneider, git, gitster, patrick, larsxschneider

On Mon, Feb 26, 2018 at 06:27:06PM +0100, tboegi@web.de wrote:

> @@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>  			s->size = size;
>  			s->should_free = 1;
>  		}
> -	}
> -	else {
> +		if (!s->binary && buffer_is_binary(s->data, s->size) &&
> +		    buffer_has_utf16_bom(s->data, s->size)) {
> +			int outsz = 0;
> +			char *outbuf;
> +			outbuf = reencode_string_len(s->data, (int)s->size,
> +						     "UTF-8", "UTF-16", &outsz);
> +			if (outbuf) {
> +				if (s->should_free)
> +					free(s->data);
> +				if (s->should_munmap)
> +					munmap(s->data, s->size);
> +				s->should_munmap = 0;
> +				s->data = outbuf;
> +				s->size = outsz;
> +				s->reencoded_from_utf16 = 1;
> +				s->should_free = 1;
> +			}
> +		}
> +	} else {

I don't think it makes sense to do the conversion deep inside
diff_populate_filespec(), because it will kick in much more than you'd
want (e.g., "format-patch | am" would stop working with this patch, I
think).

I think you'd want to hook this in at the same level as fill_textconv().
In fact, one way to do it would be to have the get_textconv() stack just
fill in a special driver that does the auto-detection. This is similar
to my earlier patch, but it avoids overriding the user-facing config for
non-textconv things (and naturally any actual configured textconv filter
would override the auto-detection).

-Peff

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-02-27 22:39 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
2017-12-11 18:39 ` Eric Sunshine
2017-12-11 23:47   ` Lars Schneider
2017-12-11 23:58     ` Eric Sunshine
2017-12-12 10:58       ` Lars Schneider
2017-12-11 20:47 ` Johannes Sixt
2017-12-11 23:42   ` Lars Schneider
2017-12-12  0:59     ` Junio C Hamano
2017-12-12  7:15       ` Johannes Sixt
2017-12-12 10:55         ` Lars Schneider
2017-12-12 19:31           ` Junio C Hamano
2017-12-13 17:57             ` Lars Schneider
2017-12-13 18:11               ` Junio C Hamano
2017-12-13 23:02                 ` Lars Schneider
2017-12-14 23:01                   ` Junio C Hamano
2017-12-12  7:09     ` Johannes Sixt
2017-12-18 10:13   ` Torsten Bögershausen
2017-12-18 13:12     ` Jeff King
2017-12-23  8:08       ` Torsten Bögershausen
2017-12-29 13:28       ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
2017-12-29 13:28       ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
2017-12-29 13:28       ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
2018-02-26 18:43         ` Peter Krefting
2018-02-27 22:39         ` Jeff King
2017-12-18 18:02     ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
2017-12-18 21:55     ` Johannes Sixt
2017-12-15  9:58 ` Jeff King
2017-12-18 10:54   ` Lars Schneider
2017-12-18 12:59     ` Jeff King
2017-12-17 17:14 ` Torsten Bögershausen
2017-12-28 16:14   ` Lars Schneider
2017-12-29 12:59     ` Torsten Bögershausen
2017-12-29 13:56       ` Lars Schneider
2018-01-03 19:15       ` Junio C Hamano
2018-01-03 20:45         ` Lars Schneider

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).