[PATCH v7 0/7] convert: add support for different encodings

* [PATCH v7 0/7] convert: add support for different encodings
@ 2018-02-15 15:27 lars.schneider
  2018-02-15 15:27 ` [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-4 and 6 are preparation and helper functions.
Patches 5 and 7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already
in master.

Changes since v6:

* use consistent casing for core.checkRoundtripEncoding (Junio)
* fix gibberish in commit message (Junio)
* improve documentation (Torsten)
* improve advice messages (Torsten)


Thanks,
Lars

  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2b94bec353
Checkout: git fetch https://github.com/larsxschneider/git encoding-v7 && git checkout 2b94bec353


### Interdiff (v6..v7):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index ea5a9509c6..10cb37795d 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -291,19 +291,20 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Git clients that do not support the `working-tree-encoding` attribute
-  will checkout the respective files UTF-8 encoded and not in the
-  expected encoding. Consequently, these files will appear different
-  which typically causes trouble. This is in particular the case for
-  older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of February 2018).
+- Third party Git implementations that do not support the
+  `working-tree-encoding` attribute will checkout the respective files
+  UTF-8 encoded and not in the expected encoding. Consequently, these
+  files will appear different which typically causes trouble. This is
+  in particular the case for older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of February 2018).

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
-  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
-  to make Git check the round trip encoding (see linkgit:git-config[1]).
-  SHIFT-JIS (Japanese character set) is known to have round trip issues
-  with UTF-8 and is checked by default.
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
@@ -327,7 +328,7 @@ explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 ------------------------
-*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj 		text working-tree-encoding=UTF-16LE eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
diff --git a/convert.c b/convert.c
index 71dffc7167..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -352,29 +352,29 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,

 	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
 		const char *error_msg = _(
-			"BOM is prohibited for '%s' if encoded as %s");
+			"BOM is prohibited in '%s' if encoded as %s");
+		/*
+		 * This advice is shown for UTF-??BE and UTF-??LE encodings.
+		 * We truncate the encoding name to 6 chars with %.6s to cut
+		 * off the last two "byte order" characters.
+		 */
 		const char *advise_msg = _(
-			"You told Git to treat '%s' as %s. A byte order mark "
-			"(BOM) is prohibited with this encoding. Either use "
-			"%.6s as working tree encoding or remove the BOM from the "
-			"file.");
-
-		advise(advise_msg, path, enc->name, enc->name, enc->name);
+			"The file '%s' contains a byte order mark (BOM). "
+			"Please use %.6s as working-tree-encoding.");
+		advise(advise_msg, path, enc->name);
 		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
 			error(error_msg, path, enc->name);

-
 	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
 		const char *error_msg = _(
-			"BOM is required for '%s' if encoded as %s");
+			"BOM is required in '%s' if encoded as %s");
 		const char *advise_msg = _(
-			"You told Git to treat '%s' as %s. A byte order mark "
-			"(BOM) is required with this encoding. Either use "
-			"%sBE/%sLE as working tree encoding or add a BOM to the "
-			"file.");
-		advise(advise_msg, path, enc->name, enc->name, enc->name);
+			"The file '%s' is missing a byte order mark (BOM). "
+			"Please use %sBE or %sLE (depending on the byte order) "
+			"as working-tree-encoding.");
+		advise(advise_msg, path, enc->name, enc->name);
 		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
@@ -405,7 +405,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	 * Unicode aims to be a superset of all other character encodings.
 	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
 	 * trip issues [2]. Check the round trip conversion for all encodings
-	 * listed in core.checkRoundTripEncoding.
+	 * listed in core.checkRoundtripEncoding.
 	 *
 	 * The round trip check is only performed if content is written to Git.
 	 * This ensures that no information is lost during conversion to/from
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5dcdd5f899..e4717402a5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -221,10 +221,10 @@ test_expect_success 'check roundtrip encoding' '
 	git reset &&

 	# ... unless we overwrite the Git config!
-	test_config core.checkRoundTripEncoding "garbage" &&
+	test_config core.checkRoundtripEncoding "garbage" &&
 	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for SHIFT-JIS" &&
-	test_unconfig core.checkRoundTripEncoding &&
+	test_unconfig core.checkRoundtripEncoding &&
 	git reset &&

 	# UTF-16 encoded files should not be round-trip checked by default...
@@ -233,14 +233,14 @@ test_expect_success 'check roundtrip encoding' '
 	git reset &&

 	# ... unless we tell Git to check it!
-	test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+	test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
 	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&

 	# ... unless we tell Git to check it!
 	# (here we also check that the casing of the encoding is irrelevant)
-	test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+	test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
 	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&


### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  74 +++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 253 ++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 672 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1


* [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-16 12:55   ` Ævar Arnfjörð Bjarmason
  2018-02-15 15:27 ` [PATCH v7 2/7] strbuf: add xstrdup_toupper() lars.schneider
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.
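
To illustrate why the assignment is redundant, here is a minimal
standalone sketch. It assumes xmallocz(n) behaves like an allocation of
n + 1 bytes whose last byte is set to NUL; sketch_mallocz() and
sketch_strdup_tolower() are hypothetical stand-ins, not Git's code, and
the error handling is simplified.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for xmallocz(): allocate size + 1 bytes and
 * store a NUL in the final byte. Aborts on allocation failure.
 */
static char *sketch_mallocz(size_t size)
{
	char *ret = malloc(size + 1);

	if (!ret)
		abort();
	ret[size] = '\0';
	return ret;
}

/* Lower-case copy; no explicit terminator is written after the loop. */
static char *sketch_strdup_tolower(const char *string)
{
	size_t len = strlen(string), i;
	char *result = sketch_mallocz(len);

	for (i = 0; i < len; i++)
		result[i] = tolower((unsigned char)string[i]);
	return result;
}
```

The loop only writes indices 0..len-1, so the NUL placed by the
allocator at index len is never overwritten.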

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.1


* [PATCH v7 2/7] strbuf: add xstrdup_toupper()
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
  2018-02-15 15:27 ` [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-15 15:27 ` [PATCH v7 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar to xstrdup_tolower().

This function is used in a subsequent commit.
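
As a rough sketch of the intended use: normalizing a user-supplied
encoding name to upper case once allows plain strcmp() comparisons
afterwards. The helper names below are hypothetical and the allocation
handling is simplified compared to Git's.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for xstrdup_toupper(); aborts on failure. */
static char *sketch_strdup_toupper(const char *string)
{
	size_t len = strlen(string), i;
	char *result = malloc(len + 1);

	if (!result)
		abort();
	for (i = 0; i < len; i++)
		result[i] = toupper((unsigned char)string[i]);
	result[len] = '\0';
	return result;
}

/*
 * Return 1 if the user-supplied value names plain "UTF-16",
 * regardless of how the user capitalized it.
 */
static int sketch_names_utf16(const char *user_value)
{
	char *name = sketch_strdup_toupper(user_value);
	int ret = !strcmp(name, "UTF-16");

	free(name);
	return ret;
}
```

With this, "utf-16" and "Utf-16" both match, while "utf-16le" does not.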

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.1


* [PATCH v7 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
  2018-02-15 15:27 ` [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-02-15 15:27 ` [PATCH v7 2/7] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-15 15:27 ` [PATCH v7 4/7] utf8: add function to detect a missing " lars.schneider
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.
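
The rule can be sketched in isolation for the UTF-16 case as follows.
This is a simplified illustration, not the patch itself: the sketch_*
names are hypothetical, the BOM tables use unsigned char, and the
encoding name is assumed to be upper case already.

```c
#include <string.h>

static const unsigned char sketch_utf16_be_bom[] = {0xFE, 0xFF};
static const unsigned char sketch_utf16_le_bom[] = {0xFF, 0xFE};

/* Return 1 if the first bom_len bytes of data equal bom. */
static int sketch_has_prefix(const char *data, size_t len,
			     const unsigned char *bom, size_t bom_len)
{
	return len >= bom_len && !memcmp(data, bom, bom_len);
}

/* Return 1 if data starts with either UTF-16 byte order mark. */
static int sketch_has_utf16_bom(const char *data, size_t len)
{
	return sketch_has_prefix(data, len, sketch_utf16_be_bom,
				 sizeof(sketch_utf16_be_bom)) ||
	       sketch_has_prefix(data, len, sketch_utf16_le_bom,
				 sizeof(sketch_utf16_le_bom));
}

/* A BOM is prohibited when the name already fixes the byte order. */
static int sketch_has_prohibited_utf16_bom(const char *enc,
					   const char *data, size_t len)
{
	return (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
	       sketch_has_utf16_bom(data, len);
}
```

Note that plain "UTF-16" with a BOM is not flagged; only the names that
carry an explicit byte order are.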

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1


* [PATCH v7 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-02-15 15:27 ` [PATCH v7 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-15 15:27 ` [PATCH v7 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there is no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.
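
To make the "encoding confusion" concrete, the following standalone
sketch decodes the same two bytes under both byte orders and shows a
simplified UTF-16-only version of the missing-BOM check. All sketch_*
names are hypothetical and the encoding name is assumed to be upper
case.

```c
#include <stddef.h>
#include <string.h>

/* Decode the first UTF-16 code unit assuming big-endian byte order. */
static unsigned sketch_first_unit_be(const unsigned char *data)
{
	return ((unsigned)data[0] << 8) | data[1];
}

/* Decode the first UTF-16 code unit assuming little-endian byte order. */
static unsigned sketch_first_unit_le(const unsigned char *data)
{
	return ((unsigned)data[1] << 8) | data[0];
}

/*
 * Return 1 if enc is plain "UTF-16" (no byte order in the name) and
 * data does not start with a UTF-16 BOM.
 */
static int sketch_is_missing_utf16_bom(const char *enc,
				       const unsigned char *data, size_t len)
{
	static const unsigned char be[] = {0xFE, 0xFF};
	static const unsigned char le[] = {0xFF, 0xFE};
	int has_bom = len >= 2 &&
		      (!memcmp(data, be, 2) || !memcmp(data, le, 2));

	return !strcmp(enc, "UTF-16") && !has_bom;
}
```

The bytes 0x00 0x41 decode to U+0041 as big-endian but to U+4100 as
little-endian, so without a BOM or an explicit byte order in the name
the content is ambiguous.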

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1


* [PATCH v7 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-02-15 15:27 ` [PATCH v7 4/7] utf8: add function to detect a missing " lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-15 15:27 ` [PATCH v7 6/7] convert: add tracing for " lars.schneider
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.
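
The conversion pair can be sketched with POSIX iconv(3) as below. This
is only an illustration under simplifying assumptions: sketch_reencode()
is a hypothetical helper, the caller supplies a large enough output
buffer, and the input cast follows the POSIX iconv() prototype. It is
not the code path added by this patch.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Convert in_len bytes of `in` from encoding `from` to encoding `to`,
 * writing into `out`. Returns the number of bytes written, or
 * (size_t)-1 if the conversion fails or `out` is too small.
 */
static size_t sketch_reencode(const char *from, const char *to,
			      const char *in, size_t in_len,
			      char *out, size_t out_size)
{
	iconv_t cd = iconv_open(to, from);
	char *in_p = (char *)in;
	char *out_p = out;
	size_t out_left = out_size;
	size_t rc;

	if (cd == (iconv_t)-1)
		return (size_t)-1;
	rc = iconv(cd, &in_p, &in_len, &out_p, &out_left);
	iconv_close(cd);
	if (rc == (size_t)-1)
		return (size_t)-1;
	return out_size - out_left;
}
```

Calling it with ("UTF-16LE", "UTF-8") corresponds to the direction used
when content is added to the index, and swapping the two names
corresponds to the checkout direction.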

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt  |  66 ++++++++++++
 convert.c                        | 157 ++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 210 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 434 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..5ec179d631 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,72 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Third party Git implementations that do not support the
+  `working-tree-encoding` attribute will checkout the respective files
+  UTF-8 encoded and not in the expected encoding. Consequently, these
+  files will appear different which typically causes trouble. This is
+  in particular the case for older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of February 2018).
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+------------------------
+*.proj		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.proj' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+------------------------
+*.proj 		text working-tree-encoding=UTF-16LE eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.proj
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..d20c341b6d 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited in '%s' if encoded as %s");
+		/*
+		 * This advice is shown for UTF-??BE and UTF-??LE encodings.
+		 * We truncate the encoding name to 6 chars with %.6s to cut
+		 * off the last two "byte order" characters.
+		 */
+		const char *advise_msg = _(
+			"The file '%s' contains a byte order mark (BOM). "
+			"Please use %.6s as working-tree-encoding.");
+		advise(advise_msg, path, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required in '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"The file '%s' is missing a byte order mark (BOM). "
+			"Please use %sBE or %sLE (depending on the byte order) "
+			"as working-tree-encoding.");
+		advise(advise_msg, path, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char *msg = _("failed to encode '%s' from %s to %s");
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, default_encoding, enc->name);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(*enc));
+	/*
+	 * Ensure encoding names are always upper case (e.g. UTF-8) to
+	 * simplify subsequent string comparisons.
+	 */
+	enc->name = xstrdup_toupper(value);
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1167,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1176,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1064,6 +1201,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1282,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1312,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1344,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1816,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.working_tree_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada9..e2f319d677 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 0000000000..f9ce3e5ef5
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,210 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+
+	# cleanup
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\377\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\377\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\377\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\377\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add UTF-16 files with a BOM (big-endian and little-endian)
+	# but we tell Git to treat them as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	cat >expectIndexLF <<-\EOF &&
+		i/lf    w/-text attr/text             	eol.utf16
+	EOF
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# Although the file has CRLF in the working tree, ensure LF in the index
+	git ls-files --eol eol.utf16 >actual &&
+	test_cmp expectIndexLF actual &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	# The file has LF in the working tree, ensure LF in the index
+	git ls-files --eol eol.utf16 >actual &&
+	test_cmp expectIndexLF actual &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no working tree encoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no working tree encoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.16.1
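
The two BOM tests in t0028 above expect 'git add' to fail when a file
declared as UTF-16BE/LE or UTF-32BE/LE starts with a BOM, and when a file
declared as plain UTF-16 or UTF-32 lacks one. A minimal C sketch of those
two rules follows. The names bom_is_prohibited() and bom_is_missing() are
hypothetical and chosen for this illustration only; they are not the
helpers added by this series.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

static const unsigned char UTF16_BE_BOM[] = { 0xFE, 0xFF };
static const unsigned char UTF16_LE_BOM[] = { 0xFF, 0xFE };
static const unsigned char UTF32_BE_BOM[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char UTF32_LE_BOM[] = { 0xFF, 0xFE, 0x00, 0x00 };

/* Returns 1 if `data` starts with the `bom_len` bytes of `bom`. */
static int starts_with_bom(const char *data, size_t len,
			   const unsigned char *bom, size_t bom_len)
{
	return data && len >= bom_len && !memcmp(data, bom, bom_len);
}

static int has_utf16_bom(const char *data, size_t len)
{
	return starts_with_bom(data, len, UTF16_BE_BOM, sizeof(UTF16_BE_BOM)) ||
	       starts_with_bom(data, len, UTF16_LE_BOM, sizeof(UTF16_LE_BOM));
}

static int has_utf32_bom(const char *data, size_t len)
{
	return starts_with_bom(data, len, UTF32_BE_BOM, sizeof(UTF32_BE_BOM)) ||
	       starts_with_bom(data, len, UTF32_LE_BOM, sizeof(UTF32_LE_BOM));
}

/*
 * An encoding name with an explicit byte order (…BE/…LE) already fixes
 * the byte order, so any BOM in the content is rejected, whichever
 * byte order that BOM announces.
 */
int bom_is_prohibited(const char *enc, const char *data, size_t len)
{
	if (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE"))
		return has_utf16_bom(data, len);
	if (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE"))
		return has_utf32_bom(data, len);
	return 0;
}

/* Plain "UTF-16"/"UTF-32" leave the byte order open, so a BOM is needed. */
int bom_is_missing(const char *enc, const char *data, size_t len)
{
	if (!strcasecmp(enc, "UTF-16"))
		return !has_utf16_bom(data, len);
	if (!strcasecmp(enc, "UTF-32"))
		return !has_utf32_bom(data, len);
	return 0;
}
```

The encoding name is compared case-insensitively here, so "utf-16be" in
.gitattributes is treated like "UTF-16BE". A caller would run both checks
on the working tree content before attempting any conversion.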



* [PATCH v7 6/7] convert: add tracing for 'working-tree-encoding' attribute
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-02-15 15:27 ` [PATCH v7 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-15 15:27 ` [PATCH v7 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  2018-02-15 20:03 ` [PATCH v7 0/7] convert: add support for different encodings Junio C Hamano
  7 siblings, 0 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 25 +++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index d20c341b6d..c4e2fd5fa5 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index f9ce3e5ef5..01789ae1b8 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test repo' '
 	git config core.eol lf &&
 
-- 
2.16.1
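
For readers who want to see the dump layout without building Git: the
trace_encoding() helper above prints each byte as index, hex value and
printable character, eight cells per row. The sketch below reproduces
that layout under some assumptions: it drops the ANSI dimming escapes,
writes into a caller-supplied buffer instead of a strbuf, and uses the
hypothetical name format_byte_dump().

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Formats `len` bytes of `buf` into `out` as "| index: hex char" cells,
 * eight cells per row. Returns the number of characters written (not
 * counting the terminating NUL), or -1 if `out` is too small.
 */
int format_byte_dump(const char *buf, size_t len, char *out, size_t out_size)
{
	size_t used = 0;
	size_t i;

	if (!out || !out_size)
		return -1;
	out[0] = '\0';

	for (i = 0; buf && i < len; i++) {
		unsigned char byte = (unsigned char)buf[i];
		/* Show printable ASCII as-is and everything else as a blank. */
		char shown = (byte > 32 && byte < 127) ? (char)byte : ' ';
		/* End the row after eight cells or after the last byte. */
		char sep = ((i + 1) % 8 && (i + 1) < len) ? ' ' : '\n';
		int n = snprintf(out + used, out_size - used,
				 "| %2zu: %2x %c%c", i, byte, shown, sep);

		if (n < 0 || (size_t)n >= out_size - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

Feeding it the two bytes of a big-endian UTF-16 "a" (0x00 0x61) yields
one row with two cells, which makes a stray NUL or a missing BOM easy to
spot in the trace output.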



* [PATCH v7 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-02-15 15:27 ` [PATCH v7 6/7] convert: add tracing for " lars.schneider
@ 2018-02-15 15:27 ` lars.schneider
  2018-02-15 20:03 ` [PATCH v7 0/7] convert: add support for different encodings Junio C Hamano
  7 siblings, 0 replies; 41+ messages in thread
From: lars.schneider @ 2018-02-15 15:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/config.txt         |  6 ++++
 Documentation/gitattributes.txt  |  8 +++++
 config.c                         |  5 +++
 convert.c                        | 74 ++++++++++++++++++++++++++++++++++++++++
 convert.h                        |  1 +
 environment.c                    |  1 +
 t/t0028-working-tree-encoding.sh | 41 ++++++++++++++++++++++
 7 files changed, 136 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
 	This variable can be set to 'input',
 	in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+	A comma separated list of encodings that Git performs UTF-8 round
+	trip checks on if they are used in a `working-tree-encoding`
+	attribute (see linkgit:gitattributes[5]). The default value is
+	`SHIFT-JIS`.
+
 core.symlinks::
 	If false, symbolic links are checked out as small plain files that
 	contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 5ec179d631..10cb37795d 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -298,6 +298,14 @@ number of pitfalls:
   in particular the case for older Git versions and alternative Git
   implementations such as JGit or libgit2 (as of February 2018).
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkroundtripencoding")) {
+		check_roundtrip_encoding = xstrdup(value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.notesref")) {
 		notes_ref_name = xstrdup(value);
 		return 0;
diff --git a/convert.c b/convert.c
index c4e2fd5fa5..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
 	strbuf_release(&trace);
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+	/*
+	 * check_roundtrip_encoding contains a string of space and/or
+	 * comma separated encodings (e.g. "UTF-16, ASCII, CP1125").
+	 * Search for the given encoding in that string.
+	 */
+	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+	const char *next = found + strlen(enc_name);
+	int len = strlen(check_roundtrip_encoding);
+	return (found && (
+			/*
+			 * check that the found encoding is at the
+			 * beginning of check_roundtrip_encoding or
+			 * that it is prefixed with a space or comma
+			 */
+			found == check_roundtrip_encoding || (
+				found > check_roundtrip_encoding &&
+				(*(found-1) == ' ' || *(found-1) == ',')
+			)
+		) && (
+			/*
+			 * check that the found encoding is at the
+			 * end of check_roundtrip_encoding or
+			 * that it is suffixed with a space or comma
+			 */
+			next == check_roundtrip_encoding + len || (
+				next < check_roundtrip_encoding + len &&
+				(*next == ' ' || *next == ',')
+			)
+		));
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -366,6 +399,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	}
 	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
+	/*
+	 * UTF supports lossless conversion round tripping [1] and conversions
+	 * between UTF and other encodings are mostly round trip safe as
+	 * Unicode aims to be a superset of all other character encodings.
+	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+	 * trip issues [2]. Check the round trip conversion for all encodings
+	 * listed in core.checkRoundtripEncoding.
+	 *
+	 * The round trip check is only performed if content is written to Git.
+	 * This ensures that no information is lost during conversion to/from
+	 * the internal UTF-8 representation.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without an iconv error. Iconv errors
+	 * are already caught above.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			die(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
 }
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };
 
 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
 						  const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 01789ae1b8..e4717402a5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -209,4 +209,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
 	git reset --hard $BEFORE_STATE
 '
 
+test_expect_success 'check roundtrip encoding' '
+	text="hallo there!\nroundtrip test here!" &&
+	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+	# SHIFT-JIS encoded files are round-trip checked by default...
+	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# ... unless we overwrite the Git config!
+	test_config core.checkRoundtripEncoding "garbage" &&
+	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	test_unconfig core.checkRoundtripEncoding &&
+	git reset &&
+
+	# UTF-16 encoded files should not be round-trip checked by default...
+	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
+	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	# (here we also check that the casing of the encoding is irrelevant)
+	test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
+	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# cleanup
+	rm roundtrip.shift roundtrip.utf16 &&
+	git reset --hard HEAD
+'
+
 test_done
-- 
2.16.1
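
A note on the list lookup: check_roundtrip() above relies on
strcasestr(), which is a non-standard extension, and it inspects only
the first match, so a value such as "UTF-16LE, UTF-16" would not report
"UTF-16" as listed. One possible alternative, offered only as a sketch
and not as what the patch does, is to walk the list token by token. The
name encoding_in_list() is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Returns 1 if `enc_name` appears as a complete entry in `list`, where
 * entries are separated by spaces and/or commas and compared without
 * regard to case. Returns 0 otherwise, including for NULL arguments.
 */
int encoding_in_list(const char *list, const char *enc_name)
{
	size_t name_len;

	if (!list || !enc_name || !*enc_name)
		return 0;
	name_len = strlen(enc_name);

	while (*list) {
		size_t token_len;

		/* Skip any run of separators before the next entry. */
		list += strspn(list, " ,");
		/* Measure the entry up to the next separator or the end. */
		token_len = strcspn(list, " ,");
		if (token_len == name_len &&
		    !strncasecmp(list, enc_name, name_len))
			return 1;
		list += token_len;
	}
	return 0;
}
```

Because every entry is compared in full, a prefix such as "UTF-16" never
matches "UTF-16LE", and a later exact entry is still found. No pointer
arithmetic happens on a failed search result.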



* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-15 15:27 [PATCH v7 0/7] convert: add support for different encodings lars.schneider
                   ` (6 preceding siblings ...)
  2018-02-15 15:27 ` [PATCH v7 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
@ 2018-02-15 20:03 ` Junio C Hamano
  2018-02-15 22:09   ` Jeff King
  2018-02-16 14:42   ` Lars Schneider
  7 siblings, 2 replies; 41+ messages in thread
From: Junio C Hamano @ 2018-02-15 20:03 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> -- Git clients that do not support the `working-tree-encoding` attribute
> -  will checkout the respective files UTF-8 encoded and not in the
> -  expected encoding. Consequently, these files will appear different
> -  which typically causes trouble. This is in particular the case for
> -  older Git versions and alternative Git implementations such as JGit
> -  or libgit2 (as of February 2018).
> +- Third party Git implementations that do not support the
> +  `working-tree-encoding` attribute will checkout the respective files
> +  UTF-8 encoded and not in the expected encoding. Consequently, these
> +  files will appear different which typically causes trouble. This is
> +  in particular the case for older Git versions and alternative Git
> +  implementations such as JGit or libgit2 (as of February 2018).

I know somebody found "clients" misleading in the original, but the
ones that do not understand w-t-e do not have to be third party
reimplementations and imitations.  All existing Git implementations,
including ours, don't.

One thing I find more problematic is that the above places *too*
much stress on the UTF-8 centric worldview.  It is perfectly valid
to store your text contents encoded in ShiftJIS and check them out
as-is, with or without this patch.  It is grossly misleading to say
that older versions of Git will check them out in UTF-8.  "will
checkout these files as-is without encoding conversion" is a better
way to say it, probably.

Also notice that even in the world with w-t-e, such a project won't
benefit from w-t-e at all.  After all, they have been happy using
ShiftJIS in repository and using the same encoding on the working
tree, and because w-t-e assumes that everybody should be using UTF-8
internally, such a project cannot take advantage of the new
mechanism.

And from that point of view, perhaps w-t-e attribute is somewhat
misdesigned.

In general, an attribute is about the project's contents in the
manner independent of platform or environment.  You define "this
file is a C source" or "this file has JPEG image" there.  What exact
program you use to present diffs between the two versions of such a
file (external diff command) or what exact program you use to
extract the textual representations (textconv filter) is environment
and platform dependent and is left to the configuration mechanism
for each repository.

To be in line with the above design principle, I think the attribute
ought to be "the in-tree contents of this path is encoded in ..."
whose values could be things like UTF-8, ShiftJIS, etc.  What
external encoding the paths should be checked out is not a
project-wide matter, especially when talking about cross platform
projects.  Perhaps a project in Japanese language wants to check
out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived
systems.  The participants all need to know what in-repository
encoding is used, which is a sensible use of attributes.  They also
need to know what the recommended external encoding to be used in
the working tree is for their platforms, but that is more like what
Makefile variable to set for their platforms, etc., and is not a
good match to the attributes system.


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-15 20:03 ` [PATCH v7 0/7] convert: add support for different encodings Junio C Hamano
@ 2018-02-15 22:09   ` Jeff King
  2018-02-16 18:55     ` Junio C Hamano
  2018-02-16 14:42   ` Lars Schneider
  1 sibling, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-15 22:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, ramsay,
	Johannes.Schindelin, Lars Schneider

On Thu, Feb 15, 2018 at 12:03:06PM -0800, Junio C Hamano wrote:

> And from that point of view, perhaps w-t-e attribute is somewhat
> misdesigned.
> 
> In general, an attribute is about the project's contents in the
> manner independent of platform or environment.  You define "this
> file is a C source" or "this file has JPEG image" there.  What exact
> program you use to present diffs between the two versions of such a
> file (external diff command) or what exact program you use to
> extract the textual representations (textconv filter) is environment
> and platform dependent and is left to the configuration mechanism
> for each repository.
> 
> To be in line with the above design principle, I think the attribute
> ought to be "the in-tree contents of this path is encoded in ..."
> whose values could be things like UTF-8, ShiftJIS, etc.  What
> external encoding the paths should be checked out is not a
> project-wide matter, especially when talking about cross platform
> projects.  Perhaps a project in Japanese language wants to check
> out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived
> systems.  The participants all need to know what in-repository
> encoding is used, which is a sensible use of attributes.  They also
> need to know what the recommended external encoding to be used in
> the working tree is for their platforms, but that is more like what
> Makefile variable to set for their platforms, etc., and is not a
> good match to the attributes system.

While I agree with what you're saying philosophically here, I suspect you'd
still need another attribute for "no really, this needs to be checked
out as encoding X". The same way we treat line endings as a platform
decision, but we still need to have `eol=crlf` for those files which
really, no matter what platform you're on, have external tools depending
on them to have some particular line ending.

So a full proposal would support both cases: "check this out in the
local platform's preferred encoding" and "always check this out in
_this_ encoding". And Lars's proposal is just the second half of that.

But I'm not sure anybody even really cares about the first part; I don't
think we've seen anybody actually ask for it.

-Peff


* Re: [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-02-15 15:27 ` [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-02-16 12:55   ` Ævar Arnfjörð Bjarmason
  2018-02-16 18:45     ` Jeff King
  0 siblings, 1 reply; 41+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-02-16 12:55 UTC (permalink / raw)
  To: lars.schneider
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes Schindelin, Lars Schneider

On Thu, Feb 15, 2018 at 4:27 PM,  <lars.schneider@autodesk.com> wrote:
> Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
> allocate the buffer for the lower case string with xmallocz(). This
> already ensures a NUL at the end of the allocated buffer.
>
> Remove the unnecessary assignment.
> [...]
>         for (i = 0; i < len; i++)
>                 result[i] = tolower(string[i]);
> -       result[i] = '\0';
>         return result;
>  }

I agree with this approach, but it's worth noting for other reviewers
that there's been some disagreement here on-list (between Eric and me)
about whether these sorts of patterns should be removed or kept
(although the calloc() case is slightly different from xmallocz()),
see: https://public-inbox.org/git/871shum182.fsf@evledraar.gmail.com/


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-15 20:03 ` [PATCH v7 0/7] convert: add support for different encodings Junio C Hamano
  2018-02-15 22:09   ` Jeff King
@ 2018-02-16 14:42   ` Lars Schneider
  2018-02-16 16:58     ` Torsten Bögershausen
  2018-02-16 19:04     ` Junio C Hamano
  1 sibling, 2 replies; 41+ messages in thread
From: Lars Schneider @ 2018-02-16 14:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 15 Feb 2018, at 21:03, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> -- Git clients that do not support the `working-tree-encoding` attribute
>> -  will checkout the respective files UTF-8 encoded and not in the
>> -  expected encoding. Consequently, these files will appear different
>> -  which typically causes trouble. This is in particular the case for
>> -  older Git versions and alternative Git implementations such as JGit
>> -  or libgit2 (as of February 2018).
>> +- Third party Git implementations that do not support the
>> +  `working-tree-encoding` attribute will checkout the respective files
>> +  UTF-8 encoded and not in the expected encoding. Consequently, these
>> +  files will appear different which typically causes trouble. This is
>> +  in particular the case for older Git versions and alternative Git
>> +  implementations such as JGit or libgit2 (as of February 2018).
> 
> I know somebody found "clients" misleading in the original, but the
> ones that do not understand w-t-e do not have to be third party
> reimplementations and imitations.  All existing Git implementations,
> including ours, don't.

Agreed!


> One thing I find more problematic is that the above places *too*
> much stress on the UTF-8 centric worldview.  It is perfectly valid
> to store your text contents encoded in ShiftJIS and check them out
> as-is, with or without this patch.  It is grossly misleading to say
> that older versions of Git will check them out in UTF-8.  "will
> checkout these files as-is without encoding conversion" is a better
> way to say it, probably.

True. But that's not what I wanted to say in the "pitfalls" section.
If my Git client supports w-t-e and I add the ShiftJIS encoded
file "foo.bar" to my repository, then Git will store the file as
UTF-8 _internally_. That means if you clone my repository and your 
Git client does _not_ support w-t-e, then you will see "foo.bar" as 
UTF-8 encoded.
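
To make that concrete, here is a minimal sketch of the conversion in
both directions. It is an illustration only, not code from this series,
and it assumes a POSIX iconv(3) that knows the names "UTF-16LE" and
"UTF-8". The function names convert_bytes() and is_roundtrip_safe() and
the fixed 256-byte scratch buffers are hypothetical simplifications.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Converts `in_len` bytes of `in` from encoding `from` to encoding `to`.
 * Returns the number of bytes written to `out`, or (size_t)-1 on any
 * failure (unknown encoding, invalid input, or `out` too small).
 */
size_t convert_bytes(const char *to, const char *from,
		     const char *in, size_t in_len,
		     char *out, size_t out_size)
{
	iconv_t cd = iconv_open(to, from);
	char *src = (char *)in;	/* iconv() takes a non-const pointer */
	char *dst = out;
	size_t src_left = in_len;
	size_t dst_left = out_size;
	size_t rc;

	if (cd == (iconv_t)-1)
		return (size_t)-1;
	rc = iconv(cd, &src, &src_left, &dst, &dst_left);
	if (rc != (size_t)-1)
		/* Flush any pending shift state into the output. */
		rc = iconv(cd, NULL, NULL, &dst, &dst_left);
	iconv_close(cd);

	if (rc == (size_t)-1 || src_left)
		return (size_t)-1;
	return out_size - dst_left;
}

/*
 * Returns 1 if converting `src` from `enc` to UTF-8 and back yields the
 * original bytes, 0 otherwise. Only suitable for short inputs because
 * of the fixed scratch buffers.
 */
int is_roundtrip_safe(const char *enc, const char *src, size_t src_len)
{
	char utf8[256];
	char back[256];
	size_t utf8_len, back_len;

	utf8_len = convert_bytes("UTF-8", enc, src, src_len,
				 utf8, sizeof(utf8));
	if (utf8_len == (size_t)-1)
		return 0;
	back_len = convert_bytes(enc, "UTF-8", utf8, utf8_len,
				 back, sizeof(back));
	return back_len == src_len && !memcmp(src, back, src_len);
}
```

The first conversion corresponds to what ends up in the repository, and
it is also what a client without w-t-e support would write to the
working tree. The second conversion is what a w-t-e aware client does on
checkout.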


> Also notice that even in the world with w-t-e, such a project won't
> benefit from w-t-e at all.  After all, they have been happy using
> ShiftJIS in repository and using the same encoding on the working
> tree, and because w-t-e assumes that everybody should be using UTF-8
> internally, such a project cannot take advantage of the new
> mechanism.

Agreed. However, people using ShiftJIS are not my target audience.
My target audience are:

(1) People that have to encode their text files in UTF-16 (for 
    whatever reason - usually because of legacy processes or tools).

(2) People that want to see textual diffs of their UTF-16 encoded 
    files in their Git tools without special adjustments (on the 
    command line, on the web, etc).

That was my primary motivation. The fact that w-t-e supports any
other encoding too is just a nice side effect. I don't foresee people
using w-t-encodings other than UTF-16 in my organization.

I have the suspicion that the feature could be useful for the Git
community at large. Consider this Stack Overflow question:
https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text

This question was viewed 42k times and there is no good solution.
I believe w-t-e could be a good solution.



With your comments in mind, I tried to improve the text like this:

    Git recognizes files encoded with ASCII or one of its supersets (e.g.
    UTF-8, ISO-8859-1, ...) as text files.  Files encoded with certain other
    encodings (e.g. UTF-16) are interpreted as binary and consequently
    built-in Git text processing tools (e.g. 'git diff') as well as most Git
    web front ends do not visualize the contents of these files by default.

    ...

    Please note that using the `working-tree-encoding` attribute may have a
    number of pitfalls:

    - Alternative Git implementations (e.g. JGit or libgit2) and older Git 
      versions (as of February 2018) do not support the `working-tree-encoding`
      attribute. If you decide to use the `working-tree-encoding` attribute
      in your repository, then it is strongly recommended to ensure that all
      clients working with the repository support it.

      If you declare `*.proj` files as UTF-16 and you add `foo.proj` with a
      `working-tree-encoding` enabled Git client, then `foo.proj` will be
      stored as UTF-8 internally. A client without `working-tree-encoding`
      support will check out `foo.proj` as a UTF-8 encoded file. This will
      typically cause trouble for the users of this file.

      If a Git client that does not support the `working-tree-encoding`
      attribute adds a new file `bar.proj`, then `bar.proj` will be
      stored "as-is" internally (in this example probably as UTF-16).
      A client with `working-tree-encoding` support will interpret the
      internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
      That operation will fail and cause an error.

    ...
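
The second pitfall above can be illustrated with a small, self-contained
sketch. It is not code from this series: `looks_like_utf8()` is a
hypothetical, simplified well-formedness check (it does not reject every
overlong form or surrogate). It only shows why UTF-16 bytes that were stored
"as-is" tend to be rejected when they are later read as UTF-8: the BOM bytes
0xFF and 0xFE never occur in well-formed UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if buf[0..len) looks like well-formed
 * UTF-8, 0 otherwise. Simplified on purpose; it checks lead bytes and
 * continuation bytes only.
 */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i];
		size_t need, k;

		if (c < 0x80)
			need = 0;		/* ASCII, including NUL */
		else if (c >= 0xc2 && c <= 0xdf)
			need = 1;		/* 2-byte sequence */
		else if (c >= 0xe0 && c <= 0xef)
			need = 2;		/* 3-byte sequence */
		else if (c >= 0xf0 && c <= 0xf4)
			need = 3;		/* 4-byte sequence */
		else
			return 0;		/* 0x80..0xc1, 0xf5..0xff */

		if (len - i - 1 < need)
			return 0;		/* truncated sequence */
		for (k = 1; k <= need; k++)
			if ((buf[i + k] & 0xc0) != 0x80)
				return 0;	/* not a continuation byte */
		i += need + 1;
	}
	return 1;
}
```

With this check, the UTF-16LE bytes FF FE E4 00 (BOM plus "ä") are rejected,
while C3 A4 0A (UTF-8 "ä" and a newline) pass. Note that BOM-less UTF-16LE
ASCII such as 61 00 62 00 also passes, because NUL is a valid UTF-8 byte, so
an error is likely but not guaranteed for every input.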



> And from that point of view, perhaps w-t-e attribute is somewhat
> misdesigned.
> 
> In general, an attribute is about the project's contents in the
> manner independent of platform or environment.  You define "this
> file is a C source" or "this file has JPEG image" there.  What exact
> program you use to present diffs between the two versions of such a
> file (external diff command) or what exact program you use to
> extract the textual representations (textconv filter) is environment
> and platform dependent and is left to the configuration mechanism
> for each repository.

> To be in line with the above design principle, I think the attribute
> ought to be "the in-tree contents of this path is encoded in ..."
> whose values could be things like UTF-8, ShiftJIS, etc.  What
> external encoding the paths should be checked out is not a
> project-wide matter, especially when talking about cross platform
> projects.  Perhaps a project in Japanese language wants to check
> out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived
> systems.  The participants all need to know what in-repository
> encoding is used, which is a sensible use of attributes.  They also
> need to know what the recommended external encoding to be used in
> the working tree is for their platforms, but that is more like what
> Makefile variable to set for their platforms, etc., and is not a
> good match to the attributes system.

As mentioned above, this is not my intended use case here. As Peff
mentioned elsewhere "always check this out in _this_ encoding"
is the goal here.


Thanks a lot for the feedback,
Lars


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 14:42   ` Lars Schneider
@ 2018-02-16 16:58     ` Torsten Bögershausen
  2018-02-22 20:00       ` Lars Schneider
  2018-02-16 19:04     ` Junio C Hamano
  1 sibling, 1 reply; 41+ messages in thread
From: Torsten Bögershausen @ 2018-02-16 16:58 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote:
[]
> 
> Agreed. However, people using ShiftJIS are not my target audience.
> My target audience is:
> 
> (1) People that have to encode their text files in UTF-16 (for 
>     whatever reason - usually because of legacy processes or tools).
> 
> (2) People that want to see textual diffs of their UTF-16 encoded 
>     files in their Git tools without special adjustments (on the 
>     command line, on the web, etc).
> 
> That was my primary motivation. The fact that w-t-e supports any
> other encoding too is just a nice side effect. I don't foresee people
> using w-t-encodings other than UTF-16 in my organization.
> 
> I have the suspicion that the feature could be useful for the Git
> community at large. Consider this Stack Overflow question:
> https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
> 
> This question was viewed 42k times and there is no good solution.
> I believe w-t-e could be a good solution.
> 

If it were only about diffing UTF-16 files, I could suggest a patch.
I simply copy-paste it here for review; if someone thinks that it may
be useful, I can send it as a real patch/RFC.

git show HEAD


commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415
Author: Torsten Bögershausen <tboegi@web.de>
Date:   Fri Feb 2 15:35:23 2018 +0100

    Auto diff of UTF-16 files in UTF-8
    
    When a UTF-16 file is committed and later changed, `git diff` shows
    "Binary files XX and YY differ".
    
    When the user wants a diff in UTF-8, a textconv needs to be specified
    in .gitattributes and the textconv must be configured.
    
    A more user-friendly diff can be produced for UTF-16 if
    - the user did not use `git diff --binary`
    - the blob is identified as binary
    - the blob has a UTF-16 BOM
    - the blob can be converted into UTF-8
    
    Enhance the diff machinery to auto-detect UTF-16 blobs and show them
    as UTF-8, unless the user specifies `git diff --binary` which creates
    a binary diff.

diff --git a/diff.c b/diff.c
index fb22b19f09..51831ee94d 100644
--- a/diff.c
+++ b/diff.c
@@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
 			strbuf_reset(&header);
 		}
 
+		if (one && one->reencoded_from_utf16)
+			strbuf_addf(&header, "a is converted to UTF-8 from UTF-16\n");
+		if (two && two->reencoded_from_utf16)
+			strbuf_addf(&header, "b is converted to UTF-8 from UTF-16\n");
 		mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
 		mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
 
@@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 			s->size = size;
 			s->should_free = 1;
 		}
-	}
-	else {
+		if (!s->binary && buffer_is_binary(s->data, s->size) &&
+		    buffer_has_utf16_bom(s->data, s->size)) {
+			int outsz = 0;
+			char *outbuf;
+			outbuf = reencode_string_len(s->data, (int)s->size,
+						     "UTF-8", "UTF-16", &outsz);
+			if (outbuf) {
+				if (s->should_free)
+					free(s->data);
+				if (s->should_munmap)
+					munmap(s->data, s->size);
+				s->should_munmap = 0;
+				s->data = outbuf;
+				s->size = outsz;
+				s->reencoded_from_utf16 = 1;
+				s->should_free = 1;
+			}
+		}
+	} else {
 		enum object_type type;
 		if (size_only || (flags & CHECK_BINARY)) {
 			type = sha1_object_info(s->oid.hash, &s->size);
@@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		s->data = read_sha1_file(s->oid.hash, &type, &s->size);
 		if (!s->data)
 			die("unable to read %s", oid_to_hex(&s->oid));
+		if (!s->binary && buffer_is_binary(s->data, s->size) &&
+		    buffer_has_utf16_bom(s->data, s->size)) {
+			int outsz = 0;
+			char *buf;
+			buf = reencode_string_len(s->data, (int)s->size,
+						  "UTF-8", "UTF-16", &outsz);
+			if (buf) {
+				free(s->data);
+				s->data = buf;
+				s->size = outsz;
+				s->reencoded_from_utf16 = 1;
+			}
+		}
 		s->should_free = 1;
 	}
 	return 0;
@@ -5695,6 +5729,10 @@ static int diff_filespec_is_identical(struct diff_filespec *one,
 
 static int diff_filespec_check_stat_unmatch(struct diff_filepair *p)
 {
+	if (p->binary) {
+		p->one->binary = 1;
+		p->two->binary = 1;
+	}
 	if (p->done_skip_stat_unmatch)
 		return p->skip_stat_unmatch_result;
 
@@ -5735,6 +5773,7 @@ static void diffcore_skip_stat_unmatch(struct diff_options *diffopt)
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
 
+		p->binary = diffopt->flags.binary;
 		if (diff_filespec_check_stat_unmatch(p))
 			diff_q(&outq, p);
 		else {
diff --git a/diffcore.h b/diffcore.h
index a30da161da..3cd97bb93b 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -47,6 +47,8 @@ struct diff_filespec {
 	unsigned has_more_entries : 1; /* only appear in combined diff */
 	/* data should be considered "binary"; -1 means "don't know yet" */
 	signed int is_binary : 2;
+	unsigned binary : 1;
+	unsigned reencoded_from_utf16 : 1;
 	struct userdiff_driver *driver;
 };
 
@@ -72,6 +74,7 @@ struct diff_filepair {
 	unsigned is_unmerged : 1;
 	unsigned done_skip_stat_unmatch : 1;
 	unsigned skip_stat_unmatch_result : 1;
+	unsigned binary : 1;
 };
 #define DIFF_PAIR_UNMERGED(p) ((p)->is_unmerged)
 
diff --git a/t/t4066-diff-encoding.sh b/t/t4066-diff-encoding.sh
new file mode 100755
index 0000000000..9bb3c70ada
--- /dev/null
+++ b/t/t4066-diff-encoding.sh
@@ -0,0 +1,98 @@
+#!/bin/sh
+
+test_description='git diff with encoding attribute'
+
+. ./test-lib.sh
+
+printf '\303\244rger\n\303\266se\n\303\274bel\n' |
+	iconv -f UTF-8 -t UTF-16 >UTF-16
+printf '\303\266se\n\303\274bel\n\303\245st\n' |
+	iconv -f UTF-8 -t UTF-16 >file2
+
+test_expect_success 'setup' '
+	cp UTF-16 file &&
+	git add file &&
+	git commit -m "add file in UTF-16" &&
+	test_tick &&
+	echo "file encoding=UTF-16" >.gitattributes
+'
+
+test_expect_success 'diff against local change' '
+	cp file2 file &&
+	test_tick &&
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..e98d27a 100644
+	a is converted to UTF-8 from UTF-16
+	b is converted to UTF-8 from UTF-16
+	--- a/file
+	+++ b/file
+	@@ -1,3 +1,3 @@
+	-ärger
+	 öse
+	 übel
+	+åst
+EOF
+	git diff file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff --binary against local change' '
+	cp file2 file &&
+	test_tick &&
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
+	GIT binary patch
+	literal 28
+	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
+
+	literal 32
+	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
+
+EOF
+	git diff --binary file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'commit local change' '
+	git add file &&
+	git commit -m "add file V2 in UTF-16" &&
+	test_tick
+'
+
+test_expect_success 'diff HEAD against HEAD^' '
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09..e98d27a 100644
+	a is converted to UTF-8 from UTF-16
+	b is converted to UTF-8 from UTF-16
+	--- a/file
+	+++ b/file
+	@@ -1,3 +1,3 @@
+	-ärger
+	 öse
+	 übel
+	+åst
+EOF
+	git diff HEAD^ HEAD -- file >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff --binary HEAD against HEAD^' '
+	cat >expect <<-\EOF &&
+	diff --git a/file b/file
+	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
+	GIT binary patch
+	literal 28
+	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
+
+	literal 32
+	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
+	
+EOF
+	git diff --binary HEAD^ HEAD -- file >actual &&
+	test_cmp expect actual
+'
+
+test_done
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..a2184d0300 100644
--- a/utf8.h
+++ b/utf8.h
@@ -16,6 +16,17 @@ int utf8_fprintf(FILE *, const char *, ...);
 extern const char utf8_bom[];
 extern int skip_utf8_bom(char **, size_t);
 
+static inline int buffer_has_utf16_bom(const void *buf, size_t len) {
+  const unsigned char *text = (const unsigned char *)buf;
+  if (!text || len < 2)
+    return 0;
+  if (text[0] == 0xff && text[1] == 0xfe)
+    return 1;
+  if (text[0] == 0xfe && text[1] == 0xff)
+    return 1;
+  return 0;
+}
+
 void strbuf_add_wrapped_text(struct strbuf *buf,
 		const char *text, int indent, int indent2, int width);
 void strbuf_add_wrapped_bytes(struct strbuf *buf, const char *data, int len,


* Re: [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-02-16 12:55   ` Ævar Arnfjörð Bjarmason
@ 2018-02-16 18:45     ` Jeff King
  2018-02-16 19:30       ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-16 18:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: lars.schneider, Git Mailing List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Eric Sunshine,
	Ramsay Jones, Johannes Schindelin, Lars Schneider

On Fri, Feb 16, 2018 at 01:55:02PM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Feb 15, 2018 at 4:27 PM,  <lars.schneider@autodesk.com> wrote:
> > Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
> > allocate the buffer for the lower case string with xmallocz(). This
> > already ensures a NUL at the end of the allocated buffer.
> >
> > Remove the unnecessary assignment.
> > [...]
> >         for (i = 0; i < len; i++)
> >                 result[i] = tolower(string[i]);
> > -       result[i] = '\0';
> >         return result;
> >  }
> 
> I agree with this approach, but it's worth noting for other reviewers
> that there's been some disagreement here on-list (between Eric & I)
> about whether these sorts of patterns should be removed or kept
> (although the calloc() case is slightly different from mallocz()),
> see: https://public-inbox.org/git/871shum182.fsf@evledraar.gmail.com/

Hmm. I do think xmallocz is a bit more explicit instruction of "please
NUL-terminate this for me" than xcalloc is. So I don't think it's
inconsistent to say this one is OK, but the trailing-NULL one that you
linked is not.

I'm not sure that I have a strong opinion on either case. But in general
I'd probably err on the side of leaving such lines in, for the sake of
being explicit.

Of course this particular case could just be:

  char *result = xstrdup(string);
  for (i = 0; result[i]; i++)
	result[i] = tolower(result[i]);

I picked the current implementation in 88d5a6f6cd (daemon/config: factor
out duplicate xstrdup_tolower, 2014-05-22) because it might be more
efficient (it avoids an extra copy), but I doubt it matters much in
practice.
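
For readers without the Git tree at hand, the two variants discussed here can
be sketched side by side. This is an illustration only: plain `malloc()` and
`memcpy()` stand in for Git's `xmallocz()` and `xstrdup()`, and the function
names `tolower_single_pass()` and `tolower_copy_first()` are made up for this
example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Variant A: allocate once and lower-case while copying. The buffer is
 * NUL-terminated at allocation time (the guarantee xmallocz() gives),
 * so the loop needs no trailing "result[i] = '\0'".
 */
char *tolower_single_pass(const char *string)
{
	size_t i, len = strlen(string);
	char *result = malloc(len + 1);

	if (!result)
		return NULL;
	result[len] = '\0';
	for (i = 0; i < len; i++)
		result[i] = tolower((unsigned char)string[i]);
	return result;
}

/*
 * Variant B: duplicate the string first (as xstrdup() would), then
 * lower-case it in place. One extra pass over the data, but the
 * terminator comes along with the copy.
 */
char *tolower_copy_first(const char *string)
{
	size_t i, len = strlen(string);
	char *result = malloc(len + 1);

	if (!result)
		return NULL;
	memcpy(result, string, len + 1);
	for (i = 0; result[i]; i++)
		result[i] = tolower((unsigned char)result[i]);
	return result;
}
```

Both return a freshly allocated lower-cased copy that the caller frees; the
difference is only whether the input is read once or twice.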

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-15 22:09   ` Jeff King
@ 2018-02-16 18:55     ` Junio C Hamano
  2018-02-16 19:25       ` Jeff King
  2018-02-21 18:06       ` Lars Schneider
  0 siblings, 2 replies; 41+ messages in thread
From: Junio C Hamano @ 2018-02-16 18:55 UTC (permalink / raw)
  To: Jeff King
  Cc: lars.schneider, git, tboegi, j6t, sunshine, ramsay,
	Johannes.Schindelin, Lars Schneider

Jeff King <peff@peff.net> writes:

> So a full proposal would support both cases: "check this out in the
> local platform's preferred encoding" and "always check this out in
> _this_ encoding". And Lars's proposal is just the second half of that.

Actually, what you seem to take as a whole is just half of the
story.  The other half is the ability to say "what is in the
repository for this path is stored in this encoding".  I agree that
"check it out in this encoding" is a useful thing to have, and using
the in-tree .gitattributes as a place to state the project-wide
preference may be OK (and .git/info/attributes should be able to
override it if needed -- this probably deserves to be added to a
test somewhere by this series).

Luckily, lack of 'in-repository-encoding' attribute is not a show
stopper for this series.  A later topic could start with "earlier,
in order to make use of w-t-e attribute, you had to have your
contents in UTF-8.  Teach the codepath to honor a new attribute that
tells in what encoding the blob contents are stored." without having
to be a part of this topic.


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 14:42   ` Lars Schneider
  2018-02-16 16:58     ` Torsten Bögershausen
@ 2018-02-16 19:04     ` Junio C Hamano
  1 sibling, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2018-02-16 19:04 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

>> One thing I find more problematic is that the above places *too*
>> much stress on the UTF-8 centric worldview.  It is perfectly valid
>> to store your text contents encoded in ShiftJIS and check them out
>> as-is, with or without this patch.  It is grossly misleading to say
>> that older versions of Git will check them out in UTF-8.  "will
>> checkout these files as-is without encoding conversion" is a better
>> way to say it, probably.
>
> True. But that's not what I wanted to say in the "pitfalls" section.
> If my Git client supports w-t-e and I add the ShiftJIS encoded
> file "foo.bar" to my repository, then Git will store the file as
> UTF-8 _internally_. That means if you clone my repository and your 
> Git client does _not_ support w-t-e, then you will see "foo.bar" as 
> UTF-8 encoded.

What you wrote implies *more* than that, which is what I had trouble
with.

If you say "what you have is checked out as-is", then it is still
clear that those who use w-t-e to convert non UTF-8 into UTF-8 when
checking in will get UTF-8 out when they use an older version of
Git.  If you say "what you have will be checked out in UTF-8", it
makes it sound as if pre w-t-e Git will somehow reject non UTF-8
in-tree contents, or magically convert anything to UTF-8 while
checking out, which is *not* what you want to imply.

>> Also notice that even in the world with w-t-e, such a project won't
>> benefit from w-t-e at all.  After all, they have been happy using
>> ShiftJIS in repository and using the same encoding on the working
>> tree, and because w-t-e assumes that everybody should be using UTF-8
>> internally, such a project cannot take advantage of the new
>> mechanism.
>
> Agreed. However, people using ShiftJIS are not my target audience.

Be aware that you are writing *not* *solely* for your target
audience.  You write documentation for everybody, and make sure the
description of a feature makes it clear who the feature primarily
targets and how using (or not using) the feature affects users.


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 18:55     ` Junio C Hamano
@ 2018-02-16 19:25       ` Jeff King
  2018-02-16 19:27         ` Jeff King
  2018-02-21 18:06       ` Lars Schneider
  1 sibling, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-16 19:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, ramsay,
	Johannes.Schindelin, Lars Schneider

On Fri, Feb 16, 2018 at 10:55:58AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > So a full proposal would support both cases: "check this out in the
> > local platform's preferred encoding" and "always check this out in
> > _this_ encoding". And Lars's proposal is just the second half of that.
> 
> Actually, what you seem to take as a whole is just half of the
> story.  The other half is the ability to say "what is in the
> repository for this path is stored in this encoding".  I agree that
> "check it out in this encoding" is a useful thing to have, and using
> the in-tree .gitattributes as a place to state the project-wide
> preference may be OK (and .git/info/attributes should be able to
> override it if needed -- this probably deserves to be added to a
> test somewhere by this series).

If we are just talking about a check-out feature, I'm not sure that the
in-repository encoding is all that interesting. As with CRLFs, we would
be declaring UTF-8 as the "canonical" in-repo encoding for such
conversions. Is there a reason you'd want something else?

If the feature were "the in-repo encoding is X, and I want you to show
me a diff using encoding Y", then I could see the use of that (and I
think for most people's purposes that would be an equally valid solution
to their problem).

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 19:25       ` Jeff King
@ 2018-02-16 19:27         ` Jeff King
  2018-02-16 19:41           ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-16 19:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, ramsay,
	Johannes.Schindelin, Lars Schneider

On Fri, Feb 16, 2018 at 02:25:41PM -0500, Jeff King wrote:

> On Fri, Feb 16, 2018 at 10:55:58AM -0800, Junio C Hamano wrote:
> 
> > Jeff King <peff@peff.net> writes:
> > 
> > > So a full proposal would support both cases: "check this out in the
> > > local platform's preferred encoding" and "always check this out in
> > > _this_ encoding". And Lars's proposal is just the second half of that.
> > 
> > Actually, what you seem to take as a whole is just half of the
> > story.  The other half is the ability to say "what is in the
> > repository for this path is stored in this encoding".  I agree that
> > "check it out in this encoding" is a useful thing to have, and using
> > the in-tree .gitattributes as a place to state the project-wide
> > preference may be OK (and .git/info/attributes should be able to
> > override it if needed -- this probably deserves to be added to a
> > test somewhere by this series).
> 
> If we are just talking about a check-out feature, I'm not sure that the
> in-repository encoding is all that interesting. As with CRLFs, we would
> be declaring UTF-8 as the "canonical" in-repo encoding for such
> conversions. Is there a reason you'd want something else?

Maybe answering my own question: because your encoding of choice does
not round-trip to UTF-8?

In which case yeah, I could see choosing an in-repo encoding to possibly
be useful (but it also seems like a feature that could easily be tacked
on later if somebody cares).
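
Whether a given buffer survives the trip to UTF-8 and back can be probed with
POSIX iconv(3). The following is a rough sketch under stated assumptions, not
the check implemented by this series: `convert_buf()` and
`roundtrips_via_utf8()` are hypothetical names, and the encoding names accepted
depend on the local iconv implementation.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert in[0..inlen) from encoding `from` to
 * encoding `to`. Returns a malloc'd buffer and stores its length in
 * *outlen, or returns NULL if the conversion is not possible.
 */
char *convert_buf(const char *from, const char *to,
		  const char *in, size_t inlen, size_t *outlen)
{
	iconv_t cd = iconv_open(to, from);
	size_t cap = inlen * 4 + 16;	/* generous upper bound */
	size_t inleft = inlen, outleft = cap, rc;
	char *out, *inp = (char *)in, *outp;

	if (cd == (iconv_t)-1)
		return NULL;		/* unknown encoding pair */
	out = malloc(cap);
	if (!out) {
		iconv_close(cd);
		return NULL;
	}
	outp = out;
	rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	if (rc != (size_t)-1)
		rc = iconv(cd, NULL, NULL, &outp, &outleft); /* flush state */
	iconv_close(cd);
	if (rc == (size_t)-1 || inleft) {
		free(out);
		return NULL;
	}
	*outlen = cap - outleft;
	return out;
}

/*
 * Returns 1 if buf is byte-identical after enc -> UTF-8 -> enc,
 * 0 if it differs, and -1 if either conversion fails.
 */
int roundtrips_via_utf8(const char *enc, const char *buf, size_t len)
{
	size_t utf8_len, back_len;
	char *utf8, *back;
	int same;

	utf8 = convert_buf(enc, "UTF-8", buf, len, &utf8_len);
	if (!utf8)
		return -1;
	back = convert_buf("UTF-8", enc, utf8, utf8_len, &back_len);
	free(utf8);
	if (!back)
		return -1;
	same = back_len == len && !memcmp(back, buf, len);
	free(back);
	return same;
}
```

An encoding for which this returns 0 or -1 on real project files would be a
candidate for the in-repo encoding idea discussed here, since UTF-8 cannot
serve as a lossless canonical form for it.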

-Peff


* Re: [PATCH v7 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-02-16 18:45     ` Jeff King
@ 2018-02-16 19:30       ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2018-02-16 19:30 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, lars.schneider,
	Git Mailing List, Torsten Bögershausen, Johannes Sixt,
	Eric Sunshine, Ramsay Jones, Johannes Schindelin, Lars Schneider

Jeff King <peff@peff.net> writes:

>> (although the calloc() case is slightly different from mallocz()),
>> see: https://public-inbox.org/git/871shum182.fsf@evledraar.gmail.com/
>
> Hmm. I do think xmallocz is a bit more explicit instruction of "please
> NUL-terminate this for me" than xcalloc is. So I don't think it's
> inconsistent to say this one is OK, but the trailing-NULL one that you
> linked is not.

Yeah, I too thought "slightly different" was an understatement of
the week ;-).



* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 19:27         ` Jeff King
@ 2018-02-16 19:41           ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2018-02-16 19:41 UTC (permalink / raw)
  To: Jeff King
  Cc: lars.schneider, git, tboegi, j6t, sunshine, ramsay,
	Johannes.Schindelin, Lars Schneider

Jeff King <peff@peff.net> writes:

> In which case yeah, I could see choosing an in-repo encoding to possibly
> be useful (but it also seems like a feature that could easily be tacked
> on later if somebody cares).

Yes, I think we are on the same page---in-repo-encoding that is a
natural counterpart to w-t-e attribute can be added later if/when
somebody finds it useful, and it is perfectly OK to declare that we
cater only to UTF-8 users until that happens.


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 18:55     ` Junio C Hamano
  2018-02-16 19:25       ` Jeff King
@ 2018-02-21 18:06       ` Lars Schneider
  1 sibling, 0 replies; 41+ messages in thread
From: Lars Schneider @ 2018-02-21 18:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, lars.schneider, git, tboegi, j6t, sunshine, ramsay,
	Johannes.Schindelin


> On 16 Feb 2018, at 19:55, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Jeff King <peff@peff.net> writes:
> 
>> So a full proposal would support both cases: "check this out in the
>> local platform's preferred encoding" and "always check this out in
>> _this_ encoding". And Lars's proposal is just the second half of that.
> 
> Actually, what you seem to take as a whole is just half of the
> story.  The other half is the ability to say "what is in the
> repository for this path is stored in this encoding".  I agree that
> "check it out in this encoding" is a useful thing to have, and using
> the in-tree .gitattributes as a place to state the project-wide
> preference may be OK (and .git/info/attributes should be able to
> override it if needed -- this probably deserves to be added to a
> test somewhere by this series).

Good call! I'll add this test case!


> Luckily, lack of 'in-repository-encoding' attribute is not a show
> stopper for this series.  A later topic could start with "earlier,
> in order to make use of w-t-e attribute, you had to have your
> contents in UTF-8.  Teach the codepath to honor a new attribute that
> tells in what encoding the blob contents are stored." without having
> to be a part of this topic.

I have the impression that this is the purpose of the already existing 
"encoding" attribute, no? AFAIK this attribute is only respected by 
gitk, though. A future series could make Git respect this attribute too.


- Lars



* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-16 16:58     ` Torsten Bögershausen
@ 2018-02-22 20:00       ` Lars Schneider
  2018-02-22 20:12         ` Jeff King
  2018-02-23 16:35         ` Junio C Hamano
  0 siblings, 2 replies; 41+ messages in thread
From: Lars Schneider @ 2018-02-22 20:00 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Junio C Hamano, lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 16 Feb 2018, at 17:58, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Fri, Feb 16, 2018 at 03:42:35PM +0100, Lars Schneider wrote:
> []
>> 
>> Agreed. However, people using ShiftJIS are not my target audience.
>> My target audience is:
>> 
>> (1) People that have to encode their text files in UTF-16 (for 
>>    whatever reason - usually because of legacy processes or tools).
>> 
>> (2) People that want to see textual diffs of their UTF-16 encoded 
>>    files in their Git tools without special adjustments (on the 
>>    command line, on the web, etc).
>> 
>> That was my primary motivation. The fact that w-t-e supports any
>> other encoding too is just a nice side effect. I don't foresee people
>> using w-t-encodings other than UTF-16 in my organization.
>> 
>> I have the suspicion that the feature could be useful for the Git
>> community at large. Consider this Stack Overflow question:
>> https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
>> 
>> This question was viewed 42k times and there is no good solution.
>> I believe w-t-e could be a good solution.
>> 
> 
> If it were only about diffing UTF-16 files, I could suggest a patch.
> I simply copy-paste it here for review; if someone thinks that it may
> be useful, I can send it as a real patch/RFC.

That's a nice idea but I see two potential problems:

(1) Git hosting services (GitLab, BitBucket, GitHub, ...) would still
    show these files as binary. That's a huge problem for my users as
    they interact more with these services than the Git command line.
    That's the main reason why I implemented the "UTF-8 as canonical
    form" approach in my series.

(2) You can only detect a BOM if the encoding is UTF-16. UTF-16BE and
    UTF-16LE must not have a BOM and therefore cannot be easily
    detected. Plus, even if you detect a UTF-16 BOM, it would be
    just a hint that the file is likely UTF-16 encoded, as the sequence
    could be there by chance.
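
Both halves of point (2) can be seen with a few bytes. The sketch below
repeats the two-byte check from the patch under the hypothetical name
`has_utf16_bom()`; the sample buffers are invented for illustration.

```c
#include <stddef.h>

/* Same two-byte test as buffer_has_utf16_bom() in the patch under review. */
int has_utf16_bom(const void *buf, size_t len)
{
	const unsigned char *text = buf;

	if (!text || len < 2)
		return 0;
	return (text[0] == 0xff && text[1] == 0xfe) ||
	       (text[0] == 0xfe && text[1] == 0xff);
}

/* "ä" as UTF-16 with a little-endian BOM: detected. */
const unsigned char sample_utf16_bom[] = { 0xff, 0xfe, 0xe4, 0x00 };

/* "ä" as BOM-less UTF-16LE: missed (false negative). */
const unsigned char sample_utf16le[] = { 0xe4, 0x00 };

/* Arbitrary binary data that happens to start with FF FE: false positive. */
const unsigned char sample_binary[] = { 0xff, 0xfe, 0x00, 0x01, 0x7f };
```

Calling `has_utf16_bom()` on the three samples yields 1, 0 and 1, so the BOM
test alone neither finds all UTF-16 content nor proves that a match is text.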

I still think it would be nice to see diffs for arbitrary encodings.
Would it be an option to read the `encoding` attribute and use it in
`git diff`?

- Lars


> 
> git show HEAD
> 
> 
> commit 9f7d43f29eaf6017b7b16261ce91d8ef182cf415
> Author: Torsten Bögershausen <tboegi@web.de>
> Date:   Fri Feb 2 15:35:23 2018 +0100
> 
>    Auto diff of UTF-16 files in UTF-8
> 
>    When a UTF-16 file is committed and later changed, `git diff` shows
>    "Binary files XX and YY differ".
> 
>    When the user wants a diff in UTF-8, a textconv needs to be specified
>    in .gitattributes and the textconv must be configured.
> 
>    A more user-friendly diff can be produced for UTF-16 if
>    - the user did not use `git diff --binary`
>    - the blob is identified as binary
>    - the blob has a UTF-16 BOM
>    - the blob can be converted into UTF-8
> 
>    Enhance the diff machinery to auto-detect UTF-16 blobs and show them
>    as UTF-8, unless the user specifies `git diff --binary` which creates
>    a binary diff.
> 
> diff --git a/diff.c b/diff.c
> index fb22b19f09..51831ee94d 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3192,6 +3192,10 @@ static void builtin_diff(const char *name_a,
> 			strbuf_reset(&header);
> 		}
> 
> +		if (one && one->reencoded_from_utf16)
> +			strbuf_addf(&header, "a is converted to UTF-8 from UTF-16\n");
> +		if (two && two->reencoded_from_utf16)
> +			strbuf_addf(&header, "b is converted to UTF-8 from UTF-16\n");
> 		mf1.size = fill_textconv(textconv_one, one, &mf1.ptr);
> 		mf2.size = fill_textconv(textconv_two, two, &mf2.ptr);
> 
> @@ -3611,8 +3615,25 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> 			s->size = size;
> 			s->should_free = 1;
> 		}
> -	}
> -	else {
> +		if (!s->binary && buffer_is_binary(s->data, s->size) &&
> +		    buffer_has_utf16_bom(s->data, s->size)) {
> +			int outsz = 0;
> +			char *outbuf;
> +			outbuf = reencode_string_len(s->data, (int)s->size,
> +						     "UTF-8", "UTF-16", &outsz);
> +			if (outbuf) {
> +				if (s->should_free)
> +					free(s->data);
> +				if (s->should_munmap)
> +					munmap(s->data, s->size);
> +				s->should_munmap = 0;
> +				s->data = outbuf;
> +				s->size = outsz;
> +				s->reencoded_from_utf16 = 1;
> +				s->should_free = 1;
> +			}
> +		}
> +	} else {
> 		enum object_type type;
> 		if (size_only || (flags & CHECK_BINARY)) {
> 			type = sha1_object_info(s->oid.hash, &s->size);
> @@ -3629,6 +3650,19 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> 		s->data = read_sha1_file(s->oid.hash, &type, &s->size);
> 		if (!s->data)
> 			die("unable to read %s", oid_to_hex(&s->oid));
> +		if (!s->binary && buffer_is_binary(s->data, s->size) &&
> +		    buffer_has_utf16_bom(s->data, s->size)) {
> +			int outsz = 0;
> +			char *buf;
> +			buf = reencode_string_len(s->data, (int)s->size,
> +						  "UTF-8", "UTF-16", &outsz);
> +			if (buf) {
> +				free(s->data);
> +				s->data = buf;
> +				s->size = outsz;
> +				s->reencoded_from_utf16 = 1;
> +			}
> +		}
> 		s->should_free = 1;
> 	}
> 	return 0;
> @@ -5695,6 +5729,10 @@ static int diff_filespec_is_identical(struct diff_filespec *one,
> 
> static int diff_filespec_check_stat_unmatch(struct diff_filepair *p)
> {
> +	if (p->binary) {
> +		p->one->binary = 1;
> +		p->two->binary = 1;
> +	}
> 	if (p->done_skip_stat_unmatch)
> 		return p->skip_stat_unmatch_result;
> 
> @@ -5735,6 +5773,7 @@ static void diffcore_skip_stat_unmatch(struct diff_options *diffopt)
> 	for (i = 0; i < q->nr; i++) {
> 		struct diff_filepair *p = q->queue[i];
> 
> +		p->binary = diffopt->flags.binary;
> 		if (diff_filespec_check_stat_unmatch(p))
> 			diff_q(&outq, p);
> 		else {
> diff --git a/diffcore.h b/diffcore.h
> index a30da161da..3cd97bb93b 100644
> --- a/diffcore.h
> +++ b/diffcore.h
> @@ -47,6 +47,8 @@ struct diff_filespec {
> 	unsigned has_more_entries : 1; /* only appear in combined diff */
> 	/* data should be considered "binary"; -1 means "don't know yet" */
> 	signed int is_binary : 2;
> +	unsigned binary : 1;
> +	unsigned reencoded_from_utf16 : 1;
> 	struct userdiff_driver *driver;
> };
> 
> @@ -72,6 +74,7 @@ struct diff_filepair {
> 	unsigned is_unmerged : 1;
> 	unsigned done_skip_stat_unmatch : 1;
> 	unsigned skip_stat_unmatch_result : 1;
> +	unsigned binary : 1;
> };
> #define DIFF_PAIR_UNMERGED(p) ((p)->is_unmerged)
> 
> diff --git a/t/t4066-diff-encoding.sh b/t/t4066-diff-encoding.sh
> new file mode 100755
> index 0000000000..9bb3c70ada
> --- /dev/null
> +++ b/t/t4066-diff-encoding.sh
> @@ -0,0 +1,98 @@
> +#!/bin/sh
> +
> +test_description='git diff with encoding attribute'
> +
> +. ./test-lib.sh
> +
> +printf '\303\244rger\n\303\266se\n\303\274bel\n' |
> +	iconv -f UTF-8 -t UTF-16 >UTF-16
> +printf '\303\266se\n\303\274bel\n\303\245st\n' |
> +	iconv -f UTF-8 -t UTF-16 >file2
> +
> +test_expect_success 'setup' '
> +	cp UTF-16 file &&
> +	git add file &&
> +	git commit -m "add file in UTF-16" &&
> +	test_tick &&
> +	echo "file encoding=UTF-16" >.gitattributes
> +'
> +
> +test_expect_success 'diff against local change' '
> +	cp file2 file &&
> +	test_tick &&
> +	cat >expect <<-\EOF &&
> +	diff --git a/file b/file
> +	index 26acf09..e98d27a 100644
> +	a is converted to UTF-8 from UTF-16
> +	b is converted to UTF-8 from UTF-16
> +	--- a/file
> +	+++ b/file
> +	@@ -1,3 +1,3 @@
> +	-ärger
> +	 öse
> +	 übel
> +	+åst
> +EOF
> +	git diff file >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'diff --binary against local change' '
> +	cp file2 file &&
> +	test_tick &&
> +	cat >expect <<-\EOF &&
> +	diff --git a/file b/file
> +	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
> +	GIT binary patch
> +	literal 28
> +	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
> +
> +	literal 32
> +	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
> +
> +EOF
> +	git diff --binary file >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'commit local change' '
> +	git add file &&
> +	git commit -m "add file V2 in UTF-16" &&
> +	test_tick
> +'
> +
> +test_expect_success 'diff HEAD against HEAD^' '
> +	cat >expect <<-\EOF &&
> +	diff --git a/file b/file
> +	index 26acf09..e98d27a 100644
> +	a is converted to UTF-8 from UTF-16
> +	b is converted to UTF-8 from UTF-16
> +	--- a/file
> +	+++ b/file
> +	@@ -1,3 +1,3 @@
> +	-ärger
> +	 öse
> +	 übel
> +	+åst
> +EOF
> +	git diff HEAD^ HEAD -- file >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'diff --binary HEAD against HEAD^' '
> +	cat >expect <<-\EOF &&
> +	diff --git a/file b/file
> +	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
> +	GIT binary patch
> +	literal 28
> +	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi
> +
> +	literal 32
> +	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4
> +	
> +EOF
> +	git diff --binary HEAD^ HEAD -- file >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done
> diff --git a/utf8.h b/utf8.h
> index 6bbcf31a83..a2184d0300 100644
> --- a/utf8.h
> +++ b/utf8.h
> @@ -16,6 +16,17 @@ int utf8_fprintf(FILE *, const char *, ...);
> extern const char utf8_bom[];
> extern int skip_utf8_bom(char **, size_t);
> 
> +static inline int buffer_has_utf16_bom(const void *buf, size_t len) {
> +  const unsigned char *text = (unsigned char *)buf;
> +  if (!text ||  len < 2)
> +    return 0;
> +  if (text[0] == 0xff && text[1] == 0xfe)
> +    return 1;
> +  if (text[0] == 0xfe && text[1] == 0xff)
> +    return 1;
> +  return 0;
> +}
> +
> void strbuf_add_wrapped_text(struct strbuf *buf,
> 		const char *text, int indent, int indent2, int width);
> void strbuf_add_wrapped_bytes(struct strbuf *buf, const char *data, int len,
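
For readers who want to try the BOM check from the quoted utf8.h hunk outside of Git, here is a minimal standalone sketch. The enum and the name classify_utf16_bom() are hypothetical and not part of the patch; the byte comparisons are the same ones the quoted buffer_has_utf16_bom() performs, with the byte order reported instead of a plain 0/1.

```c
#include <stddef.h>

/* Hypothetical result type; the quoted patch only returns 0 or 1. */
enum utf16_bom {
	UTF16_BOM_NONE,
	UTF16_BOM_LE,	/* bytes FF FE */
	UTF16_BOM_BE	/* bytes FE FF */
};

/*
 * Same comparisons as the quoted buffer_has_utf16_bom(), but the
 * detected byte order is reported to the caller.
 */
static enum utf16_bom classify_utf16_bom(const void *buf, size_t len)
{
	const unsigned char *text = buf;

	if (!text || len < 2)
		return UTF16_BOM_NONE;
	if (text[0] == 0xff && text[1] == 0xfe)
		return UTF16_BOM_LE;
	if (text[0] == 0xfe && text[1] == 0xff)
		return UTF16_BOM_BE;
	return UTF16_BOM_NONE;
}
```

One caveat of any two-byte check like this: a UTF-32LE BOM (FF FE 00 00) starts with the same two bytes as a UTF-16LE BOM, so such content would be classified as UTF-16LE here.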



* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-22 20:00       ` Lars Schneider
@ 2018-02-22 20:12         ` Jeff King
  2018-02-23 16:35         ` Junio C Hamano
  1 sibling, 0 replies; 41+ messages in thread
From: Jeff King @ 2018-02-22 20:12 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, Junio C Hamano, lars.schneider, git,
	j6t, sunshine, ramsay, Johannes.Schindelin

On Thu, Feb 22, 2018 at 09:00:45PM +0100, Lars Schneider wrote:

> > If it was only about a diff of UTF-16 files, I may suggest a patch.
> > I simply copy-paste it here for review, if someone thinks that it may
> > be useful, I can send it as a real patch/RFC.
> 
> That's a nice idea but I see two potential problems:
> 
> (1) Git hosting services (GitLab, BitBucket, GitHub, ...) would still
>     show these files as binary. That's a huge problem for my users as
>     they interact more with these services than the Git command line.
>     That's the main reason why I implemented the "UTF-8 as canonical
>     form" approach in my series.

I can't speak for the other services, but I can tell you that GitHub
would be pretty eager to enable such a feature if it existed.

I suspect most services providing human-readable diffs would want to do
the same. Though there are still cases where you'd see a binary patch
(e.g., format-patch in emails, or GitHub's .patch endpoint, since those
are meant to be applied and must contain the "real" data).

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-22 20:00       ` Lars Schneider
  2018-02-22 20:12         ` Jeff King
@ 2018-02-23 16:35         ` Junio C Hamano
  2018-02-23 20:11           ` Junio C Hamano
  1 sibling, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2018-02-23 16:35 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, lars.schneider, git, j6t, sunshine,
	peff, ramsay, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> I still think it would be nice to see diffs for arbitrary encodings.
> Would it be an option to read the `encoding` attribute and use it in
> `git diff`?

Reusing that gitk-only thing and suddenly start doing so would break
gitk users, no?  The tool expects the diff to come out encoded in
the encoding that is specified by that attribute (which is learned
from get_path_encoding helper) and does its thing.

I guess that gitk uses diff-tree plumbing and you won't be applying
this change to the plumbing, perhaps?  If so, it might not be too
bad, but those who decided to postprocess "git diff" output (instead
of "git diff-tree" output) mimicking how gitk does it by thinking
that is the safe and sane thing to do will be broken by such a
change.  You could do "use the encoding only when a command line
option says so", but then people will add a configuration variable
to turn it always on and these existing scripts will be broken.

I do not personally have much sympathy for the last case (i.e. those
who scripted around 'git diff' instead of 'git diff-tree' to get
broken), so making the new feature only work with the Porcelain "git
diff" might be an option.  I'll need a bit more time to formulate
the rest of my thought ;-)


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-23 16:35         ` Junio C Hamano
@ 2018-02-23 20:11           ` Junio C Hamano
  2018-02-24 15:18             ` Lars Schneider
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2018-02-23 20:11 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, lars.schneider, git, j6t, sunshine,
	peff, ramsay, Johannes.Schindelin

Junio C Hamano <gitster@pobox.com> writes:

> Lars Schneider <larsxschneider@gmail.com> writes:
>
>> I still think it would be nice to see diffs for arbitrary encodings.
>> Would it be an option to read the `encoding` attribute and use it in
>> `git diff`?
>
> Reusing that gitk-only thing and suddenly start doing so would break
> gitk users, no?  The tool expects the diff to come out encoded in
> the encoding that is specified by that attribute (which is learned
> from get_path_encoding helper) and does its thing.
>
> I guess that gitk uses diff-tree plumbing and you won't be applying
> this change to the plumbing, perhaps?  If so, it might not be too
> bad, but those who decided to postprocess "git diff" output (instead
> of "git diff-tree" output) mimicking how gitk does it by thinking
> that is the safe and sane thing to do will be broken by such a
> change.  You could do "use the encoding only when a command line
> option says so", but then people will add a configuration variable
> to turn it always on and these existing scripts will be broken.
>
> I do not personally have much sympathy for the last case (i.e. those
> who scripted around 'git diff' instead of 'git diff-tree' to get
> broken), so making the new feature only work with the Porcelain "git
> diff" might be an option.  I'll need a bit more time to formulate
> the rest of my thought ;-)

So we are introducing in this series a way to say in what encoding
the things should be placed in the working tree files (i.e. the
w-t-e attribute attached to the paths).  Currently there is no
mechanism to say what encoding the in-repo contents are and UTF-8 is
assumed when conversion from/to w-t-e is required, but there is no
fundamental reason why it shouldn't be customizable (if anything, as
a piece of fact on the in-repo data, in-repo-encoding is *more*
appropriate to be an attribute than w-t-e that can merely be project
preference at best, as I mentioned earlier in this thread).  

We always use the in-repo contents when generating 'diff'.  I think
by "attribute to be used in diff", what you are really after is
to convert the in-repo contents to that encoding _BEFORE_ running
'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
the place will not diff well with the xdiff machinery, but if you
first convert it to UTF-8 and have xdiff work on it, you can get
reasonable result out of it.  It is unclear what encoding you want
your final diff output in (it is equally valid in such a set-up to
desire your patch output in UTF-16 or UTF-8), but assuming that you
want UTF-8 in your patch output, perhaps we do not have to break
gitk users by hijacking the 'encoding' attribute.  Instead what you
want is a single bit that says between in-repo or working tree which
representation should be given to the xdiff machinery.  When that
bit is set, then we

 - First ensure that both sides of the diff input are in the working
   tree encoding by running them through convert_to_working_tree();

 - Run xdiff on it;

 - Take the xdiff result, and run it through convert_to_git(),
   before emitting (this is optional, making this a one-and-half bit
   option).

That would allow you to say "I have in-repo data in UTF-16 which I
check out as UTF-8.  xdiff machinery is unhappy.  Please do
something." perhaps?

The other way around (i.e. in-repo is UTF-8, but working tree
encoding is UTF-16) won't have xdiff issues, I would imagine.
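
To illustrate the point above about NUL bytes: the sketch below shows why UTF-16 content trips a "contains NUL, so it is binary" heuristic, and why the same text converted to UTF-8 does not. Both function names are hypothetical. looks_binary() only approximates such a heuristic (the 8000-byte window is an assumption for illustration), and utf16le_to_utf8() is a hand-rolled stand-in for the iconv-based reencoding that Git would actually use.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: "a NUL byte in the first few KB means binary". */
static int looks_binary(const void *buf, size_t len)
{
	if (len > 8000)
		len = 8000;
	return memchr(buf, 0, len) != NULL;
}

/*
 * Convert UTF-16LE input (without BOM) to UTF-8.
 * Returns the number of bytes written, or -1 on malformed input
 * or when the output buffer is too small.
 */
static long utf16le_to_utf8(const unsigned char *in, size_t inlen,
			    unsigned char *out, size_t outcap)
{
	size_t i = 0, o = 0;

	if (inlen % 2)
		return -1;
	while (i < inlen) {
		unsigned long cp = in[i] | ((unsigned long)in[i + 1] << 8);
		unsigned char tmp[4];
		size_t n;

		i += 2;
		if (cp >= 0xD800 && cp <= 0xDBFF) {
			/* high surrogate: must be followed by a low one */
			unsigned long lo;

			if (i + 2 > inlen)
				return -1;
			lo = in[i] | ((unsigned long)in[i + 1] << 8);
			if (lo < 0xDC00 || lo > 0xDFFF)
				return -1;
			i += 2;
			cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			return -1; /* stray low surrogate */
		}

		if (cp < 0x80) {
			tmp[0] = (unsigned char)cp;
			n = 1;
		} else if (cp < 0x800) {
			tmp[0] = (unsigned char)(0xC0 | (cp >> 6));
			tmp[1] = (unsigned char)(0x80 | (cp & 0x3F));
			n = 2;
		} else if (cp < 0x10000) {
			tmp[0] = (unsigned char)(0xE0 | (cp >> 12));
			tmp[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			tmp[2] = (unsigned char)(0x80 | (cp & 0x3F));
			n = 3;
		} else {
			tmp[0] = (unsigned char)(0xF0 | (cp >> 18));
			tmp[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
			tmp[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
			tmp[3] = (unsigned char)(0x80 | (cp & 0x3F));
			n = 4;
		}
		if (o + n > outcap)
			return -1;
		memcpy(out + o, tmp, n);
		o += n;
	}
	return (long)o;
}
```

Every character in the ASCII and Latin-1 range carries a zero high byte in UTF-16, so even a short text file looks binary to such a check, while its UTF-8 form has no NUL bytes and can be handed to a line-based diff.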


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-23 20:11           ` Junio C Hamano
@ 2018-02-24 15:18             ` Lars Schneider
  2018-02-26  1:44               ` Jeff King
  0 siblings, 1 reply; 41+ messages in thread
From: Lars Schneider @ 2018-02-24 15:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, git, Johannes Sixt,
	Eric Sunshine, peff, ramsay, Johannes.Schindelin


> On 23 Feb 2018, at 21:11, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Lars Schneider <larsxschneider@gmail.com> writes:
>> 
>>> I still think it would be nice to see diffs for arbitrary encodings.
>>> Would it be an option to read the `encoding` attribute and use it in
>>> `git diff`?
>> 
>> Reusing that gitk-only thing and suddenly start doing so would break
>> gitk users, no?  The tool expects the diff to come out encoded in
>> the encoding that is specified by that attribute (which is learned
>> from get_path_encoding helper) and does its thing.
>> 
>> I guess that gitk uses diff-tree plumbing and you won't be applying
>> this change to the plumbing, perhaps?  If so, it might not be too
>> bad, but those who decided to postprocess "git diff" output (instead
>> of "git diff-tree" output) mimicking how gitk does it by thinking
>> that is the safe and sane thing to do will be broken by such a
>> change.  You could do "use the encoding only when a command line
>> option says so", but then people will add a configuration variable
>> to turn it always on and these existing scripts will be broken.
>> 
>> I do not personally have much sympathy for the last case (i.e. those
>> who scripted around 'git diff' instead of 'git diff-tree' to get
>> broken), so making the new feature only work with the Porcelain "git
>> diff" might be an option.  I'll need a bit more time to formulate
>> the rest of my thought ;-)
> 
> So we are introducing in this series a way to say in what encoding
> the things should be placed in the working tree files (i.e. the
> w-t-e attribute attached to the paths).  Currently there is no
> mechanism to say what encoding the in-repo contents are and UTF-8 is
> assumed when conversion from/to w-t-e is required, but there is no
> fundamental reason why it shouldn't be customizable (if anything, as
> a piece of fact on the in-repo data, in-repo-encoding is *more*
> appropriate to be an attribute than w-t-e that can merely be project
> preference at best, as I mentioned earlier in this thread).

Correct.


> We always use the in-repo contents when generating 'diff'.  I think
> by "attribute to be used in diff", what you are really after is
> to convert the in-repo contents to that encoding _BEFORE_ running
> 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> the place will not diff well with the xdiff machinery, but if you
> first convert it to UTF-8 and have xdiff work on it, you can get
> reasonable result out of it.  It is unclear what encoding you want
> your final diff output in (it is equally valid in such a set-up to
> desire your patch output in UTF-16 or UTF-8), but assuming that you
> want UTF-8 in your patch output, perhaps we do not have to break
> gitk users by hijacking the 'encoding' attribute.  Instead what you
> want is a single bit that says between in-repo or working tree which
> representation should be given to the xdiff machinery.

I fear that we could confuse users with an additional knob/bit that
defines what we diff against. Git always diff'ed against in-repo 
content and I feel it should stay that way.

However, I agree with your earlier emails that "working-tree-encoding"
is just one half of the feature. I also think it would be nice to be
able to define the "in-repo-encoding" as well. Then we could define 
something like this:

    *.foo 		text in-repo-encoding=UTF-16LE

This tells Git that the file is stored as UTF-16LE. This would help Git
generate a diff via UTF-8 conversion. I feel that the final patch 
should be in UTF-16LE again. Maybe over time we could then deprecate the
"encoding" attribute as the "in-repo-encoding" attribute serves a similar 
purpose (maybe gitk can switch to it).

In that case we could also do things like this:

    *.bar 		text working-tree-encoding=SHIFT-JIS in-repo-encoding=UTF-16LE

SHIFT-JIS encoded files would be reencoded to UTF-16LE on checkin.
On checkout the opposite would happen. This way we would lift the
"UTF-8 is the only in-repo encoding" limitation of the current w-t-e
implementation.

Does this sound sensible to you? That being said, I think "in-repo-encoding"
would deserve its own series.

- Lars


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-24 15:18             ` Lars Schneider
@ 2018-02-26  1:44               ` Jeff King
  2018-02-26 17:35                 ` Torsten Bögershausen
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-26  1:44 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Torsten Bögershausen, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote:

> > We always use the in-repo contents when generating 'diff'.  I think
> > by "attribute to be used in diff", what you are really after is
> > to convert the in-repo contents to that encoding _BEFORE_ running
> > 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> > the place will not diff well with the xdiff machinery, but if you
> > first convert it to UTF-8 and have xdiff work on it, you can get
> > reasonable result out of it.  It is unclear what encoding you want
> > your final diff output in (it is equally valid in such a set-up to
> > desire your patch output in UTF-16 or UTF-8), but assuming that you
> > want UTF-8 in your patch output, perhaps we do not have to break
> > gitk users by hijacking the 'encoding' attribute.  Instead what you
> > want is a single bit that says between in-repo or working tree which
> > representation should be given to the xdiff machinery.
> 
> I fear that we could confuse users with an additional knob/bit that
> defines what we diff against. Git always diff'ed against in-repo 
> content and I feel it should stay that way.

Well, except for textconv. You can already do this:

  echo "foo diff=utf16" >.gitattributes
  git config diff.utf16.textconv 'iconv -f utf16 -t utf8'

We could make that easier to use and much more efficient by:

  1. Allowing a special syntax for textconv filters that kicks off an
     internal iconv.

  2. Providing baked-in config for utf16.

The patch below provides a sketch. But I think Torsten raised a good
point that you might want the encoding conversion to be independent of
other diff characteristics (so, e.g., you might say "this is utf16 but
once converted treat it like C code for finding funcnames, etc").

---
diff --git a/diff.c b/diff.c
index 21c3838b25..04032e059c 100644
--- a/diff.c
+++ b/diff.c
@@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options *options, const char *pat
 	return pair;
 }
 
+static char *iconv_textconv(const char *encoding, struct diff_filespec *spec,
+			    size_t *outsize)
+{
+	char *ret;
+	int outsize_int; /* this really should be a size_t */
+
+	if (diff_populate_filespec(spec, 0))
+		die("unable to load content for %s", spec->path);
+	ret = reencode_string_len(spec->data, spec->size,
+				  "utf-8", /* should be log_output_encoding? */
+				  encoding, &outsize_int);
+	*outsize = outsize_int;
+	return ret;
+}
+
 static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 		size_t *outsize)
 {
@@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
 	struct strbuf buf = STRBUF_INIT;
 	int err = 0;
 
+	if (skip_prefix(pgm, "iconv:", &pgm))
+		return iconv_textconv(pgm, spec, outsize);
+
 	temp = prepare_temp_file(spec->path, spec);
 	*arg++ = pgm;
 	*arg++ = temp->name;
diff --git a/userdiff.c b/userdiff.c
index dbfb4e13cd..48fa7e8bdd 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -161,6 +161,7 @@ IPATTERN("css",
 	 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
 	 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
 ),
+{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
 { "default", NULL, -1, { NULL, 0 } },
 };
 #undef PATTERNS
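
The dispatch on the "iconv:" prefix in the run_textconv() hunk above can be tried in isolation with the sketch below. skip_prefix_sketch() is a standalone stand-in written for this example and is not Git's own helper; textconv_builtin_encoding() is likewise a hypothetical name. Treating an empty encoding name as "not builtin" is an extra assumption that the patch itself does not make.

```c
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for a skip_prefix()-style helper: if str starts with
 * prefix, store a pointer to the remainder in *out and return 1.
 */
static int skip_prefix_sketch(const char *str, const char *prefix,
			      const char **out)
{
	size_t n = strlen(prefix);

	if (strncmp(str, prefix, n))
		return 0;
	*out = str + n;
	return 1;
}

/*
 * Return the encoding name for a textconv value of the form
 * "iconv:<encoding>", or NULL if the value should be treated as
 * an external program to run.
 */
static const char *textconv_builtin_encoding(const char *pgm)
{
	const char *enc;

	if (skip_prefix_sketch(pgm, "iconv:", &enc) && *enc)
		return enc;
	return NULL;
}
```

With this shape, a value such as "iconv:utf16" selects the in-process conversion, while "iconv -f utf16 -t utf8" still falls through to the existing external-command path.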


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-26  1:44               ` Jeff King
@ 2018-02-26 17:35                 ` Torsten Bögershausen
  2018-02-26 20:46                   ` Jeff King
  0 siblings, 1 reply; 41+ messages in thread
From: Torsten Bögershausen @ 2018-02-26 17:35 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Sun, Feb 25, 2018 at 08:44:46PM -0500, Jeff King wrote:
> On Sat, Feb 24, 2018 at 04:18:36PM +0100, Lars Schneider wrote:
> 
> > > We always use the in-repo contents when generating 'diff'.  I think
> > > by "attribute to be used in diff", what you are really after is
> > > to convert the in-repo contents to that encoding _BEFORE_ running
> > > 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> > > the place will not diff well with the xdiff machinery, but if you
> > > first convert it to UTF-8 and have xdiff work on it, you can get
> > > reasonable result out of it.  It is unclear what encoding you want
> > > your final diff output in (it is equally valid in such a set-up to
> > > desire your patch output in UTF-16 or UTF-8), but assuming that you
> > > want UTF-8 in your patch output, perhaps we do not have to break
> > > gitk users by hijacking the 'encoding' attribute.  Instead what you
> > > want is a single bit that says between in-repo or working tree which
> > > representation should be given to the xdiff machinery.
> > 
> > I fear that we could confuse users with an additional knob/bit that
> > defines what we diff against. Git always diff'ed against in-repo 
> > content and I feel it should stay that way.
> 
> Well, except for textconv. You can already do this:
> 
>   echo "foo diff=utf16" >.gitattributes
>   git config diff.utf16.textconv 'iconv -f utf16 -t utf8'
> 
> We could make that easier to use and much more efficient by:
> 
>   1. Allowing a special syntax for textconv filters that kicks off an
>      internal iconv.
> 
>   2. Providing baked-in config for utf16.
> 
> The patch below provides a sketch. But I think Torsten raised a good
> point that you might want the encoding conversion to be independent of
> other diff characteristics (so, e.g., you might say "this is utf16 but
> once converted treat it like C code for finding funcnames, etc").
> 
> ---
> diff --git a/diff.c b/diff.c
> index 21c3838b25..04032e059c 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -5968,6 +5968,21 @@ struct diff_filepair *diff_unmerge(struct diff_options *options, const char *pat
>  	return pair;
>  }
>  
> +static char *iconv_textconv(const char *encoding, struct diff_filespec *spec,
> +			    size_t *outsize)
> +{
> +	char *ret;
> +	int outsize_int; /* this really should be a size_t */
> +
> +	if (diff_populate_filespec(spec, 0))
> +		die("unable to load content for %s", spec->path);
> +	ret = reencode_string_len(spec->data, spec->size,
> +				  "utf-8", /* should be log_output_encoding? */
> +				  encoding, &outsize_int);
> +	*outsize = outsize_int;
> +	return ret;
> +}
> +
>  static char *run_textconv(const char *pgm, struct diff_filespec *spec,
>  		size_t *outsize)
>  {
> @@ -5978,6 +5993,9 @@ static char *run_textconv(const char *pgm, struct diff_filespec *spec,
>  	struct strbuf buf = STRBUF_INIT;
>  	int err = 0;
>  
> +	if (skip_prefix(pgm, "iconv:", &pgm))
> +		return iconv_textconv(pgm, spec, outsize);
> +
>  	temp = prepare_temp_file(spec->path, spec);
>  	*arg++ = pgm;
>  	*arg++ = temp->name;
> diff --git a/userdiff.c b/userdiff.c
> index dbfb4e13cd..48fa7e8bdd 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -161,6 +161,7 @@ IPATTERN("css",
>  	 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
>  	 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
>  ),
> +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
>  { "default", NULL, -1, { NULL, 0 } },
>  };
>  #undef PATTERNS

The patch looks like a possible step in the right direction -
some minor notes: "utf8" is better written as "UTF-8", when talking
to iconv.h, same for utf16.

But, how do I activate the diff ?
I have in .gitattributes
XXXenglish.txt diff=UTF-16

and in .git/config
[diff "UTF-16"]
      command = iconv:UTF-16


What am I doing wrong ?


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-26 17:35                 ` Torsten Bögershausen
@ 2018-02-26 20:46                   ` Jeff King
  2018-02-27 21:05                     ` Torsten Bögershausen
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-26 20:46 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Mon, Feb 26, 2018 at 06:35:33PM +0100, Torsten Bögershausen wrote:

> > diff --git a/userdiff.c b/userdiff.c
> > index dbfb4e13cd..48fa7e8bdd 100644
> > --- a/userdiff.c
> > +++ b/userdiff.c
> > @@ -161,6 +161,7 @@ IPATTERN("css",
> >  	 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
> >  	 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
> >  ),
> > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
> >  { "default", NULL, -1, { NULL, 0 } },
> >  };
> >  #undef PATTERNS
> 
> The patch looks like a possible step in the right direction -
> some minor notes: "utf8" is better written as "UTF-8", when talking
> to iconv.h, same for utf16.
> 
> But, how do I activate the diff ?
> I have in .gitattributes
> XXXenglish.txt diff=UTF-16
> 
> and in .git/config
> [diff "UTF-16"]
>       command = iconv:UTF-16
> 
> 
> What am I doing wrong ?

After applying the patch, if I do:

  git init
  echo hello | iconv -f utf8 -t utf16 >file
  git add file
  git commit -m one
  echo goodbye | iconv -f utf8 -t utf16 >file
  git add file
  git commit -m two

then:

  git log -p

shows "binary files differ" but:

  echo "file diff=utf16" >.gitattributes
  git log -p

shows text diffs. I assume you tweaked the patch before switching to
the UTF-16 spelling in your example. Did you use a plumbing command to
show the diff? textconv isn't enabled for plumbing, because the
resulting patches cannot actually be applied (in that sense an encoding
switch is potentially special, since in theory one could convert to the
canonical text format, apply the patch, and then convert back).

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-26 20:46                   ` Jeff King
@ 2018-02-27 21:05                     ` Torsten Bögershausen
  2018-02-27 21:25                       ` Jeff King
  0 siblings, 1 reply; 41+ messages in thread
From: Torsten Bögershausen @ 2018-02-27 21:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Mon, Feb 26, 2018 at 03:46:35PM -0500, Jeff King wrote:
> On Mon, Feb 26, 2018 at 06:35:33PM +0100, Torsten Bögershausen wrote:
> 
> > > diff --git a/userdiff.c b/userdiff.c
> > > index dbfb4e13cd..48fa7e8bdd 100644
> > > --- a/userdiff.c
> > > +++ b/userdiff.c
> > > @@ -161,6 +161,7 @@ IPATTERN("css",
> > >  	 "-?[_a-zA-Z][-_a-zA-Z0-9]*" /* identifiers */
> > >  	 "|-?[0-9]+|\\#[0-9a-fA-F]+" /* numbers */
> > >  ),
> > > +{ "utf16", NULL, -1, { NULL, 0 }, NULL, "iconv:utf16" },
> > >  { "default", NULL, -1, { NULL, 0 } },
> > >  };
> > >  #undef PATTERNS
> > 
> > The patch looks like a possible step in the right direction -
> > some minor notes: "utf8" is better written as "UTF-8", when talking
> > to iconv.h, same for utf16.
> > 
> > But, how do I activate the diff ?
> > I have in .gitattributes
> > XXXenglish.txt diff=UTF-16
> > 
> > and in .git/config
> > [diff "UTF-16"]
> >       command = iconv:UTF-16
> > 
> > 
> > What am I doing wrong ?
> 
> After applying the patch, if I do:
> 
>   git init
>   echo hello | iconv -f utf8 -t utf16 >file
>   git add file
>   git commit -m one
>   echo goodbye | iconv -f utf8 -t utf16 >file
>   git add file
>   git commit -m two
> 
> then:
> 
>   git log -p
> 
> shows "binary files differ" but:
> 
>   echo "file diff=utf16" >.gitattributes
>   git log -p
> 
> shows text diffs. I assume you tweaked the patch before switching to
> the UTF-16 spelling in your example. Did you use a plumbing command to
> show the diff? textconv isn't enabled for plumbing, because the
> resulting patches cannot actually be applied (in that sense an encoding
> switch is potentially special, since in theory one could convert to the
> canonical text format, apply the patch, and then convert back).
> 
> -Peff

Thanks for helping me out.
I didn't use "git log -p", but a simple "git diff".
(And after re-using utf16 with lowercase, it works as you described it)

I wasn't aware of "git log -p", something learned (or re-learned)

The other question is:
Would this help showing diffs of UTF-16 encoded files on a "git hoster",
github/bitbucket/.... ?

Or would the auto-magic UTF-16 "avoid binary" patch that I sent out be more helpful?
Or both ?
Or the w-t-e encoding ?

Questions over questions.



* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 21:05                     ` Torsten Bögershausen
@ 2018-02-27 21:25                       ` Jeff King
  2018-02-27 21:55                         ` Junio C Hamano
                                           ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Jeff King @ 2018-02-27 21:25 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote:

> The other question is:
> Would this help showing diffs of UTF-16 encoded files on a "git hoster",
> github/bitbucket/.... ?

Almost. There's probably one more thing needed. We don't currently read
in-tree .gitattributes when doing a diff in a bare repository. And most
hosting sites will store bare repositories.

And of course it would require the users to actually set the attributes
themselves.

> Or would the auto-magic UTF-16 "avoid binary" patch that I sent out be more helpful?
> Or both ?
> Or the w-t-e encoding ?

Of the three solutions, I think the relative merits are something like
this:

  1. baked-in textconv (my patch)

     - reuses an existing diff feature, so minimal code and not likely to
       break things

     - requires people to add a .gitattributes entry

     - needs work to make bare-repo .gitattributes work (though I think
       this would be useful for other features, too)

     - has a run-time cost at each diff to do the conversion

     - may sometimes annoy people when it doesn't kick in (e.g.,
       emailed patches from format-patch won't have a readable diff)

     - doesn't combine with other custom-diff config (e.g., utf-16
       storing C code should still use diff=c funcname rules, but
       wouldn't with my patch)

  2. auto-detect utf-16 (your patch)

     - Just Works for existing repositories storing utf-16

     - carries some risk of kicking in when people would like it not to
       (e.g., when they really do want a binary patch that can be
       applied).

       I think it would probably be OK if this kicked in only when
       ALLOW_TEXTCONV is set (the default for porcelain), and --binary
       is not (i.e., when we would otherwise just say "binary
       files differ").

     - similar to (1), carries a run-time cost for each diff, and users
       may sometimes still see binary diffs

  3. w-t-e (Lars's patch)

     - requires no server-side modifications; the diff is plain vanilla

     - works everywhere you diff, plumbing and porcelain

     - does require people to add a .gitattributes entry

     - run-time cost is per-checkout, not per-diff

So I can see room for (3) to co-exist alongside the others. Between (1)
and (2), I think (2) is probably the better direction.
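
To make the "looks binary" problem concrete, here is a rough Python
sketch. It is not Git's code: looks_binary() and to_utf8_for_diff() are
names made up for illustration, and the 8000-byte window is an
assumption of the sketch. It only shows that UTF-16 text is full of NUL
bytes, which is what a NUL-based binary heuristic trips over, and that
the same text re-encoded as UTF-8 is not.

```python
# Illustrative only: mimics the spirit of a "is this binary?" check that
# looks for a NUL byte near the start of the content.
FIRST_FEW_BYTES = 8000  # assumed window size for this sketch


def looks_binary(data: bytes) -> bool:
    """Return True if a NUL byte appears in the first FIRST_FEW_BYTES."""
    return b"\0" in data[:FIRST_FEW_BYTES]


def to_utf8_for_diff(data: bytes, encoding: str = "utf-16") -> bytes:
    """Hypothetical textconv-like step: re-encode content as UTF-8."""
    return data.decode(encoding).encode("utf-8")


text = "int main(void)\n{\n\treturn 0;\n}\n"
utf16_blob = text.encode("utf-16")  # BOM plus two bytes per ASCII character

print(looks_binary(utf16_blob))                    # NUL bytes are present
print(looks_binary(to_utf8_for_diff(utf16_blob)))  # plain UTF-8 text
```

Both (1) and (2) boil down to running something like to_utf8_for_diff()
at diff time; they differ in whether an attribute or a content check
decides when to do it.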

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 21:25                       ` Jeff King
@ 2018-02-27 21:55                         ` Junio C Hamano
  2018-02-27 21:58                           ` Jeff King
  2018-02-28  8:20                         ` Torsten Bögershausen
  2018-02-28 20:46                         ` Lars Schneider
  2 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2018-02-27 21:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Torsten Bögershausen, Lars Schneider, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

Jeff King <peff@peff.net> writes:

> Of the three solutions, I think the relative merits are something like
> this:
> ...
>   3. w-t-e (Lars's patch)

I thought Lars's w-t-e was about keeping the in-repo contents in
UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it
won't help the issue hosting folks want to deal with, i.e. showing
in-repo data that is stored in a strange binary-looking encoding in
a more reasonable encoding while diffing, no?

Usually we work only in the in-repo encoding when producing a diff and
show the result in that same encoding, but I can imagine a new
attribute that, when set, makes us first convert the in-repo data to
the specified encoding before passing the result to the xdiff
machinery.  We would then convert the output back to the in-repo
encoding before showing the diff (or just show the result in the
encoding the xdiff machinery processed---I do not know which one
should be the default).
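
A rough sketch of that idea in Python, purely to illustrate the data
flow. Nothing here is Git code: diff_in_encoding() and its parameters
are hypothetical, and difflib stands in for the xdiff machinery.

```python
import difflib


def diff_in_encoding(old_blob: bytes, new_blob: bytes,
                     repo_encoding: str = "utf-16",
                     output_encoding: str = "utf-8") -> bytes:
    """Decode both blobs from the in-repo encoding, diff them as text,
    and return the unified diff encoded in output_encoding."""
    old_lines = old_blob.decode(repo_encoding).splitlines(keepends=True)
    new_lines = new_blob.decode(repo_encoding).splitlines(keepends=True)
    diff = difflib.unified_diff(old_lines, new_lines, "a/file", "b/file")
    return "".join(diff).encode(output_encoding)


old = "hello\nworld\n".encode("utf-16")
new = "hello\nthere\n".encode("utf-16")

# Show the diff in the encoding that was fed to the diff machinery.
patch = diff_in_encoding(old, new)
print(patch.decode("utf-8"))

# Or convert the result back to the in-repo encoding before showing it.
patch_in_repo_encoding = diff_in_encoding(old, new, output_encoding="utf-16")
```

The output_encoding parameter is the "which one should be the default"
knob mentioned above.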


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 21:55                         ` Junio C Hamano
@ 2018-02-27 21:58                           ` Jeff King
  2018-02-27 22:10                             ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff King @ 2018-02-27 21:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Tue, Feb 27, 2018 at 01:55:02PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Of the three solutions, I think the relative merits are something like
> > this:
> > ...
> >   3. w-t-e (Lars's patch)
> 
> I thought Lars's w-t-e was about keeping the in-repo contents in
> UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it
> won't help the issue hosting folks want to deal with, i.e. showing
> in-repo data that is stored in a strange binary-looking encoding in
> a more reasonable encoding while diffing, no?

I thought it solved that by the hosting folks never seeing the strange
binary-looking data. They see only utf8, which diffs well.

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 21:58                           ` Jeff King
@ 2018-02-27 22:10                             ` Junio C Hamano
  2018-02-27 22:20                               ` Jeff King
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2018-02-27 22:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Torsten Bögershausen, Lars Schneider, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

Jeff King <peff@peff.net> writes:

> On Tue, Feb 27, 2018 at 01:55:02PM -0800, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> > Of the three solutions, I think the relative merits are something like
>> > this:
>> > ...
>> >   3. w-t-e (Lars's patch)
>> 
>> I thought Lars's w-t-e was about keeping the in-repo contents in
>> UTF-8 and externalize in whatever encoding (e.g. UTF-16), so it
>> won't help the issue hosting folks want to deal with, i.e. showing
>> in-repo data that is stored in a strange binary-looking encoding in
>> a more reasonable encoding while diffing, no?
>
> I thought it solved that by the hosting folks never seeing the strange
> binary-looking data. They see only utf8, which diffs well.

Ah, OK, that is a "fix" in a wider context (in a narrower context,
"work around" is a more appropriate term ;-).

The reason why I have been nudging people toward considering in-repo
encoding attribute is because forcing projects that already have
their contents in a strange binary-looking encoding to switch is
costly.  But perhaps having them pay one-time conversion pain is a
better investment longer term.
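
For reference, the w-t-e model being discussed (UTF-8 in the
repository, some other encoding in the working tree) can be sketched
like this in Python. This is a simplified illustration, not the code
from the series: clean(), smudge() and roundtrips() are names chosen
for the sketch, and Python's codecs stand in for iconv. The last
function mirrors the spirit of the round trip check in patch 7/7.

```python
import codecs
import sys


def clean(worktree_bytes: bytes, worktree_encoding: str) -> bytes:
    """Working tree -> repository: re-encode to UTF-8 (what gets stored)."""
    return worktree_bytes.decode(worktree_encoding).encode("utf-8")


def smudge(repo_bytes: bytes, worktree_encoding: str) -> bytes:
    """Repository -> working tree: re-encode UTF-8 for checkout."""
    return repo_bytes.decode("utf-8").encode(worktree_encoding)


def roundtrips(worktree_bytes: bytes, worktree_encoding: str) -> bool:
    """True if clean followed by smudge reproduces the original bytes."""
    stored = clean(worktree_bytes, worktree_encoding)
    return smudge(stored, worktree_encoding) == worktree_bytes


text = "hello\n"
native = text.encode("utf-16")  # BOM followed by native byte order

# Same text, but with a BOM and byte order opposite to this machine's.
if sys.byteorder == "little":
    swapped = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
else:
    swapped = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(clean(native, "utf-16"))        # the repository only ever sees UTF-8
print(roundtrips(native, "utf-16"))   # byte-identical after a round trip
print(roundtrips(swapped, "utf-16"))  # same text, different bytes
```

The last case is the kind of thing a one-time conversion has to watch
for: the text survives, but the bytes in the working tree change.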


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 22:10                             ` Junio C Hamano
@ 2018-02-27 22:20                               ` Jeff King
  0 siblings, 0 replies; 41+ messages in thread
From: Jeff King @ 2018-02-27 22:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Tue, Feb 27, 2018 at 02:10:20PM -0800, Junio C Hamano wrote:

> > I thought it solved that by the hosting folks never seeing the strange
> > binary-looking data. They see only utf8, which diffs well.
> 
> Ah, OK, that is a "fix" in a wider context (in a narrower context,
> "work around" is a more appropriate term ;-).
> 
> The reason why I have been nudging people toward considering in-repo
> encoding attribute is because forcing projects that already have
> their contents in a strange binary-looking encoding to switch is
> costly.  But perhaps having them pay one-time conversion pain is a
> better investment longer term.

Yeah, thanks for mentioning that. It should have gone in my "relative
merits" list. The conversion flag-day is definitely going to be a pain
for users, and doesn't help with diffing older versions.

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 21:25                       ` Jeff King
  2018-02-27 21:55                         ` Junio C Hamano
@ 2018-02-28  8:20                         ` Torsten Bögershausen
  2018-02-28 13:21                           ` Jeff King
  2018-02-28 20:46                         ` Lars Schneider
  2 siblings, 1 reply; 41+ messages in thread
From: Torsten Bögershausen @ 2018-02-28  8:20 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Tue, Feb 27, 2018 at 04:25:38PM -0500, Jeff King wrote:
> On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote:
> 
> > The other question is:
> > Would this help showing diffs of UTF-16 encoded files on a "git hoster",
> > github/bitbucket/.... ?
> 
> Almost. There's probably one more thing needed. We don't currently read
> in-tree .gitattributes when doing a diff in a bare repository. And most
> hosting sites will store bare repositories.
> 
> And of course it would require the users to actually set the attributes
> themselves.
> 
> > Or would the auto-magic UTF-16 "avoid binary" patch that I sent out be more helpful ?
> > Or both ?
> > Or the w-t-e encoding ?
> 
> Of the three solutions, I think the relative merits are something like
> this:
> 
>   1. baked-in textconv (my patch)
> 
>      - reuses an existing diff feature, so minimal code and not likely to
>        break things
> 
>      - requires people to add a .gitattributes entry
> 
>      - needs work to make bare-repo .gitattributes work (though I think
>        this would be useful for other features, too)
> 
>      - has a run-time cost at each diff to do the conversion
> 
>      - may sometimes annoy people when it doesn't kick in (e.g.,
>        emailed patches from format-patch won't have a readable diff)
> 
>      - doesn't combine with other custom-diff config (e.g., utf-16
>        storing C code should still use diff=c funcname rules, but
>        wouldn't with my patch)
> 
>   2. auto-detect utf-16 (your patch)
>      - Just Works for existing repositories storing utf-16
> 
>      - carries some risk of kicking in when people would like it not to
>        (e.g., when they really do want a binary patch that can be
>        applied).

The binary patch is still supported, but that detail may need more explanation
in the commit message. Please see t4066-diff-encoding.sh:
test_expect_success 'diff --binary against local change' '
	cp file2 file &&
	test_tick &&
	cat >expect <<-\EOF &&
	diff --git a/file b/file
	index 26acf09b0aad19fb22566956d1a39cb4e2a3b420..e98d27acfb90cfcfc84fcc5173baa4aa7828290f 100644
	GIT binary patch
	literal 28
	ecmezW?;ArgLn;Fo!ykquAe{qbJq3!C0BHb{ln3Pi

	literal 32
	icmezW?+HT@Lpnn$kmO?c#!w7oaWVX1NCMJ1Ko$VA_z0~4

	EOF
	git diff --binary file >actual &&
	test_cmp expect actual
'

> 
>        I think it would probably be OK if this kicked in only when
>        ALLOW_TEXTCONV is set (the default for porcelain), and --binary
>        is not (i.e., when we would otherwise just say "binary
>        files differ").

The user can still use "git diff", where auto-detection of UTF-16 kicks in
and replaces "binary files differ" with a UTF-8 diff.
When the user wants a patch, "git diff --binary" will generate a binary patch,
as before.
The only thing missing is the line "binary files differ", which may be a
regression. I can re-add it in V2.
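
Spelled out as a small decision table in Python (a sketch of the
behavior described in this thread, not the patch itself; the function
name and its arguments are invented for illustration), for a file that
would otherwise be treated as binary:

```python
def output_for_binary_looking_file(is_utf16: bool,
                                   allow_textconv: bool,
                                   binary_flag: bool) -> str:
    """Pick what the diff shows for content that looks binary."""
    if binary_flag:
        # --binary always wins: emit an applicable binary patch.
        return "binary patch"
    if is_utf16 and allow_textconv:
        # Auto-detection kicks in only where textconv would be allowed.
        return "utf-8 text diff"
    # Plumbing, or content that is not UTF-16: same as today.
    return "binary files differ"


for args in [(True, True, False), (True, True, True),
             (True, False, False), (False, True, False)]:
    print(args, "->", output_for_binary_looking_file(*args))
```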

> 
>      - similar to (1), carries a run-time cost for each diff, and users
>        may sometimes still see binary diffs
> 
>   3. w-t-e (Lars's patch)
> 
>      - requires no server-side modifications; the diff is plain vanilla
> 
>      - works everywhere you diff, plumbing and porcelain
> 
>      - does require people to add a .gitattributes entry
> 
>      - run-time cost is per-checkout, not per-diff
> 
> So I can see room for (3) to co-exist alongside the others. Between (1)
> and (2), I think (2) is probably the better direction.
> 
> -Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-28  8:20                         ` Torsten Bögershausen
@ 2018-02-28 13:21                           ` Jeff King
  2018-02-28 17:42                             ` Junio C Hamano
  2018-03-04 10:16                             ` Torsten Bögershausen
  0 siblings, 2 replies; 41+ messages in thread
From: Jeff King @ 2018-02-28 13:21 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Wed, Feb 28, 2018 at 09:20:05AM +0100, Torsten Bögershausen wrote:

> >   2. auto-detect utf-16 (your patch)
> >      - Just Works for existing repositories storing utf-16
> > 
> >      - carries some risk of kicking in when people would like it not to
> >        (e.g., when they really do want a binary patch that can be
> >        applied).
> 
> The binary patch is still supported, but that detail may need some more explanation
> in the commit message. Please see  t4066-diff-encoding.sh

Yeah, but if you don't have binary-patches enabled we'd generate a bogus
patch. Which, granted, without that you wouldn't be able to apply the
patch either. But somehow it feels funny to me to generate something
that _looks_ like a patch but you can't actually apply.

I also think we'd want a plan for this to be used consistently in other
diff-like tools. E.g., "git blame" uses textconv for the starting file
content, and it would be nice for this to kick in then, too. Ditto for
things like grep, pickaxe, etc.

I have some patches that reuse some of the textconv infrastructure for
this, which should mostly make it "just work" everywhere. They need a
little more polishing before I post them, but you can take a look at:

  https://github.com/peff/git.git jk/textconv-utf16

if you want.

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-28 13:21                           ` Jeff King
@ 2018-02-28 17:42                             ` Junio C Hamano
  2018-03-01  7:49                               ` Jeff King
  2018-03-04 10:16                             ` Torsten Bögershausen
  1 sibling, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2018-02-28 17:42 UTC (permalink / raw)
  To: Jeff King
  Cc: Torsten Bögershausen, Lars Schneider, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

Jeff King <peff@peff.net> writes:

>> The binary patch is still supported, but that detail may need some more explanation
>> in the commit message. Please see  t4066-diff-encoding.sh
>
> Yeah, but if you don't have binary-patches enabled we'd generate a bogus
> patch. Which, granted, without that you wouldn't be able to apply the
> patch either. But somehow it feels funny to me to generate something
> that _looks_ like a patch but you can't actually apply.

True.  And at least you _could_ apply a properly formatted binary
patch to the original.

> I also think we'd want a plan for this to be used consistently in other
> diff-like tools. E.g., "git blame" uses textconv for the starting file
> content, and it would be nice for this to kick in then, too. Ditto for
> things like grep, pickaxe, etc.

You probably do not want to limit your thinking to the generation
side.  It is entirely plausible to have "we deal with diff in this
encoding X" in addition to "the in-repo encoding for this project is
this encoding Y" and "the working tree encoding for this path is Z"
and allow them to interact in "git diff | git apply" pipeline.

"diff/format-patch --stdout/etc." on the upstream would first iconv
Y to X and feed the contents in X to xdiff machinery, which is sent
down the pipe and received by apply, which reads the preimage from
the disk or from the repository.  If doing "apply" without
"--cached/--index", the preimage data from the disk would go through
iconv Z to X.  If doing "apply --cached/--index", the preimage data
from the repo would go through iconv Y to X.  The incoming patch is
in X, so we apply, and the resulting postimage will be re-encoded in
Z in the working tree and Y in the repository.
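
A toy model of that pipeline in Python, to show where each conversion
would sit. Everything here is hypothetical: make_patch() and
apply_patch() are invented names, the "patch" is just a list of line
ranges with replacement text serialized in X, and Python strings play
the role of the common form that both sides convert to.

```python
import difflib


def make_patch(old_blob, new_blob, repo_enc, diff_enc):
    """Upstream side: decode from Y, diff, serialize new text in X."""
    old = old_blob.decode(repo_enc).splitlines(keepends=True)
    new = new_blob.decode(repo_enc).splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, "".join(new[j1:j2]).encode(diff_enc))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_patch(preimage_blob, preimage_enc, patch, diff_enc):
    """Receiving side: decode the preimage (Z or Y), apply the changes
    carried in X, and re-encode the postimage as the preimage was."""
    lines = preimage_blob.decode(preimage_enc).splitlines(keepends=True)
    # Apply from the end so earlier line numbers stay valid.
    for i1, i2, replacement in reversed(patch):
        lines[i1:i2] = replacement.decode(diff_enc).splitlines(keepends=True)
    return "".join(lines).encode(preimage_enc)


X, Y, Z = "utf-8", "utf-16", "iso-8859-1"  # diff, in-repo, working tree
old_text = "grüß\nwelt\n"
new_text = "grüß\ndich\n"

patch = make_patch(old_text.encode(Y), new_text.encode(Y), Y, X)

# "apply" without --cached/--index: the preimage comes from disk, in Z.
worktree_post = apply_patch(old_text.encode(Z), Z, patch, X)
# "apply --cached": the preimage comes from the repository, in Y.
index_post = apply_patch(old_text.encode(Y), Y, patch, X)
```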


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-27 21:25                       ` Jeff King
  2018-02-27 21:55                         ` Junio C Hamano
  2018-02-28  8:20                         ` Torsten Bögershausen
@ 2018-02-28 20:46                         ` Lars Schneider
  2 siblings, 0 replies; 41+ messages in thread
From: Lars Schneider @ 2018-02-28 20:46 UTC (permalink / raw)
  To: Jeff King
  Cc: Torsten Bögershausen, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin


> On 27 Feb 2018, at 22:25, Jeff King <peff@peff.net> wrote:
> 
> On Tue, Feb 27, 2018 at 10:05:17PM +0100, Torsten Bögershausen wrote:
> 
> Of the three solutions, I think the relative merits are something like
> this:
> 
>  1. baked-in textconv (my patch)
> 
>     - reuses an existing diff feature, so minimal code and not likely to
>       break things
> 
>     - requires people to add a .gitattributes entry
> 
>     - needs work to make bare-repo .gitattributes work (though I think
>       this would be useful for other features, too)
> 
>     - has a run-time cost at each diff to do the conversion
> 
>     - may sometimes annoy people when it doesn't kick in (e.g.,
>       emailed patches from format-patch won't have a readable diff)
> 
>     - doesn't combine with other custom-diff config (e.g., utf-16
>       storing C code should still use diff=c funcname rules, but
>       wouldn't with my patch)
> 
>  2. auto-detect utf-16 (your patch)
>     - Just Works for existing repositories storing utf-16
> 
>     - carries some risk of kicking in when people would like it not to
>       (e.g., when they really do want a binary patch that can be
>       applied).
> 
>       I think it would probably be OK if this kicked in only when
>       ALLOW_TEXTCONV is set (the default for porcelain), and --binary
>       is not (i.e., when we would otherwise just say "binary
>       files differ").
> 
>     - similar to (1), carries a run-time cost for each diff, and users
>       may sometimes still see binary diffs
> 
>  3. w-t-e (Lars's patch)
> 
>     - requires no server-side modifications; the diff is plain vanilla
> 
>     - works everywhere you diff, plumbing and porcelain
> 
>     - does require people to add a .gitattributes entry
> 
>     - run-time cost is per-checkout, not per-diff
> 
> So I can see room for (3) to co-exist alongside the others. Between (1)
> and (2), I think (2) is probably the better direction.

Thanks for the great summary! I agree they could co-exist and people
could use whatever works best for them.

I'll incorporate Eric's feedback and send a w-t-e v9 soonish.

- Lars




* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-28 17:42                             ` Junio C Hamano
@ 2018-03-01  7:49                               ` Jeff King
  0 siblings, 0 replies; 41+ messages in thread
From: Jeff King @ 2018-03-01  7:49 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On Wed, Feb 28, 2018 at 09:42:27AM -0800, Junio C Hamano wrote:

> > I also think we'd want a plan for this to be used consistently in other
> > diff-like tools. E.g., "git blame" uses textconv for the starting file
> > content, and it would be nice for this to kick in then, too. Ditto for
> > things like grep, pickaxe, etc.
> 
> You probably do not want to limit your thinking to the generation
> side.  It is entirely plausible to have "we deal with diff in this
> encoding X" in addition to "the in-repo encoding for this project is
> this encoding Y" and "the working tree encoding for this path is Z"
> and allow them to interact in "git diff | git apply" pipeline.
> 
> "diff/format-patch --stdout/etc." on the upstream would first iconv
> Y to X and feed the contents in X to xdiff machinery, which is sent
> down the pipe and received by apply, which reads the preimage from
> the disk or from the repository.  If doing "apply" without
> "--cached/--index", the preimage data from the disk would go through
> iconv Z to X.  If doing "apply --cached/--index", the preimage data
> from the repo would go through iconv Y to X.  The incoming patch is
> in X, so we apply, and the resulting postimage will be re-encoded in
> Z in the working tree and Y in the repository.

I agree that would be convenient, but I have to wonder if all the
complexity is worth it to maintain the idea of a distinct in-repo
representation. It seems like it would open up a ton of corner cases.
And I suspect most people would be happy enough with either a
clean/smudge style worktree conversion or a textconv-style view.

So if somebody wants to work on it, I don't want to stop them. But I
think there's room for the simpler solutions in the meantime.

-Peff


* Re: [PATCH v7 0/7] convert: add support for different encodings
  2018-02-28 13:21                           ` Jeff King
  2018-02-28 17:42                             ` Junio C Hamano
@ 2018-03-04 10:16                             ` Torsten Bögershausen
  1 sibling, 0 replies; 41+ messages in thread
From: Torsten Bögershausen @ 2018-03-04 10:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, Junio C Hamano, Lars Schneider, git,
	Johannes Sixt, Eric Sunshine, ramsay, Johannes.Schindelin

On 2018-02-28 14:21, Jeff King wrote:
> On Wed, Feb 28, 2018 at 09:20:05AM +0100, Torsten Bögershausen wrote:
> 
>>>   2. auto-detect utf-16 (your patch)
>>>      - Just Works for existing repositories storing utf-16
>>>
>>>      - carries some risk of kicking in when people would like it not to
>>>        (e.g., when they really do want a binary patch that can be
>>>        applied).
>>
>> The binary patch is still supported, but that detail may need some more explanation
>> in the commit message. Please see  t4066-diff-encoding.sh
> 
> Yeah, but if you don't have binary-patches enabled we'd generate a bogus
> patch. Which, granted, without that you wouldn't be able to apply the
> patch either. But somehow it feels funny to me to generate something
> that _looks_ like a patch but you can't actually apply.
> 
> I also think we'd want a plan for this to be used consistently in other
> diff-like tools. E.g., "git blame" uses textconv for the starting file
> content, and it would be nice for this to kick in then, too. Ditto for
> things like grep, pickaxe, etc.
> 
> I have some patches that reuse some of the textconv infrastructure for
> this, which should mostly make it "just work" everywhere. They need a
> little more polishing before I post them, but you can take a look at:
> 
>   https://github.com/peff/git.git jk/textconv-utf16
> 
> if you want.
> 
> -Peff
> 

Thanks for your work (I actually found some time to take a look).

I am looking at the code to put 2 or 3 things on top of it:
- test case(s)
- documentation
- teach diff to add a line "b is converted to UTF-8 from UTF-16"
- teach apply to read and understand the encoding line and throw
  in a reencode_string_len(), like your patch does

This would keep "git diff | git apply" happy.
All in all the changes do not look too invasive, at least from my point of view.
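
Roughly what I have in mind for the apply side, as a Python sketch
rather than C. The exact wording of the marker line is not settled, so
the pattern below is only an assumption, and source_encoding() and
reencode_postimage() are made-up names; in Git the re-encoding step
would be the reencode_string_len() call.

```python
import re
from typing import Optional

# Assumed marker format, taken from the wording above; not final.
MARKER = re.compile(r"^b is converted to UTF-8 from (\S+)$")


def source_encoding(patch_text: str) -> Optional[str]:
    """Return the original encoding named in the patch, if any."""
    for line in patch_text.splitlines():
        match = MARKER.match(line)
        if match:
            return match.group(1)
    return None


def reencode_postimage(postimage_utf8: bytes, patch_text: str) -> bytes:
    """Re-encode the UTF-8 result of applying the patch back into the
    encoding the marker line says the file is stored in."""
    encoding = source_encoding(patch_text)
    if encoding is None:
        return postimage_utf8  # ordinary patch, nothing to convert
    return postimage_utf8.decode("utf-8").encode(encoding)


patch_text = (
    "diff --git a/file b/file\n"
    "b is converted to UTF-8 from UTF-16\n"
    "--- a/file\n"
    "+++ b/file\n"
)
result = reencode_postimage("hello\n".encode("utf-8"), patch_text)
print(source_encoding(patch_text))
```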





