[PATCH v8 0/7] convert: add support for different encodings

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v8 0/7] convert: add support for different encodings
@ 2018-02-24 16:27 lars.schneider
  2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v7:

* make it clearer in the documentation that Git stores content "as-is"
  by default. Content is only stored in UTF-8 if w-t-e is used (Junio)
* add test case for $GIT_DIR/info/attributes support (Junio)

Thanks,
Lars

  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@autodesk.com/
   v7: https://public-inbox.org/git/20180215152711.158-1-lars.schneider@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2758a2da29
Checkout: git fetch https://github.com/larsxschneider/git encoding-v8 && git checkout 2758a2da29


### Interdiff (v7..v8):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 10cb37795d..11315054f4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -275,11 +275,11 @@ few exceptions.  Even though...
 `working-tree-encoding`
 ^^^^^^^^^^^^^^^^^^^^^^^

-Git recognizes files encoded with ASCII or one of its supersets (e.g.
-UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
-interpreted as binary and consequently built-in Git text processing
-tools (e.g. 'git diff') as well as most Git web front ends do not
-visualize the content.
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.

 In these cases you can tell Git the encoding of a file in the working
 directory with the `working-tree-encoding` attribute. If a file with this
@@ -291,12 +291,24 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Third party Git implementations that do not support the
-  `working-tree-encoding` attribute will checkout the respective files
-  UTF-8 encoded and not in the expected encoding. Consequently, these
-  files will appear different which typically causes trouble. This is
-  in particular the case for older Git versions and alternative Git
-  implementations such as JGit or libgit2 (as of February 2018).
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e4717402a5..e34c21eb29 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -13,8 +13,11 @@ test_expect_success 'setup test repo' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
 	cp test.utf16.raw test.utf16 &&
+	cp test.utf32.raw test.utf32 &&

+	# Add only UTF-16 file, we will add the UTF-32 file later
 	git add .gitattributes test.utf16 &&
 	git commit -m initial
 '
@@ -24,7 +27,7 @@ test_expect_success 'ensure UTF-8 is stored in Git' '
 	test_cmp_bin test.utf8.raw test.utf16.git &&

 	# cleanup
-	rm test.utf8.raw test.utf16.git
+	rm test.utf16.git
 '

 test_expect_success 're-encode to UTF-16 on checkout' '
@@ -36,6 +39,19 @@ test_expect_success 're-encode to UTF-16 on checkout' '
 	rm test.utf16.raw
 '

+test_expect_success 'check $GIT_DIR/info/attributes support' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+
+	git add test.utf32 &&
+
+	git cat-file -p :test.utf32 >test.utf32.git &&
+	test_cmp_bin test.utf8.raw test.utf32.git &&
+
+	# cleanup
+	git reset --hard HEAD &&
+	rm test.utf8.raw test.utf32.raw test.utf32.git
+'
+
 test_expect_success 'check prohibited UTF BOM' '
 	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
 	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&


### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  86 +++++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 269 +++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 700 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
@ 2018-02-24 16:27 ` lars.schneider
  2018-02-24 16:27 ` [PATCH v8 2/7] strbuf: add xstrdup_toupper() lars.schneider
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 2/7] strbuf: add xstrdup_toupper()
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
  2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-02-24 16:27 ` lars.schneider
  2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
  2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-02-24 16:27 ` [PATCH v8 2/7] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-02-24 16:27 ` lars.schneider
  2018-02-25  3:41   ` Eric Sunshine
  2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-02-24 16:27 ` lars.schneider
  2018-02-25  3:52   ` Eric Sunshine
  2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
@ 2018-02-24 16:27 ` lars.schneider
  2018-02-25  7:15   ` Eric Sunshine
  2018-02-24 16:28 ` [PATCH v8 6/7] convert: add tracing for " lars.schneider
  2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  6 siblings, 1 reply; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:27 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt  |  78 ++++++++++++++
 convert.c                        | 157 ++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 226 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 462 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..eddeaee1f7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,84 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+------------------------
+*.proj		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.proj' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+------------------------
+*.proj 		text working-tree-encoding=UTF-16LE eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.proj
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..d20c341b6d 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited in '%s' if encoded as %s");
+		/*
+		 * This advise is shown for UTF-??BE and UTF-??LE encodings.
+		 * We truncate the encoding name to 6 chars with %.6s to cut
+		 * off the last two "byte order" characters.
+		 */
+		const char *advise_msg = _(
+			"The file '%s' contains a byte order mark (BOM). "
+			"Please use %.6s as working-tree-encoding.");
+		advise(advise_msg, path, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required in '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"The file '%s' is missing a byte order mark (BOM). "
+			"Please use %sBE or %sLE (depending on the byte order) "
+			"as working-tree-encoding.");
+		advise(advise_msg, path, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(*enc));
+	/*
+	 * Ensure encoding names are always upper case (e.g. UTF-8) to
+	 * simplify subsequent string comparisons.
+	 */
+	enc->name = xstrdup_toupper(value);
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1167,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1176,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1064,6 +1201,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1282,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1312,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1344,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1816,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.working_tree_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada9..e2f319d677 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 0000000000..a55bbcb6d3
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,226 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	cp test.utf16.raw test.utf16 &&
+	cp test.utf32.raw test.utf32 &&
+
+	# Add only UTF-16 file, we will add the UTF-32 file later
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+
+	# cleanup
+	rm test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check $GIT_DIR/info/attributes support' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+
+	git add test.utf32 &&
+
+	git cat-file -p :test.utf32 >test.utf32.git &&
+	test_cmp_bin test.utf8.raw test.utf32.git &&
+
+	# cleanup
+	git reset --hard HEAD &&
+	rm test.utf8.raw test.utf32.raw test.utf32.git
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	cat >expectIndexLF <<-\EOF &&
+		i/lf    w/-text attr/text             	eol.utf16
+	EOF
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# Although the file has CRLF in the working tree, ensure LF in the index
+	git ls-files --eol eol.utf16 >actual &&
+	test_cmp expectIndexLF actual &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	# The file LF in the working tree, ensure LF in the index
+	git ls-files --eol eol.utf16 >actual &&
+	test_cmp expectIndexLF actual&&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no working tree encoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 6/7] convert: add tracing for 'working-tree-encoding' attribute
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-02-24 16:28 ` lars.schneider
  2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  6 siblings, 0 replies; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 25 +++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index d20c341b6d..c4e2fd5fa5 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index a55bbcb6d3..8666669b2d 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test repo' '
 	git config core.eol lf &&
 
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-02-24 16:28 ` [PATCH v8 6/7] convert: add tracing for " lars.schneider
@ 2018-02-24 16:28 ` lars.schneider
  2018-02-25 19:50   ` Eric Sunshine
  6 siblings, 1 reply; 20+ messages in thread
From: lars.schneider @ 2018-02-24 16:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/config.txt         |  6 ++++
 Documentation/gitattributes.txt  |  8 +++++
 config.c                         |  5 +++
 convert.c                        | 74 ++++++++++++++++++++++++++++++++++++++++
 convert.h                        |  1 +
 environment.c                    |  1 +
 t/t0028-working-tree-encoding.sh | 41 ++++++++++++++++++++++
 7 files changed, 136 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
 	This variable can be set to 'input',
 	in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+	A comma separated list of encodings that Git performs UTF-8 round
+	trip checks on if they are used in an `working-tree-encoding`
+	attribute (see linkgit:gitattributes[5]). The default value is
+	`SHIFT-JIS`.
+
 core.symlinks::
 	If false, symbolic links are checked out as small plain files that
 	contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index eddeaee1f7..11315054f4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -310,6 +310,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkroundtripencoding")) {
+		check_roundtrip_encoding = xstrdup(value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.notesref")) {
 		notes_ref_name = xstrdup(value);
 		return 0;
diff --git a/convert.c b/convert.c
index c4e2fd5fa5..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
 	strbuf_release(&trace);
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+	/*
+	 * check_roundtrip_encoding contains a string of space and/or
+	 * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+	 * Search for the given encoding in that string.
+	 */
+	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+	const char *next = found + strlen(enc_name);
+	int len = strlen(check_roundtrip_encoding);
+	return (found && (
+			/*
+			 * check that the found encoding is at the
+			 * beginning of check_roundtrip_encoding or
+			 * that it is prefixed with a space or comma
+			 */
+			found == check_roundtrip_encoding || (
+				found > check_roundtrip_encoding &&
+				(*(found-1) == ' ' || *(found-1) == ',')
+			)
+		) && (
+			/*
+			 * check that the found encoding is at the
+			 * end of check_roundtrip_encoding or
+			 * that it is suffixed with a space or comma
+			 */
+			next == check_roundtrip_encoding + len || (
+				next < check_roundtrip_encoding + len &&
+				(*next == ' ' || *next == ',')
+			)
+		));
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -366,6 +399,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	}
 	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
+	/*
+	 * UTF supports lossless conversion round tripping [1] and conversions
+	 * between UTF and other encodings are mostly round trip safe as
+	 * Unicode aims to be a superset of all other character encodings.
+	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+	 * trip issues [2]. Check the round trip conversion for all encodings
+	 * listed in core.checkRoundtripEncoding.
+	 *
+	 * The round trip check is only performed if content is written to Git.
+	 * This ensures that no information is lost during conversion to/from
+	 * the internal UTF-8 representation.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without an iconv error. Iconv errors
+	 * are already caught above.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			die(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
 }
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };
 
 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
 						  const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 8666669b2d..e34c21eb29 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -225,4 +225,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
 	git reset --hard $BEFORE_STATE
 '
 
+test_expect_success 'check roundtrip encoding' '
+	text="hallo there!\nroundtrip test here!" &&
+	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+	# SHIFT-JIS encoded files are round-trip checked by default...
+	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# ... unless we overwrite the Git config!
+	test_config core.checkRoundtripEncoding "garbage" &&
+	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	test_unconfig core.checkRoundtripEncoding &&
+	git reset &&
+
+	# UTF-16 encoded files should not be round-trip checked by default...
+	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
+	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	# (here we also check that the casing of the encoding is irrelevant)
+	test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
+	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# cleanup
+	rm roundtrip.shift roundtrip.utf16 &&
+	git reset --hard HEAD
+'
+
 test_done
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-02-25  3:41   ` Eric Sunshine
  2018-02-25 11:35     ` Lars Schneider
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sunshine @ 2018-02-25  3:41 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
> or UTF-32LE a BOM must not be used [1]. The function returns true if
> this is the case.
>
> [1] http://unicode.org/faq/utf_bom.html#bom10
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/utf8.c b/utf8.c
> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +       return (
> +         (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
> +         (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +          has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +       ) || (
> +         (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
> +         (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +          has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +       );
> +}

Is this interpretation correct? When I read [1], I interpret it as
saying that no BOM _of any sort_ should be present when the encoding
is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. This
code, on the other hand, only checks for BOMs corresponding to the
declared size (16 or 32 bits).

I suppose the intention of [1] is to detect a mismatch between the
declared encoding and how the stream is actually encoded. The check
implemented here will fail to detect a mismatch between, say, declared
encoding UTF-16BE and actual encoding UTF-32BE.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
@ 2018-02-25  3:52   ` Eric Sunshine
  2018-02-25 11:41     ` Lars Schneider
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sunshine @ 2018-02-25  3:52 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> is_missing_required_utf_bom() function returns true if a required BOM
> is missing.
>
> The Unicode standard instructs to assume big-endian if there in no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/utf8.h b/utf8.h
> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
> +/*
> + * If the endianness is not defined in the encoding name, then we
> + * require a BOM. The function returns true if a required BOM is missing.
> + *
> + * The Unicode standard instructs to assume big-endian if there
> + * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
> + * encoding standard used in HTML5 recommends to assume
> + * little-endian to "deal with deployed content" [3].

Perhaps you could tack on to the comment here the final bit of
explanation from the commit message which ties these conflicting
recommendations together. In particular:

    Therefore, strictly requiring a BOM seems to be the
    safest option for content in Git.

> + */
> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-02-25  7:15   ` Eric Sunshine
  2018-02-27 11:16     ` Lars Schneider
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sunshine @ 2018-02-25  7:15 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> @@ -272,6 +272,84 @@ few exceptions.  Even though...
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
> +  versions (as of March 2018) do not support the `working-tree-encoding`
> +  attribute. If you decide to use the `working-tree-encoding` attribute
> +  in your repository, then it is strongly recommended to ensure that all
> +  clients working with the repository support it.
> +
> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
> +  stored as UTF-8 internally. A client without `working-tree-encoding`
> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
> +  typically cause trouble for the users of this file.

The above paragraph is giving an example of the scenario described in
the paragraph above it. It might, therefore, be clearer to start this
paragraph with "For example,...". Also, as an aid to non-Windows
developers, it might be helpful to introduce ".proj" files as
"Microsoft Visual Studio `*.proj` files...".

> diff --git a/convert.c b/convert.c
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static struct encoding {
> +       const char *name;
> +       struct encoding *next;
> +} *encoding, **encoding_tail;

Why does this struct need to be a linked-list since it contains only a
name? In fact, why do the struct and linked list exist at all? Back in
v1 when the struct held more members, it made sense to place them in a
collection so they could be re-used, but today it just seems an
unnecessary artifact which buys you nothing.

Is the plan that some future patch series will add members to the
struct? If not, then it seems that it should be retired altogether.

> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +                        struct strbuf *buf, struct encoding *enc, int conv_flags)
> +{
> +       [...]
> +       if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +               const char *error_msg = _(
> +                       "BOM is prohibited in '%s' if encoded as %s");
> +               /*
> +                * This advise is shown for UTF-??BE and UTF-??LE encodings.

s/advise/advice/

> +                * We truncate the encoding name to 6 chars with %.6s to cut
> +                * off the last two "byte order" characters.
> +                */
> +               const char *advise_msg = _(
> +                       "The file '%s' contains a byte order mark (BOM). "
> +                       "Please use %.6s as working-tree-encoding.");
> +               advise(advise_msg, path, enc->name);
> +               if (conv_flags & CONV_WRITE_OBJECT)
> +                       die(error_msg, path, enc->name);
> +               else
> +                       error(error_msg, path, enc->name);

What is the intended behavior in the non-die() case? As implemented,
this is still going to attempt the conversion even though it threw an
error. Should it be returning 0 here instead? Same question for the
couple additional cases below.

> +
> +       } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
> +               const char *error_msg = _(
> +                       "BOM is required in '%s' if encoded as %s");
> +               const char *advise_msg = _(
> +                       "The file '%s' is missing a byte order mark (BOM). "
> +                       "Please use %sBE or %sLE (depending on the byte order) "
> +                       "as working-tree-encoding.");
> +               advise(advise_msg, path, enc->name, enc->name);
> +               if (conv_flags & CONV_WRITE_OBJECT)
> +                       die(error_msg, path, enc->name);
> +               else
> +                       error(error_msg, path, enc->name);
> +       }

For a function which presumably is agnostic about the working-tree
encoding (supporting whatever iconv does), it nevertheless has
unusually intimate knowledge about UTF BOMs, in general, and the
internals of has_prohibited_utf_bom() and
is_missing_required_utf_bom(), in particular. It might be cleaner, and
would simplify this function, if all this UTF BOM-specific gunk was
moved elsewhere and generalized into a "validate conversion" function.

As this series is already at v8, perhaps such cleanup could be done as
a follow-on series (by someone) rather than re-rolling.

> +       dst = reencode_string_len(src, src_len, default_encoding, enc->name,
> +                                 &dst_len);
> +       if (!dst) {
> +               /*
> +                * We could add the blob "as-is" to Git. However, on checkout
> +                * we would try to reencode to the original encoding. This
> +                * would fail and we would leave the user with a messed-up
> +                * working tree. Let's try to avoid this by screaming loud.
> +                */
> +               const char* msg = _("failed to encode '%s' from %s to %s");
> +               if (conv_flags & CONV_WRITE_OBJECT)
> +                       die(msg, path, enc->name, default_encoding);
> +               else
> +                       error(msg, path, enc->name, default_encoding);
> +       }
> +
> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +       return 1;
> +}
> +
> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
> +                             struct strbuf *buf, struct encoding *enc)
> +{
> +       [...]
> +       dst = reencode_string_len(src, src_len, enc->name, default_encoding,
> +                                 &dst_len);
> +       if (!dst) {
> +               error("failed to encode '%s' from %s to %s",
> +                       path, enc->name, default_encoding);

The last two arguments need to be swapped.

> +               return 0;
> +       }
> +
> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +       return 1;
> +}
> @@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> +{
> +       if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +           !strlen(value))
> +               return NULL;
> +
> +       for (enc = encoding; enc; enc = enc->next)
> +               if (!strcasecmp(value, enc->name))
> +                       return enc;
> +
> +       /* Don't encode to the default encoding */
> +       if (!strcasecmp(value, default_encoding))
> +               return NULL;

You could handle this easy-to-detect case before attempting the more
expensive loop just above. (Not necessarily worth a re-roll.)

> +       enc = xcalloc(1, sizeof(*enc));
> +       /*
> +        * Ensure encoding names are always upper case (e.g. UTF-8) to
> +        * simplify subsequent string comparisons.
> +        */
> +       enc->name = xstrdup_toupper(value);
> +       *encoding_tail = enc;
> +       encoding_tail = &(enc->next);
> +
> +       return enc;
> +}
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -0,0 +1,226 @@
> +test_expect_success 'ensure UTF-8 is stored in Git' '
> +       git cat-file -p :test.utf16 >test.utf16.git &&
> +       test_cmp_bin test.utf8.raw test.utf16.git &&
> +
> +       # cleanup
> +       rm test.utf16.git
> +'

This cleanup won't take place if test_cmp_bin() fails. Instead,
cleanups are typically handled by test_when_finished() to ensure that
the cleanup action occurs even if the test fails.

    test_expect_success 'ensure UTF-8 is stored in Git' '
        test_when_finished "rm -f  test.utf16.git" &&
        ...

Same comment for remaining tests.

> +test_expect_success 'check prohibited UTF BOM' '
> +       printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
> +       printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
> +       printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
> +       printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
> +
> +       printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
> +       printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
> +       printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
> +       printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&

These resources seem to be used by multiple tests. Perhaps create them
in a distinct "setup" test instead of bundling them in this test?

> +       echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
> +       echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
> +       echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
> +       echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
> +
> +       # Here we add a UTF-16 files with BOM (big-endian and little-endian)
> +       # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
> +       # the BOM is prohibited.
> +       cp bebom.utf16be.raw bebom.utf16be &&
> +       test_must_fail git add bebom.utf16be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +       cp lebom.utf16le.raw lebom.utf16be &&
> +       test_must_fail git add lebom.utf16be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +       cp bebom.utf16be.raw bebom.utf16le &&
> +       test_must_fail git add bebom.utf16le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +       cp lebom.utf16le.raw lebom.utf16le &&
> +       test_must_fail git add lebom.utf16le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +       # ... and the same for UTF-32
> +       cp bebom.utf32be.raw bebom.utf32be &&
> +       test_must_fail git add bebom.utf32be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +       cp lebom.utf32le.raw lebom.utf32be &&
> +       test_must_fail git add lebom.utf32be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +       cp bebom.utf32be.raw bebom.utf32le &&
> +       test_must_fail git add bebom.utf32le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +       cp lebom.utf32le.raw lebom.utf32le &&
> +       test_must_fail git add lebom.utf32le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +       # cleanup
> +       git reset --hard HEAD
> +'

Clumping all these checks into the same test makes it difficult to
determine which one failed (if one does fail). Also, the amount of
copy/paste code is easy to get wrong and a bit onerous to review.
Automating via for-loops would address these concerns. For instance,
to check all combinations (not just 8 combinations tested above):

    for i in 16be 16le 32be 32le
    do
        for j in 16be 16le 32be 32le
        do
            test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
                test_when_finished "git reset --hard HEAD" &&
                cp bebom.utf$i.raw bebom.utf$j &&
                test_must_fail git add bebom.utf$j 2>err.out &&
                J=$(echo $j | tr bel BEL) &&
                test_i18ngrep "fatal: BOM is prohibited .* UTF-$J" err.out
            '
        done
    done

Alternately, the test could be moved to a function and called for each
combination:

    check_prohibited_bom () {
        i=$1
        j=$2
        test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
            ...
        '
    }
    check_prohibited_bom 16be 16be
    check_prohibited_bom 16be 16le
    ...

Same comment regarding copy/paste in tests below.

Probably not worth a re-roll.

> +test_expect_success 'check required UTF BOM' '
> +       echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
> +
> +       cp nobom.utf16be.raw nobom.utf16 &&
> +       test_must_fail git add nobom.utf16 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +       cp nobom.utf16le.raw nobom.utf16 &&
> +       test_must_fail git add nobom.utf16 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +       cp nobom.utf32be.raw nobom.utf32 &&
> +       test_must_fail git add nobom.utf32 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +       cp nobom.utf32le.raw nobom.utf32 &&
> +       test_must_fail git add nobom.utf32 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +       # cleanup
> +       rm nobom.utf16 nobom.utf32 &&
> +       git reset --hard HEAD
> +'

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-25  3:41   ` Eric Sunshine
@ 2018-02-25 11:35     ` Lars Schneider
  2018-02-27  5:17       ` Eric Sunshine
  0 siblings, 1 reply; 20+ messages in thread
From: Lars Schneider @ 2018-02-25 11:35 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
>> Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
>> or UTF-32LE a BOM must not be used [1]. The function returns true if
>> this is the case.
>> 
>> [1] http://unicode.org/faq/utf_bom.html#bom10
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
>> +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +       return (
>> +         (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
>> +         (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +          has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +       ) || (
>> +         (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
>> +         (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +          has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +       );
>> +}
> 
> Is this interpretation correct? When I read [1], I interpret it as
> saying that no BOM _of any sort_ should be present when the encoding
> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.

Correct!

> This
> code, on the other hand, only checks for BOMs corresponding to the
> declared size (16 or 32 bits).

Hmm. Interesting thought. You are saying that my code won't complain if
a document declared as UTF-16LE has a UTF32-LE BOM, correct? I would say
this is correct behavior in context of this function. This function assumes
that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
declared with respect to its BOM in the .gitattributes. Would this
comment make it more clear to you?

	/*
	 * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
	 * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
	 * The function returns true if this rule is violated.
	 *
	 * [1] http://unicode.org/faq/utf_bom.html#bom10
	 */


I think what you are referring to is a different class of error and
would therefore warrant its own checker function. Would you agree?


> I suppose the intention of [1] is to detect a mismatch between the
> declared encoding and how the stream is actually encoded. The check
> implemented here will fail to detect a mismatch between, say, declared
> encoding UTF-16BE and actual encoding UTF-32BE.

As stated above the intention is to detect wrong BOMs! I think we cannot 
detect the "declared as UTF-16BE but actually UTF-32BE" error.

Consider this:

printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
0000000   \0   t  \0   e  \0   s  \0   t
0000010

In the first step we "encode" the string to UTF-32BE and then we "decode" it as
UTF-16BE. The result is valid although not correct. Does this make sense?

Thanks,
Lars



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-25  3:52   ` Eric Sunshine
@ 2018-02-25 11:41     ` Lars Schneider
  0 siblings, 0 replies; 20+ messages in thread
From: Lars Schneider @ 2018-02-25 11:41 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 25 Feb 2018, at 04:52, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> is_missing_required_utf_bom() function returns true if a required BOM
>> is missing.
>> 
>> The Unicode standard instructs to assume big-endian if there in no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/utf8.h b/utf8.h
>> @@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
>> +/*
>> + * If the endianness is not defined in the encoding name, then we
>> + * require a BOM. The function returns true if a required BOM is missing.
>> + *
>> + * The Unicode standard instructs to assume big-endian if there
>> + * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
>> + * encoding standard used in HTML5 recommends to assume
>> + * little-endian to "deal with deployed content" [3].
> 
> Perhaps you could tack on to the comment here the final bit of
> explanation from the commit message which ties these conflicting
> recommendations together. In particular:
> 
>    Therefore, strictly requiring a BOM seems to be the
>    safest option for content in Git.

Agreed. I'll change it.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
@ 2018-02-25 19:50   ` Eric Sunshine
  2018-03-04 19:08     ` Lars Schneider
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sunshine @ 2018-02-25 19:50 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sat, Feb 24, 2018 at 11:28 AM,  <lars.schneider@autodesk.com> wrote:
> UTF supports lossless conversion round tripping and conversions between
> UTF and other encodings are mostly round trip safe as Unicode aims to be
> a superset of all other character encodings. However, certain encodings
> (e.g. SHIFT-JIS) are known to have round trip issues [1].
>
> Add 'core.checkRoundtripEncoding', which contains a comma separated
> list of encodings, to define for what encodings Git should check the
> conversion round trip if they are used in the 'working-tree-encoding'
> attribute.
>
> Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.
>
> [1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
> +static int check_roundtrip(const char* enc_name)
> +{
> +       /*
> +        * check_roundtrip_encoding contains a string of space and/or
> +        * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
> +        * Search for the given encoding in that string.
> +        */
> +       const char *found = strcasestr(check_roundtrip_encoding, enc_name);
> +       const char *next = found + strlen(enc_name);

Is this pointer arithmetic undefined behavior (according to the C
standard) if NULL is returned by strcasestr()? If so, how comfortable
are we with this? Perhaps if you add an 'if' into the mix, you can
avoid it:

    if (found) {
        const char *next = found + strlen(enc_name);
        return ...long complicated expression...;
    }
    return false;

> +       int len = strlen(check_roundtrip_encoding);
> +       return (found && (
> +                       /*
> +                        * check that the found encoding is at the
> +                        * beginning of check_roundtrip_encoding or
> +                        * that it is prefixed with a space or comma
> +                        */
> +                       found == check_roundtrip_encoding || (
> +                               found > check_roundtrip_encoding &&
> +                               (*(found-1) == ' ' || *(found-1) == ',')

Although the 'found > check_roundtrip_encoding' expression is
effectively dead code in this case, it does document that the
programmer didn't just blindly use '*(found-1)' without taking
boundary conditions into consideration.

(It's dead code because, at this point, we know that strcasestr()
didn't return NULL and we know that 'found' doesn't equal
'check_roundtrip_encoding', therefore it _must_ be greater than
'check_roundtrip_encoding'.)

Just an observation; not a request to remove the dead code since it
has some documentation value.

> +                       )
> +               ) && (
> +                       /*
> +                        * check that the found encoding is at the
> +                        * end of check_roundtrip_encoding or
> +                        * that it is suffixed with a space or comma
> +                        */
> +                       next == check_roundtrip_encoding + len || (
> +                               next < check_roundtrip_encoding + len &&
> +                               (*next == ' ' || *next == ',')
> +                       )
> +               ));
> +}
> @@ -366,6 +399,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
> +       if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
> +               re_src = reencode_string_len(dst, dst_len,
> +                                            enc->name, default_encoding,
> +                                            &re_src_len);
> +
> +               trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
> +               trace_encoding("reencoded source", path, enc->name,
> +                              re_src, re_src_len);
> +
> +               if (!re_src || src_len != re_src_len ||
> +                   memcmp(src, re_src, src_len)) {
> +                       const char* msg = _("encoding '%s' from %s to %s and "
> +                                           "back is not the same");
> +                       die(msg, path, enc->name, default_encoding);

Last two arguments need to be swapped.

> +               }
> +
> +               free(re_src);
> +       }
> +
>         strbuf_attach(buf, dst, dst_len, dst_len + 1);
>         return 1;
>  }
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -225,4 +225,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
> +test_expect_success 'check roundtrip encoding' '
> +       text="hallo there!\nroundtrip test here!" &&
> +       printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> +       printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> +       echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
> +
> +       # SHIFT-JIS encoded files are round-trip checked by default...
> +       GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&

Why redirect to /dev/null? If something does go wrong somewhere, the
more output available for debugging the problem, the better, so
throwing it away unnecessarily seems contraindicated.

> +       git reset &&
> +
> +       # ... unless we overwrite the Git config!
> +       test_config core.checkRoundtripEncoding "garbage" &&
> +       ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&
> +       test_unconfig core.checkRoundtripEncoding &&

The "unconfig" won't take place if the test fails. Instead of
test_config/test_unconfig, you could use '-c' to set the config
transiently for the git-add operation:

    ! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage add ...

> +       git reset &&
> +
> +       # UTF-16 encoded files should not be round-trip checked by default...
> +       ! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for UTF-16" &&
> +       git reset &&
> +
> +       # ... unless we tell Git to check it!
> +       test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
> +       GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for UTF-16" &&
> +       git reset &&
> +
> +       # ... unless we tell Git to check it!
> +       # (here we also check that the casing of the encoding is irrelevant)
> +       test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
> +       GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for UTF-16" &&
> +       git reset &&
> +
> +       # cleanup
> +       rm roundtrip.shift roundtrip.utf16 &&
> +       git reset --hard HEAD
> +'

Same comment as for patch 5/7: This cleanup won't happen if the test
fails. Instead, use test_when_finished() earlier in the test:

    test_expect_success 'check roundtrip encoding' '
        test_when_finished "rm -f roundtrip.shift roundtrip.utf16; git
reset --hard HEAD" &&
        ...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-25 11:35     ` Lars Schneider
@ 2018-02-27  5:17       ` Eric Sunshine
  2018-02-28 21:34         ` Lars Schneider
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sunshine @ 2018-02-27  5:17 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin

On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> Is this interpretation correct? When I read [1], I interpret it as
>> saying that no BOM _of any sort_ should be present when the encoding
>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>
> Correct!
>
>> This code, on the other hand, only checks for BOMs corresponding
>> to the declared size (16 or 32 bits).
>
> Hmm. Interesting thought. You are saying that my code won't complain if
> a document declared as UTF-16LE has a UTF32-LE BOM, correct?

Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE BOM.

My observation was more general in that [1] seems to say that there
should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
or UTF-32LE is declared.

> I would say
> this is correct behavior in context of this function. This function assumes
> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
> declared with respect to its BOM in the .gitattributes. Would this
> comment make it more clear to you?
>         /*
>          * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>          * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
>          * The function returns true if this rule is violated.
>          *
>          * [1] http://unicode.org/faq/utf_bom.html#bom10
>          */
> I think what you are referring to is a different class of error and
> would therefore warrant its own checker function. Would you agree?

I don't understand to what different class of error you refer. The
FAQ[1] seems pretty clear to me in that if one of those declarations
is used explicitly, then there should be _no_ BOM, period. It doesn't
say anything about allowing a BOM for a differently-sized encoding (16
vs 32).

If I squint very hard, I _guess_ I can see how you interpret [1] with
the more narrow meaning of the restriction applying only to a BOM of
the same size as the declared encoding, though reading it that way
doesn't come easily to me.

>> I suppose the intention of [1] is to detect a mismatch between the
>> declared encoding and how the stream is actually encoded. The check
>> implemented here will fail to detect a mismatch between, say, declared
>> encoding UTF-16BE and actual encoding UTF-32BE.
>
> As stated above the intention is to detect wrong BOMs! I think we cannot
> detect the "declared as UTF-16BE but actually UTF-32BE" error.
>
> Consider this:
>
> printf "test" | iconv -f UTF-8 -t UTF-32BE | iconv -f UTF-16BE -t UTF-8 | od -c
> 0000000   \0   t  \0   e  \0   s  \0   t
> 0000010
>
> In the first step we "encode" the string to UTF-32BE and then we "decode" it as
> UTF-16BE. The result is valid although not correct. Does this make sense?

I'm probably being dense, but I don't understand what this is trying
to illustrate in relation to has_prohibited_utf_bom().

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-25  7:15   ` Eric Sunshine
@ 2018-02-27 11:16     ` Lars Schneider
  2018-02-28 21:20       ` Eric Sunshine
  0 siblings, 1 reply; 20+ messages in thread
From: Lars Schneider @ 2018-02-27 11:16 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 25 Feb 2018, at 08:15, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> @@ -272,6 +272,84 @@ few exceptions.  Even though...
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
>> +  versions (as of March 2018) do not support the `working-tree-encoding`
>> +  attribute. If you decide to use the `working-tree-encoding` attribute
>> +  in your repository, then it is strongly recommended to ensure that all
>> +  clients working with the repository support it.
>> +
>> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
>> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
>> +  stored as UTF-8 internally. A client without `working-tree-encoding`
>> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
>> +  typically cause trouble for the users of this file.
> 
> The above paragraph is giving an example of the scenario described in
> the paragraph above it. It might, therefore, be clearer to start this
> paragraph with "For example,...". Also, as an aid to non-Windows
> developers, it might be helpful to introduce ".proj" files as
> "Microsoft Visual Studio `*.proj` files...".

Agreed. How about this?

  For example, Microsoft Visual Studio resources files (`*.rc`) or
  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16. 
  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with 
  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
  stored as UTF-8 internally. A client without `working-tree-encoding`
  support will checkout `foo.ps1` as UTF-8 encoded file. This will
  typically cause trouble for the users of this file.


>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static struct encoding {
>> +       const char *name;
>> +       struct encoding *next;
>> +} *encoding, **encoding_tail;
> 
> Why does this struct need to be a linked-list since it contains only a
> name? In fact, why do the struct and linked list exist at all? Back in
> v1 when the struct held more members, it made sense to place them in a
> collection so they could be re-used, but today it just seems an
> unnecessary artifact which buys you nothing.
> 
> Is the plan that some future patch series will add members to the
> struct? If not, then it seems that it should be retired altogether.

Absolutely! Thank you!


>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +                        struct strbuf *buf, struct encoding *enc, int conv_flags)
>> +{
>> +       [...]
>> +       if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +               const char *error_msg = _(
>> +                       "BOM is prohibited in '%s' if encoded as %s");
>> +               /*
>> +                * This advise is shown for UTF-??BE and UTF-??LE encodings.
> 
> s/advise/advice/

Agreed!


>> +                * We truncate the encoding name to 6 chars with %.6s to cut
>> +                * off the last two "byte order" characters.
>> +                */
>> +               const char *advise_msg = _(
>> +                       "The file '%s' contains a byte order mark (BOM). "
>> +                       "Please use %.6s as working-tree-encoding.");
>> +               advise(advise_msg, path, enc->name);
>> +               if (conv_flags & CONV_WRITE_OBJECT)
>> +                       die(error_msg, path, enc->name);
>> +               else
>> +                       error(error_msg, path, enc->name);
> 
> What is the intended behavior in the non-die() case? As implemented,
> this is still going to attempt the conversion even though it threw an
> error. Should it be returning 0 here instead? Same question for the
> couple additional cases below.

Good question. My intention was to print an error and then let iconv try
the conversion anyways. I agree that this is bogus. Let's return in case
of an error right away.


>> +
>> +       } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>> +               const char *error_msg = _(
>> +                       "BOM is required in '%s' if encoded as %s");
>> +               const char *advise_msg = _(
>> +                       "The file '%s' is missing a byte order mark (BOM). "
>> +                       "Please use %sBE or %sLE (depending on the byte order) "
>> +                       "as working-tree-encoding.");
>> +               advise(advise_msg, path, enc->name, enc->name);
>> +               if (conv_flags & CONV_WRITE_OBJECT)
>> +                       die(error_msg, path, enc->name);
>> +               else
>> +                       error(error_msg, path, enc->name);
>> +       }
> 
> For a function which presumably is agnostic about the working-tree
> encoding (supporting whatever iconv does), it nevertheless has
> unusually intimate knowledge about UTF BOMs, in general, and the
> internals of has_prohibited_utf_bom() and
> is_missing_required_utf_bom(), in particular. It might be cleaner, and
> would simplify this function, if all this UTF BOM-specific gunk was
> moved elsewhere and generalized into a "validate conversion" function.

Agreed! How about this?

	/*
	 * If we convert to an UTF encoding, then check for common BOM errors.
	 */
	if (!memcmp("UTF-", enc, 4))
		if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
			return 0;


>> +       dst = reencode_string_len(src, src_len, default_encoding, enc->name,
>> +                                 &dst_len);
>> +       if (!dst) {
>> +               /*
>> +                * We could add the blob "as-is" to Git. However, on checkout
>> +                * we would try to reencode to the original encoding. This
>> +                * would fail and we would leave the user with a messed-up
>> +                * working tree. Let's try to avoid this by screaming loud.
>> +                */
>> +               const char* msg = _("failed to encode '%s' from %s to %s");
>> +               if (conv_flags & CONV_WRITE_OBJECT)
>> +                       die(msg, path, enc->name, default_encoding);
>> +               else
>> +                       error(msg, path, enc->name, default_encoding);
>> +       }
>> +
>> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> +       return 1;
>> +}
>> +
>> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
>> +                             struct strbuf *buf, struct encoding *enc)
>> +{
>> +       [...]
>> +       dst = reencode_string_len(src, src_len, enc->name, default_encoding,
>> +                                 &dst_len);
>> +       if (!dst) {
>> +               error("failed to encode '%s' from %s to %s",
>> +                       path, enc->name, default_encoding);
> 
> The last two arguments need to be swapped.

Nice catch!


>> +               return 0;
>> +       }
>> +
>> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> +       return 1;
>> +}
>> @@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +       if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +           !strlen(value))
>> +               return NULL;
>> +
>> +       for (enc = encoding; enc; enc = enc->next)
>> +               if (!strcasecmp(value, enc->name))
>> +                       return enc;
>> +
>> +       /* Don't encode to the default encoding */
>> +       if (!strcasecmp(value, default_encoding))
>> +               return NULL;
> 
> You could handle this easy-to-detect case before attempting the more
> expensive loop just above. (Not necessarily worth a re-roll.)

True, but I delete the loop anyways when I removed the "encoding 
linked list" as suggested by your comment above.


>> +       enc = xcalloc(1, sizeof(*enc));
>> +       /*
>> +        * Ensure encoding names are always upper case (e.g. UTF-8) to
>> +        * simplify subsequent string comparisons.
>> +        */
>> +       enc->name = xstrdup_toupper(value);
>> +       *encoding_tail = enc;
>> +       encoding_tail = &(enc->next);
>> +
>> +       return enc;
>> +}
>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -0,0 +1,226 @@
>> +test_expect_success 'ensure UTF-8 is stored in Git' '
>> +       git cat-file -p :test.utf16 >test.utf16.git &&
>> +       test_cmp_bin test.utf8.raw test.utf16.git &&
>> +
>> +       # cleanup
>> +       rm test.utf16.git
>> +'
> 
> This cleanup won't take place if test_cmp_bin() fails. Instead,
> cleanups are typically handled by test_when_finished() to ensure that
> the cleanup action occurs even if the test fails.
> 
>    test_expect_success 'ensure UTF-8 is stored in Git' '
>        test_when_finished "rm -f  test.utf16.git" &&
>        ...
> 
> Same comment for remaining tests.

Agreed! That's the proper way to cleanup the tests. I'll fix my tests.


>> +test_expect_success 'check prohibited UTF BOM' '
>> +       printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
>> +       printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
>> +       printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
>> +       printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
>> +
>> +       printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
>> +       printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
>> +       printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
>> +       printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
> 
> These resources seem to be used by multiple tests. Perhaps create them
> in a distinct "setup" test instead of bundling them in this test?

Good idea!


>> +       echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
>> +       echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
>> +       echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
>> +       echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
>> +
>> +       # Here we add a UTF-16 files with BOM (big-endian and little-endian)
>> +       # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
>> +       # the BOM is prohibited.
>> +       cp bebom.utf16be.raw bebom.utf16be &&
>> +       test_must_fail git add bebom.utf16be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> +       cp lebom.utf16le.raw lebom.utf16be &&
>> +       test_must_fail git add lebom.utf16be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> +       cp bebom.utf16be.raw bebom.utf16le &&
>> +       test_must_fail git add bebom.utf16le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> +       cp lebom.utf16le.raw lebom.utf16le &&
>> +       test_must_fail git add lebom.utf16le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> +       # ... and the same for UTF-32
>> +       cp bebom.utf32be.raw bebom.utf32be &&
>> +       test_must_fail git add bebom.utf32be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> +       cp lebom.utf32le.raw lebom.utf32be &&
>> +       test_must_fail git add lebom.utf32be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> +       cp bebom.utf32be.raw bebom.utf32le &&
>> +       test_must_fail git add bebom.utf32le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> +       cp lebom.utf32le.raw lebom.utf32le &&
>> +       test_must_fail git add lebom.utf32le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> +       # cleanup
>> +       git reset --hard HEAD
>> +'
> 
> Clumping all these checks into the same test makes it difficult to
> determine which one failed (if one does fail). Also, the amount of
> copy/paste code is easy to get wrong and a bit onerous to review.
> Automating via for-loops would address these concerns. For instance,
> to check all combinations (not just 8 combinations tested above):
> 
>    for i in 16be 16le 32be 32le
>    do
>        for j in 16be 16le 32be 32le
>        do
>            test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
>                test_when_finished "git reset --hard HEAD" &&
>                cp bebom.utf$i.raw bebom.utf$j &&
>                test_must_fail git add bebom.utf$j 2>err.out &&
>                J=$(echo $j | tr bel BEL) &&
>                test_i18ngrep "fatal: BOM is prohibited .* UTF-$J" err.out
>            '
>        done
>    done
> 
> Alternately, the test could be moved to a function and called for each
> combination:
> 
>    check_prohibited_bom () {
>        i=$1
>        j=$2
>        test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
>            ...
>        '
>    }
>    check_prohibited_bom 16be 16be
>    check_prohibited_bom 16be 16le
>    ...
> 
> Same comment regarding copy/paste in tests below.

In general I share your concern about code duplication.
However, I don't want be "too clever" in testing code
as it becomes hard to read it. Therefore, I implemented
a loop over "16 32" which is a good compromise I think.

 
Thanks you very much for this thorough review,
Lars

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-27 11:16     ` Lars Schneider
@ 2018-02-28 21:20       ` Eric Sunshine
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Sunshine @ 2018-02-28 21:20 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin

On Tue, Feb 27, 2018 at 6:16 AM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>> On 25 Feb 2018, at 08:15, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
>> The above paragraph is giving an example of the scenario described in
>> the paragraph above it. It might, therefore, be clearer to start this
>> paragraph with "For example,...". Also, as an aid to non-Windows
>> developers, it might be helpful to introduce ".proj" files as
>> "Microsoft Visual Studio `*.proj` files...".
>
> Agreed. How about this?
>
>   For example, Microsoft Visual Studio resources files (`*.rc`) or
>   PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
>   If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
>   a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
>   stored as UTF-8 internally. A client without `working-tree-encoding`
>   support will checkout `foo.ps1` as UTF-8 encoded file. This will
>   typically cause trouble for the users of this file.

Since "*.proj" is mentioned a number of times in paragraphs below this
one, perhaps just stick with it instead of introducing "*.rc" and
"*.ps1", which don't necessarily add value. Alternately, if you use
.rc and .ps1, then change the remaining .proj references to follow
suit.

>>> diff --git a/convert.c b/convert.c
>>> +       } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>>> +               const char *error_msg = _(
>>> +                       "BOM is required in '%s' if encoded as %s");
>>> +               const char *advise_msg = _(
>>> +                       "The file '%s' is missing a byte order mark (BOM). "
>>> +                       "Please use %sBE or %sLE (depending on the byte order) "
>>> +                       "as working-tree-encoding.");
>>> +               advise(advise_msg, path, enc->name, enc->name);
>>> +               if (conv_flags & CONV_WRITE_OBJECT)
>>> +                       die(error_msg, path, enc->name);
>>> +               else
>>> +                       error(error_msg, path, enc->name);
>>> +       }
>>
>> For a function which presumably is agnostic about the working-tree
>> encoding (supporting whatever iconv does), it nevertheless has
>> unusually intimate knowledge about UTF BOMs, in general, and the
>> internals of has_prohibited_utf_bom() and
>> is_missing_required_utf_bom(), in particular. It might be cleaner, and
>> would simplify this function, if all this UTF BOM-specific gunk was
>> moved elsewhere and generalized into a "validate conversion" function.
>
> Agreed! How about this?
>
>         /*
>          * If we convert to an UTF encoding, then check for common BOM errors.
>          */
>         if (!memcmp("UTF-", enc, 4))
>                 if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
>                         return 0;

Better. The comment merely repeats the code, which is clear in its
intent, thus doesn't really add value (but that's a minor point).

I'd probably generalize it further to a "validate conversion" function
(which itself would have this UTF-specific knowledge), but that's a
matter of taste and can be done later when/if other types of
validation requirements arise.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-27  5:17       ` Eric Sunshine
@ 2018-02-28 21:34         ` Lars Schneider
  0 siblings, 0 replies; 20+ messages in thread
From: Lars Schneider @ 2018-02-28 21:34 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 27 Feb 2018, at 06:17, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sun, Feb 25, 2018 at 6:35 AM, Lars Schneider
> <larsxschneider@gmail.com> wrote:
>>> On 25 Feb 2018, at 04:41, Eric Sunshine <sunshine@sunshineco.com> wrote:
>>> Is this interpretation correct? When I read [1], I interpret it as
>>> saying that no BOM _of any sort_ should be present when the encoding
>>> is declared as one of UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
>> 
>> Correct!
>> 
>>> This code, on the other hand, only checks for BOMs corresponding
>>> to the declared size (16 or 32 bits).
>> 
>> Hmm. Interesting thought. You are saying that my code won't complain if
>> a document declared as UTF-16LE has a UTF32-LE BOM, correct?
> 
> Well, not specifically that case since UTF-16LE BOM is a subset of UTF32-LE BOM.

Correct - bad example on my part!


> My observation was more general in that [1] seems to say that there
> should be _no_ BOM whatsoever if one of UTF-16BE, UTF-16LE, UTF-32BE,
> or UTF-32LE is declared.

You are saying that a document declared as UTF-16LE must not start 
with 0000feff (UTF-32BE BOM)? I interpreted that situation as a "feff"
in the middle of a file and therefore the BOM should be treated as
ZWNBSP as explained here: http://unicode.org/faq/utf_bom.html#bom6

Plus, if "_no_ BOM whatsoever" is allowed then wouldn't we need to check
for UTF-1, UTF-7, and UTF-8 BOM's too?

I dunno.


>> I would say
>> this is correct behavior in context of this function. This function assumes
>> that the document is proper UTF-16/UTF-16BE/UTF-16LE but it is wrongly
>> declared with respect to its BOM in the .gitattributes. Would this
>> comment make it more clear to you?
>>        /*
>>         * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
>>         * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
>>         * The function returns true if this rule is violated.
>>         *
>>         * [1] http://unicode.org/faq/utf_bom.html#bom10
>>         */
>> I think what you are referring to is a different class of error and
>> would therefore warrant its own checker function. Would you agree?
> 
> I don't understand to what different class of error you refer. The
> FAQ[1] seems pretty clear to me in that if one of those declarations
> is used explicitly, then there should be _no_ BOM, period. It doesn't
> say anything about allowing a BOM for a differently-sized encoding (16
> vs 32).
> 
> If I squint very hard, I _guess_ I can see how you interpret [1] with
> the more narrow meaning of the restriction applying only to a BOM of
> the same size as the declared encoding, though reading it that way
> doesn't come easily to me.

For me it is somewhat the other way around :-)
Since I am not sure what is right, I decided to ask the Internet:
https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream

Let's see if someone has a good answer.

- Lars


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-02-25 19:50   ` Eric Sunshine
@ 2018-03-04 19:08     ` Lars Schneider
  2018-03-04 19:58       ` Eric Sunshine
  0 siblings, 1 reply; 20+ messages in thread
From: Lars Schneider @ 2018-03-04 19:08 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 25 Feb 2018, at 20:50, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sat, Feb 24, 2018 at 11:28 AM,  <lars.schneider@autodesk.com> wrote:
>> UTF supports lossless conversion round tripping and conversions between
>> UTF and other encodings are mostly round trip safe as Unicode aims to be
>> a superset of all other character encodings. However, certain encodings
>> (e.g. SHIFT-JIS) are known to have round trip issues [1].
>> 
>> Add 'core.checkRoundtripEncoding', which contains a comma separated
>> list of encodings, to define for what encodings Git should check the
>> conversion round trip if they are used in the 'working-tree-encoding'
>> attribute.
>> 
>> Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.
>> 
>> [1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
>> +static int check_roundtrip(const char* enc_name)
>> +{
>> +       /*
>> +        * check_roundtrip_encoding contains a string of space and/or
>> +        * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
>> +        * Search for the given encoding in that string.
>> +        */
>> +       const char *found = strcasestr(check_roundtrip_encoding, enc_name);
>> +       const char *next = found + strlen(enc_name);
> 
> Is this pointer arithmetic undefined behavior (according to the C
> standard) if NULL is returned by strcasestr()? If so, how comfortable
> are we with this? Perhaps if you add an 'if' into the mix, you can
> avoid it:
> 
>    if (found) {
>        const char *next = found + strlen(enc_name);
>        return ...long complicated expression...;
>    }
>    return false;

OK. I've fixed it this way:

	if (!found)
		return 0;


[...]
>> +
>> +               if (!re_src || src_len != re_src_len ||
>> +                   memcmp(src, re_src, src_len)) {
>> +                       const char* msg = _("encoding '%s' from %s to %s and "
>> +                                           "back is not the same");
>> +                       die(msg, path, enc->name, default_encoding);
> 
> Last two arguments need to be swapped.

Hm. Are you sure? I think it is correct as it is. We are in encode_to_git()
here and that means we encode *to* "default encoding", no?


>> +               }
>> +
>> +               free(re_src);
>> +       }
>> +
>>        strbuf_attach(buf, dst, dst_len, dst_len + 1);
>>        return 1;
>> }
>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -225,4 +225,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
>> +test_expect_success 'check roundtrip encoding' '
>> +       text="hallo there!\nroundtrip test here!" &&
>> +       printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
>> +       printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
>> +       echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
>> +
>> +       # SHIFT-JIS encoded files are round-trip checked by default...
>> +       GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
>> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&
> 
> Why redirect to /dev/null? If something does go wrong somewhere, the
> more output available for debugging the problem, the better, so
> throwing it away unnecessarily seems contraindicated.

OK!


>> +       git reset &&
>> +
>> +       # ... unless we overwrite the Git config!
>> +       test_config core.checkRoundtripEncoding "garbage" &&
>> +       ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
>> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&
>> +       test_unconfig core.checkRoundtripEncoding &&
> 
> The "unconfig" won't take place if the test fails. Instead of
> test_config/test_unconfig, you could use '-c' to set the config
> transiently for the git-add operation:
> 
>    ! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage add ...

Agreed. Although test_config (in t/test-lib-functions.sh) automatically 
unsets itself after the test is over. 


>> +       git reset &&
>> +
>> +       # UTF-16 encoded files should not be round-trip checked by default...
>> +       ! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
>> +               grep "Checking roundtrip encoding for UTF-16" &&
>> +       git reset &&
>> +
>> +       # ... unless we tell Git to check it!
>> +       test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
>> +       GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
>> +               grep "Checking roundtrip encoding for UTF-16" &&
>> +       git reset &&
>> +
>> +       # ... unless we tell Git to check it!
>> +       # (here we also check that the casing of the encoding is irrelevant)
>> +       test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
>> +       GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
>> +               grep "Checking roundtrip encoding for UTF-16" &&
>> +       git reset &&
>> +
>> +       # cleanup
>> +       rm roundtrip.shift roundtrip.utf16 &&
>> +       git reset --hard HEAD
>> +'
> 
> Same comment as for patch 5/7: This cleanup won't happen if the test
> fails. Instead, use test_when_finished() earlier in the test:
> 
>    test_expect_success 'check roundtrip encoding' '
>        test_when_finished "rm -f roundtrip.shift roundtrip.utf16; git
> reset --hard HEAD" &&
>        ...

Agreed!

Thanks you,
Lars


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-04 19:08     ` Lars Schneider
@ 2018-03-04 19:58       ` Eric Sunshine
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Sunshine @ 2018-03-04 19:58 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin

On Sun, Mar 4, 2018 at 2:08 PM, Lars Schneider <larsxschneider@gmail.com> wrote:
>> On 25 Feb 2018, at 20:50, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> On Sat, Feb 24, 2018 at 11:28 AM,  <lars.schneider@autodesk.com> wrote:
>>> +               if (!re_src || src_len != re_src_len ||
>>> +                   memcmp(src, re_src, src_len)) {
>>> +                       const char* msg = _("encoding '%s' from %s to %s and "
>>> +                                           "back is not the same");
>>> +                       die(msg, path, enc->name, default_encoding);
>>
>> Last two arguments need to be swapped.
>
> Hm. Are you sure? I think it is correct as it is. We are in encode_to_git()
> here and that means we encode *to* "default encoding", no?

Okay. I guess I was just looking at the most recent
reencode_string_len() -- and maybe overlooked the "and back" -- and
was thinking that this error message applied directly to it, but I see
your point about the error saying something about encode_to_git()
overall, in which case I agree with you.

>>> +       test_config core.checkRoundtripEncoding "garbage" &&
>>> +       ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
>>> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&
>>> +       test_unconfig core.checkRoundtripEncoding &&
>>
>> The "unconfig" won't take place if the test fails. Instead of
>> test_config/test_unconfig, you could use '-c' to set the config
>> transiently for the git-add operation:
>>
>>    ! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage add ...
>
> Agreed. Although test_config (in t/test-lib-functions.sh) automatically
> unsets itself after the test is over.

Yep, so you could get by with that alone. The test_unconfig() simply
isn't needed.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-03-04 19:58 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-24 16:27 ` [PATCH v8 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-25  3:41   ` Eric Sunshine
2018-02-25 11:35     ` Lars Schneider
2018-02-27  5:17       ` Eric Sunshine
2018-02-28 21:34         ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-25  3:52   ` Eric Sunshine
2018-02-25 11:41     ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-25  7:15   ` Eric Sunshine
2018-02-27 11:16     ` Lars Schneider
2018-02-28 21:20       ` Eric Sunshine
2018-02-24 16:28 ` [PATCH v8 6/7] convert: add tracing for " lars.schneider
2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-25 19:50   ` Eric Sunshine
2018-03-04 19:08     ` Lars Schneider
2018-03-04 19:58       ` Eric Sunshine

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).