git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v4 0/6]  convert: add support for different encodings
@ 2018-01-20 15:24 lars.schneider
  2018-01-20 15:24 ` [PATCH v4 1/6] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-4 and 6 are preparation and helper functions.
Patch 5 is the actual change.

This series depends on Torsten's "convert_to_git(): safe_crlf/checksafe
becomes int conv_flags" patch:
https://public-inbox.org/git/20180113224931.27031-1-tboegi@web.de/

Changes since v3:

* I renamed the attribute from "checkout-encoding" to "working-tree-encoding"
  in the hope to convey better what the attribute is about.

* I rebased the series to Git 2.16 and removed Torsten's patch as he
  posted the patch on his own.

* Fix documentation wording. (Torsten)

* A macro was used in a commit before it's introduction. Fixed!(Junio)

Thanks,
Lars

   RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
    v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
    v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
    v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/21f4dac5ab
Checkout: git fetch https://github.com/larsxschneider/git encoding-v4 && git checkout 21f4dac5ab


### Interdiff (v3-rebased-2.16..v4):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 1bc03e69cb..a8dbf4be30 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,8 +272,8 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.


-`checkout-encoding`
-^^^^^^^^^^^^^^^^^^^
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^

 Git recognizes files encoded with ASCII or one of its supersets (e.g.
 UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
@@ -281,17 +281,17 @@ interpreted as binary and consequently built-in Git text processing
 tools (e.g. 'git diff') as well as most Git web front ends do not
 visualize the content.

-In these cases you can teach Git the encoding of a file in the working
-directory with the `checkout-encoding` attribute. If a file with this
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
 attributes is added to Git, then Git reencodes the content from the
 specified encoding to UTF-8 and stores the result in its internal data
 structure (called "the index"). On checkout the content is encoded
 back to the specified encoding.

-Please note that using the `checkout-encoding` attribute may have a
+Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Git clients that do not support the `checkout-encoding` attribute
+- Git clients that do not support the `working-tree-encoding` attribute
   will checkout the respective files UTF-8 encoded and not in the
   expected encoding. Consequently, these files will appear different
   which typically causes trouble. This is in particular the case for
@@ -304,7 +304,7 @@ number of pitfalls:
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').

-Use the `checkout-encoding` attribute only if you cannot store a file in
+Use the `working-tree-encoding` attribute only if you cannot store a file in
 UTF-8 encoding and if you want Git to be able to process the content as
 text.

@@ -313,7 +313,7 @@ with byte order mark (BOM) and you want Git to perform automatic line
 ending conversion based on your platform.

 ------------------------
-*.txt    text checkout-encoding=UTF-16
+*.txt    text working-tree-encoding=UTF-16
 ------------------------

 Use the following attributes if your '*.txt' files are UTF-16 little
@@ -321,7 +321,7 @@ endian encoded without BOM and you want Git to use Windows line endings
 in the working directory.

 ------------------------
-*.txt    checkout-encoding=UTF-16LE text eol=CRLF
+*.txt    working-tree-encoding=UTF-16LE text eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
diff --git a/convert.c b/convert.c
index 8559651b3f..13fad490ce 100644
--- a/convert.c
+++ b/convert.c
@@ -323,7 +323,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
    const char *advise_msg = _(
      "You told Git to treat '%s' as %s. A byte order mark "
      "(BOM) is prohibited with this encoding. Either use "
-     "%.6s as checkout encoding or remove the BOM from the "
+     "%.6s as working tree encoding or remove the BOM from the "
      "file.");

    advise(advise_msg, path, enc->name, enc->name, enc->name);
@@ -339,7 +339,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
    const char *advise_msg = _(
      "You told Git to treat '%s' as %s. A byte order mark "
      "(BOM) is required with this encoding. Either use "
-     "%sBE/%sLE as checkout encoding or add a BOM to the "
+     "%sBE/%sLE as working tree encoding or add a BOM to the "
      "file.");
    advise(advise_msg, path, enc->name, enc->name, enc->name);
    if (conv_flags & CONV_WRITE_OBJECT)
@@ -1237,7 +1237,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)

  if (!check) {
    check = attr_check_initl("crlf", "ident", "filter",
-          "eol", "text", "checkout-encoding",
+          "eol", "text", "working-tree-encoding",
           NULL);
    user_convert_tail = &user_convert;
    encoding_tail = &encoding;
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-working-tree-encoding.sh
similarity index 89%
rename from t/t0028-checkout-encoding.sh
rename to t/t0028-working-tree-encoding.sh
index 5f1c911c07..0f36d4990a 100755
--- a/t/t0028-checkout-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -1,6 +1,6 @@
 #!/bin/sh

-test_description='checkout-encoding conversion via gitattributes'
+test_description='working-tree-encoding conversion via gitattributes'

 . ./test-lib.sh

@@ -10,7 +10,7 @@ test_expect_success 'setup test repo' '
  git config core.eol lf &&

  text="hallo there!\ncan you read me?" &&
- echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
+ echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
  printf "$text" >test.utf8.raw &&
  printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
  cp test.utf16.raw test.utf16 &&
@@ -45,10 +45,10 @@ test_expect_success 'check prohibited UTF BOM' '
  printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
  printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&

- echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
- echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
- echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
- echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
+ echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+ echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+ echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+ echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&

  # Here we add a UTF-16 files with BOM (big-endian and little-endian)
  # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
@@ -91,7 +91,7 @@ test_expect_success 'check prohibited UTF BOM' '
 '

 test_expect_success 'check required UTF BOM' '
- echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
+ echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&

  cp nobom.utf16be.raw nobom.utf16 &&
  test_must_fail git add nobom.utf16 2>err.out &&
@@ -143,11 +143,11 @@ test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '

 test_expect_success 'check unsupported encodings' '

- echo "*.nothing text checkout-encoding=" >>.gitattributes &&
+ echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
  printf "nothing" >t.nothing &&
  git add t.nothing &&

- echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
+ echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
  printf "garbage" >t.garbage &&
  test_must_fail git add t.garbage 2>err.out &&
  test_i18ngrep "fatal: failed to encode" err.out &&
@@ -161,7 +161,7 @@ test_expect_success 'error if encoding round trip is not the same during refresh
  BEFORE_STATE=$(git rev-parse HEAD) &&

  # Skip the UTF-16 filter for the added file
- # This simulates a Git version that has no checkoutEncoding support
+ # This simulates a Git version that has no working tree encoding support
  echo "hallo" >nonsense.utf16 &&
  TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
  git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&



### Patches

Lars Schneider (6):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute

 Documentation/gitattributes.txt  |  60 +++++++++++
 convert.c                        | 218 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 ++-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 198 +++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 +++++++
 utf8.h                           |  25 +++++
 9 files changed, 552 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 1/6] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
@ 2018-01-20 15:24 ` lars.schneider
  2018-01-20 15:24 ` [PATCH v4 2/6] strbuf: add xstrdup_toupper() lars.schneider
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 2/6] strbuf: add xstrdup_toupper()
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
  2018-01-20 15:24 ` [PATCH v4 1/6] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-01-20 15:24 ` lars.schneider
  2018-01-20 15:24 ` [PATCH v4 3/6] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 3/6] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
  2018-01-20 15:24 ` [PATCH v4 1/6] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-01-20 15:24 ` [PATCH v4 2/6] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-01-20 15:24 ` lars.schneider
  2018-01-20 15:24 ` [PATCH v4 4/6] utf8: add function to detect a missing " lars.schneider
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 4/6] utf8: add function to detect a missing UTF-16/32 BOM
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-01-20 15:24 ` [PATCH v4 3/6] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-01-20 15:24 ` lars.schneider
  2018-01-20 15:24 ` [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute lars.schneider
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..f033fec1c2 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..26b5e91852 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-01-20 15:24 ` [PATCH v4 4/6] utf8: add function to detect a missing " lars.schneider
@ 2018-01-20 15:24 ` lars.schneider
  2018-01-21 14:22   ` Simon Ruderich
                     ` (2 more replies)
  2018-01-20 15:24 ` [PATCH v4 6/6] " lars.schneider
  2018-01-23 20:19 ` [PATCH v4 0/6] convert: add support for different encodings Torsten Bögershausen
  6 siblings, 3 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt  |  60 ++++++++++++
 convert.c                        | 190 ++++++++++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 196 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..a8dbf4be30 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt 		working-tree-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..0c372069b1 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as working tree encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as working tree encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	/*
+	 * UTF supports lossless round tripping [1]. UTF to other encoding are
+	 * mostly round trip safe as Unicode aims to be a superset of all other
+	 * character encodings. However, the SHIFT-JIS (Japanese character set)
+	 * is an exception as some codes are not round trip safe [2].
+	 *
+	 * Reverse the transformation of 'dst' and check the result with 'src'
+	 * if content is written to Git. This ensures no information is lost
+	 * during conversion to/from UTF-8.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without iconv error.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			if (conv_flags & CONV_WRITE_OBJECT)
+				die(msg, path, enc->name, default_encoding);
+			else
+				error(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1120,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1200,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1209,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1064,6 +1234,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1315,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1345,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1377,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1849,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.checkout_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada9..e2f319d677 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 0000000000..4d85b42776
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,196 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no working tree encoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v4 6/6] convert: add tracing for 'working-tree-encoding' attribute
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-01-20 15:24 ` [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-01-20 15:24 ` lars.schneider
  2018-01-23 20:19 ` [PATCH v4 0/6] convert: add support for different encodings Torsten Bögershausen
  6 siblings, 0 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-20 15:24 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_CHECKOUT_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 28 ++++++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/convert.c b/convert.c
index 0c372069b1..13fad490ce 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	/*
 	 * UTF supports lossless round tripping [1]. UTF to other encoding are
@@ -365,6 +390,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 					     enc->name, default_encoding,
 					     &re_src_len);
 
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
 		if (!re_src || src_len != re_src_len ||
 		    memcmp(src, re_src, src_len)) {
 			const char* msg = _("encoding '%s' from %s to %s and "
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 4d85b42776..0f36d4990a 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
 test_expect_success 'setup test repo' '
 	git config core.eol lf &&
 
-- 
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-20 15:24 ` [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-01-21 14:22   ` Simon Ruderich
  2018-01-22 12:35     ` Lars Schneider
  2018-01-22 18:00   ` SQUASH convert: add tracing for " lars.schneider
  2018-01-23 10:17   ` [PATCH v2] " lars.schneider
  2 siblings, 1 reply; 43+ messages in thread
From: Simon Ruderich @ 2018-01-21 14:22 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

On Sat, Jan 20, 2018 at 04:24:17PM +0100, lars.schneider@autodesk.com wrote:
> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> +{
> +	const char *value = check->value;
> +	struct encoding *enc;
> +
> +	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +	    !strlen(value))
> +		return NULL;
> +
> +	for (enc = encoding; enc; enc = enc->next)
> +		if (!strcasecmp(value, enc->name))
> +			return enc;
> +
> +	/* Don't encode to the default encoding */
> +	if (!strcasecmp(value, default_encoding))
> +		return NULL;
> +
> +	enc = xcalloc(1, sizeof(struct convert_driver));

I think this should be "sizeof(struct encoding)" but I prefer
"sizeof(*enc)" which prevents these kind of mistakes.

> +	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */

"aways" -> "always" and I think the comment should say why
uppercase is important.

> +test_expect_success 'ensure UTF-8 is stored in Git' '
> +	git cat-file -p :test.utf16 >test.utf16.git &&
> +	test_cmp_bin test.utf8.raw test.utf16.git &&
> +	rm test.utf8.raw test.utf16.git
> +'
> +
> +test_expect_success 're-encode to UTF-16 on checkout' '
> +	rm test.utf16 &&
> +	git checkout test.utf16 &&
> +	test_cmp_bin test.utf16.raw test.utf16 &&
> +
> +	# cleanup
> +	rm test.utf16.raw

Micro-nit: For consistency with the previous test, remove the
empty line and comment (or just keep the files generated from the
"setup test repo" phase and don't explicitly delete them)?

Regards
Simon
-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-21 14:22   ` Simon Ruderich
@ 2018-01-22 12:35     ` Lars Schneider
  2018-01-23  0:54       ` Jeff King
  2018-01-23  9:27       ` Simon Ruderich
  0 siblings, 2 replies; 43+ messages in thread
From: Lars Schneider @ 2018-01-22 12:35 UTC (permalink / raw)
  To: Simon Ruderich
  Cc: lars.schneider, git, gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 21 Jan 2018, at 15:22, Simon Ruderich <simon@ruderich.org> wrote:
> 
> On Sat, Jan 20, 2018 at 04:24:17PM +0100, lars.schneider@autodesk.com wrote:
>> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +	const char *value = check->value;
>> +	struct encoding *enc;
>> +
>> +	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +	    !strlen(value))
>> +		return NULL;
>> +
>> +	for (enc = encoding; enc; enc = enc->next)
>> +		if (!strcasecmp(value, enc->name))
>> +			return enc;
>> +
>> +	/* Don't encode to the default encoding */
>> +	if (!strcasecmp(value, default_encoding))
>> +		return NULL;
>> +
>> +	enc = xcalloc(1, sizeof(struct convert_driver));
> 
> I think this should be "sizeof(struct encoding)" but I prefer
> "sizeof(*enc)" which prevents these kind of mistakes.

Great catch! Thank you!

Other code paths are at risk of this problem too. Consider this:

$ git grep 'sizeof(\*' | wc -l
     303
$ git grep 'sizeof(struct ' | wc -l
     208

E.g. even in the same file (likely where I got the code from):
https://github.com/git/git/blob/59c276cf4da0705064c32c9dba54baefa282ea55/convert.c#L780

@Junio: 
Would you welcome a patch that replaces "struct foo" with "*foo"
if applicable?


>> +	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
> 
> "aways" -> "always" and I think the comment should say why
> uppercase is important.

Would that be better?

	/* Aways use upper case names to simplify subsequent string comparison. */
	enc->name = xstrdup_toupper(value);

AFAIK uppercase and lowercase names are both valid. I just wanted to
ensure that we use one consistent casing. That reads better in error messages
and I don't need to check for the letter case in has_prohibited_utf_bom()
and friends in utf8.c


>> +test_expect_success 'ensure UTF-8 is stored in Git' '
>> +	git cat-file -p :test.utf16 >test.utf16.git &&
>> +	test_cmp_bin test.utf8.raw test.utf16.git &&
>> +	rm test.utf8.raw test.utf16.git
>> +'
>> +
>> +test_expect_success 're-encode to UTF-16 on checkout' '
>> +	rm test.utf16 &&
>> +	git checkout test.utf16 &&
>> +	test_cmp_bin test.utf16.raw test.utf16 &&
>> +
>> +	# cleanup
>> +	rm test.utf16.raw
> 
> Micro-nit: For consistency with the previous test, remove the
> empty line and comment (or just keep the files generated from the
> "setup test repo" phase and don't explicitly delete them)?

I would rather add a new line and a comment to the previous test 
to be consistent.

I know we could leave the files but these lingering files could
always surprise writers of future tests (at least they surprised
me in other tests).


Thank you very much for the review,
Lars

^ permalink raw reply	[flat|nested] 43+ messages in thread

* SQUASH convert: add tracing for 'working-tree-encoding' attribute
  2018-01-20 15:24 ` [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute lars.schneider
  2018-01-21 14:22   ` Simon Ruderich
@ 2018-01-22 18:00   ` lars.schneider
  2018-01-22 19:37     ` Eric Sunshine
  2018-01-23 10:17   ` [PATCH v2] " lars.schneider
  2 siblings, 1 reply; 43+ messages in thread
From: lars.schneider @ 2018-01-22 18:00 UTC (permalink / raw)
  To: gitster
  Cc: Johannes.Schindelin, git, simon, j6t, larsxschneider, peff,
	ramsay, sunshine, tboegi

From: Lars Schneider <larsxschneider@gmail.com>

Hi Junio,

this attached patch addresses Simon's review comments.

Can you squash the patch if you apply "[PATCH v4 5/6] convert: add
'working-tree-encoding' attribute"?

https://public-inbox.org/git/20180120152418.52859-6-lars.schneider@autodesk.com/

Thanks,
Lars


 convert.c                        | 5 +++--
 t/t0028-working-tree-encoding.sh | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 13fad490ce..f5e0cfc352 100644
--- a/convert.c
+++ b/convert.c
@@ -1165,8 +1165,9 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
 	if (!strcasecmp(value, default_encoding))
 		return NULL;

-	enc = xcalloc(1, sizeof(struct convert_driver));
-	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	enc = xcalloc(1, sizeof(*enc));
+	/* Aways use upper case names to simplify subsequent string comparison. */
+	enc->name = xstrdup_toupper(value);
 	*encoding_tail = enc;
 	encoding_tail = &(enc->next);

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 0f36d4990a..a6da61280d 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -22,6 +22,8 @@ test_expect_success 'setup test repo' '
 test_expect_success 'ensure UTF-8 is stored in Git' '
 	git cat-file -p :test.utf16 >test.utf16.git &&
 	test_cmp_bin test.utf8.raw test.utf16.git &&
+
+	# cleanup
 	rm test.utf8.raw test.utf16.git
 '


base-commit: 21f4dac5aba07a6109285c57a0478bf502e09009
--
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: SQUASH convert: add tracing for 'working-tree-encoding' attribute
  2018-01-22 18:00   ` SQUASH convert: add tracing for " lars.schneider
@ 2018-01-22 19:37     ` Eric Sunshine
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Sunshine @ 2018-01-22 19:37 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, Johannes Schindelin, Git List, simon,
	Johannes Sixt, Lars Schneider, Jeff King, Ramsay Jones,
	Torsten Bögershausen

On Mon, Jan 22, 2018 at 1:00 PM,  <lars.schneider@autodesk.com> wrote:
> diff --git a/convert.c b/convert.c
> @@ -1165,8 +1165,9 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> -       enc = xcalloc(1, sizeof(struct convert_driver));
> -       enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
> +       enc = xcalloc(1, sizeof(*enc));
> +       /* Aways use upper case names to simplify subsequent string comparison. */

s/Aways/Always/

https://public-inbox.org/git/20180121142222.GA10248@ruderich.org/

> +       enc->name = xstrdup_toupper(value);
>         *encoding_tail = enc;
>         encoding_tail = &(enc->next);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-22 12:35     ` Lars Schneider
@ 2018-01-23  0:54       ` Jeff King
  2018-01-23 10:25         ` Simon Ruderich
  2018-01-23  9:27       ` Simon Ruderich
  1 sibling, 1 reply; 43+ messages in thread
From: Jeff King @ 2018-01-23  0:54 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Simon Ruderich, lars.schneider, git, gitster, tboegi, j6t,
	sunshine, ramsay, Johannes.Schindelin

On Mon, Jan 22, 2018 at 01:35:25PM +0100, Lars Schneider wrote:

> >> +	enc = xcalloc(1, sizeof(struct convert_driver));
> > 
> > I think this should be "sizeof(struct encoding)" but I prefer
> > "sizeof(*enc)" which prevents these kind of mistakes.
> 
> Great catch! Thank you!
> 
> Other code paths are at risk of this problem too. Consider this:
> 
> $ git grep 'sizeof(\*' | wc -l
>      303
> $ git grep 'sizeof(struct ' | wc -l
>      208
> 
> E.g. even in the same file (likely where I got the code from):
> https://github.com/git/git/blob/59c276cf4da0705064c32c9dba54baefa282ea55/convert.c#L780
> 
> @Junio: 
> Would you welcome a patch that replaces "struct foo" with "*foo"
> if applicable?

This is part of the reason we've been moving to helpers like
ALLOC_ARRAY(), which make it harder to get this wrong.

We don't have an ALLOC_OBJECT(), which is what you would want here. I'm
not sure if that is helpful or crossing the line of "you're obscuring it
to the point that people familiar with C have trouble reading the code".
The ALLOC_ARRAY() macros have been sort of an experiment there (I tend
to like them, but I also work with Git's code often enough that I am not
likely to be confused by our bespoke macros).

But anyway, that was a bit of a tangent. Certainly the smaller change is
just standardizing on sizeof(*foo), which I think most people agree on
at this point. It might be worth putting in CodingGuidelines.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-22 12:35     ` Lars Schneider
  2018-01-23  0:54       ` Jeff King
@ 2018-01-23  9:27       ` Simon Ruderich
  1 sibling, 0 replies; 43+ messages in thread
From: Simon Ruderich @ 2018-01-23  9:27 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

On Mon, Jan 22, 2018 at 01:35:25PM +0100, Lars Schneider wrote:
>>> +	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
>>
>> "aways" -> "always" and I think the comment should say why
>> uppercase is important.
>
> Would that be better?
>
> 	/* Aways use upper case names to simplify subsequent string comparison. */
> 	enc->name = xstrdup_toupper(value);
>
> AFAIK uppercase and lowercase names are both valid. I just wanted to
> ensure that we use one consistent casing. That reads better in error messages
> and I don't need to check for the letter case in has_prohibited_utf_bom()
> and friends in utf8.c

Sounds good (minus the "Aways" typo Eric already noticed).

>> Micro-nit: For consistency with the previous test, remove the
>> empty line and comment (or just keep the files generated from the
>> "setup test repo" phase and don't explicitly delete them)?
>
> I would rather add a new line and a comment to the previous test
> to be consistent.
>
> I know we could leave the files but these lingering files could
> always surprise writers of future tests (at least they surprised
> me in other tests).

Sure, that sounds good. Just noticed the inconsistency and wanted
to mention it.

Regards
Simon
-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2] SQUASH convert: add tracing for 'working-tree-encoding' attribute
  2018-01-20 15:24 ` [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute lars.schneider
  2018-01-21 14:22   ` Simon Ruderich
  2018-01-22 18:00   ` SQUASH convert: add tracing for " lars.schneider
@ 2018-01-23 10:17   ` lars.schneider
  2 siblings, 0 replies; 43+ messages in thread
From: lars.schneider @ 2018-01-23 10:17 UTC (permalink / raw)
  To: gitster
  Cc: Johannes.Schindelin, git, simon, j6t, larsxschneider, peff,
	ramsay, sunshine, tboegi

From: Lars Schneider <larsxschneider@gmail.com>

Hi Junio,

I overlooked a typo pointed out in Simon's review. Here is a new patch
for squashing. Sorry for the trouble!

@Eric: Thanks for spotting this!

Cheers,
Lars


 convert.c                        | 8 ++++++--
 t/t0028-working-tree-encoding.sh | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 13fad490ce..dce2f6e201 100644
--- a/convert.c
+++ b/convert.c
@@ -1165,8 +1165,12 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
 	if (!strcasecmp(value, default_encoding))
 		return NULL;

-	enc = xcalloc(1, sizeof(struct convert_driver));
-	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	enc = xcalloc(1, sizeof(*enc));
+	/*
+	 * Ensure encoding names are always upper case (e.g. UTF-8) to
+	 * simplify subsequent string comparisons.
+	 */
+	enc->name = xstrdup_toupper(value);
 	*encoding_tail = enc;
 	encoding_tail = &(enc->next);

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 0f36d4990a..a6da61280d 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -22,6 +22,8 @@ test_expect_success 'setup test repo' '
 test_expect_success 'ensure UTF-8 is stored in Git' '
 	git cat-file -p :test.utf16 >test.utf16.git &&
 	test_cmp_bin test.utf8.raw test.utf16.git &&
+
+	# cleanup
 	rm test.utf8.raw test.utf16.git
 '


base-commit: 21f4dac5aba07a6109285c57a0478bf502e09009
--
2.16.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-23  0:54       ` Jeff King
@ 2018-01-23 10:25         ` Simon Ruderich
  2018-01-23 16:20           ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Simon Ruderich @ 2018-01-23 10:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Lars Schneider, lars.schneider, git, gitster, tboegi, j6t,
	sunshine, ramsay, Johannes.Schindelin

On Mon, Jan 22, 2018 at 07:54:01PM -0500, Jeff King wrote:
> But anyway, that was a bit of a tangent. Certainly the smaller change is
> just standardizing on sizeof(*foo), which I think most people agree on
> at this point. It might be worth putting in CodingGuidelines.

Personally I prefer sizeof(*foo) which is a well non-idiom, used
in many projects and IMHO easy to read and understand.

I've played a little with coccinelle and the following spatch
seems to catch many occurrences of sizeof(struct ..) (the first
hunk seems to expand multiple times causing conflicts in the
generated patch). Cases like a->b = xcalloc() are not matched, I
don't know enough coccinelle for that. If there's interest I
could prepare patches, but it will create quite some code churn.

Regards
Simon

@@
type T;
identifier x;
@@
- T *x = xmalloc(sizeof(T));
+ T *x = xmalloc(sizeof(*x));

@@
type T;
T *x;
@@
- x = xmalloc(sizeof(T));
+ x = xmalloc(sizeof(*x));


@@
type T;
identifier x;
expression n;
@@
- T *x = xcalloc(n, sizeof(T));
+ T *x = xcalloc(n, sizeof(*x));

@@
type T;
T *x;
expression n;
@@
- x = xcalloc(n, sizeof(T));
+ x = xcalloc(n, sizeof(*x));


@@
type T;
T x;
@@
- memset(&x, 0, sizeof(T));
+ memset(&x, 0, sizeof(x));

@@
type T;
T *x;
@@
- memset(x, 0, sizeof(T));
+ memset(x, 0, sizeof(*x));

-- 
+ privacy is necessary
+ using gnupg http://gnupg.org
+ public key id: 0x92FEFDB7E44C32F9

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-23 10:25         ` Simon Ruderich
@ 2018-01-23 16:20           ` Jeff King
  2018-01-23 21:12             ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2018-01-23 16:20 UTC (permalink / raw)
  To: Simon Ruderich
  Cc: Lars Schneider, lars.schneider, git, gitster, tboegi, j6t,
	sunshine, ramsay, Johannes.Schindelin

On Tue, Jan 23, 2018 at 11:25:58AM +0100, Simon Ruderich wrote:

> On Mon, Jan 22, 2018 at 07:54:01PM -0500, Jeff King wrote:
> > But anyway, that was a bit of a tangent. Certainly the smaller change is
> > just standardizing on sizeof(*foo), which I think most people agree on
> > at this point. It might be worth putting in CodingGuidelines.
> 
> Personally I prefer sizeof(*foo) which is a well non-idiom, used
> in many projects and IMHO easy to read and understand.

Me too.

> I've played a little with coccinelle and the following spatch
> seems to catch many occurrences of sizeof(struct ..) (the first
> hunk seems to expand multiple times causing conflicts in the
> generated patch). Cases like a->b = xcalloc() are not matched, I
> don't know enough coccinelle for that. If there's interest I
> could prepare patches, but it will create quite some code churn.

Yeah, I'm not sure what our current policy is there. Traditionally our
strategy was not to churn code, but to update old idioms as they were
touched. Especially if the change was not urgent, but mostly stylistic
(which this one is).

But with Coccinelle, it's a lot easier to apply the change tree-wide, and
to convert topics in flight as they get merged. The maintainer still
gets conflicts with topics-in-flight that touch converted areas, though.
So I'd be curious to hear if Junio's opinion has changed at all.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 0/6]  convert: add support for different encodings
  2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-01-20 15:24 ` [PATCH v4 6/6] " lars.schneider
@ 2018-01-23 20:19 ` Torsten Bögershausen
  2018-01-23 20:53   ` Junio C Hamano
  6 siblings, 1 reply; 43+ messages in thread
From: Torsten Bögershausen @ 2018-01-23 20:19 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, gitster, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

On Sat, Jan 20, 2018 at 04:24:12PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Hi,
> 
> Patches 1-4 and 6 are preparation and helper functions.
> Patch 5 is the actual change.

I (still) have 2 remarks on convert.c - to make live easier,
I will send a small "on top" patch the next days.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 0/6]  convert: add support for different encodings
  2018-01-23 20:19 ` [PATCH v4 0/6] convert: add support for different encodings Torsten Bögershausen
@ 2018-01-23 20:53   ` Junio C Hamano
  2018-01-29 20:18     ` [PATCH v5 0/7] " tboegi
                       ` (7 more replies)
  0 siblings, 8 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-01-23 20:53 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

Torsten Bögershausen <tboegi@web.de> writes:

> On Sat, Jan 20, 2018 at 04:24:12PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Hi,
>> 
>> Patches 1-4 and 6 are preparation and helper functions.
>> Patch 5 is the actual change.
>
> I (still) have 2 remarks on convert.c - to make live easier,
> I will send a small "on top" patch the next days.

Thanks, both.  I'll stay on the sideline ;-) and deal with other
topics first.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute
  2018-01-23 16:20           ` Jeff King
@ 2018-01-23 21:12             ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-01-23 21:12 UTC (permalink / raw)
  To: Jeff King
  Cc: Simon Ruderich, Lars Schneider, lars.schneider, git, tboegi, j6t,
	sunshine, ramsay, Johannes.Schindelin

Jeff King <peff@peff.net> writes:

> But with Coccinelle, it's a lot easier to apply the change tree-wide, and
> to convert topics in flight as they get merged. The maintainer still
> gets conflicts with topics-in-flight that touch converted areas, though.
> So I'd be curious to hear if Junio's opinion has changed at all.

There are two distinct kinds of cost on such a tree-wide change.
Conflicts with in-flight topic cannot be avoided other than truly
avoiding, i.e. refraining from touching the areas in flux, but it is
primarily what the maintainer does, and with help with rerere it can
be reasonably well automated ;-)

But the cost of reviewing could become a lot smaller when our tools
are trustworthy.  As long as we can be reasonably certain that the
tree-wide change patch does one thing it is intended to do and
nothing else (e.g. comes with mechanical reproduction recipe that
allows the patch to be independently audited), I do not have much
problem with such a clean-up.

The "avoid tree-wide change" rule still applies for things that
allows a lot of subjective judgment and discretion.  I do not know
of a good way to reduce reviewer costs on those kind of changes.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v5 0/7]  convert: add support for different encodings
  2018-01-23 20:53   ` Junio C Hamano
@ 2018-01-29 20:18     ` tboegi
  2018-01-29 20:18     ` [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() tboegi
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:18 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Take V4 from Lars, manually integrated the V2 squash patch,
so a review would be good.

Add my "comments" as a patch, see 7/7 (and this is more like an RFC)

This needs to go on top of tb/crlf-conv-flags


Lars Schneider (6):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute

Torsten Bögershausen (1):
  Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

 Documentation/gitattributes.txt  |  63 +++++++++++
 convert.c                        | 233 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 ++-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 198 +++++++++++++++++++++++++++++++++
 utf8.c                           |  37 +++++++
 utf8.h                           |  25 +++++
 9 files changed, 567 insertions(+), 6 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-01-23 20:53   ` Junio C Hamano
  2018-01-29 20:18     ` [PATCH v5 0/7] " tboegi
@ 2018-01-29 20:18     ` tboegi
  2018-01-29 20:19     ` [PATCH v5 2/7] strbuf: add xstrdup_toupper() tboegi
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:18 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Lars Schneider, Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 8007be8fb..490f7850e 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v5 2/7] strbuf: add xstrdup_toupper()
  2018-01-23 20:53   ` Junio C Hamano
  2018-01-29 20:18     ` [PATCH v5 0/7] " tboegi
  2018-01-29 20:18     ` [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() tboegi
@ 2018-01-29 20:19     ` tboegi
  2018-01-29 20:19     ` [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Lars Schneider, Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 490f7850e..a20af696b 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d6..df7ced53e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-01-23 20:53   ` Junio C Hamano
                       ` (2 preceding siblings ...)
  2018-01-29 20:19     ` [PATCH v5 2/7] strbuf: add xstrdup_toupper() tboegi
@ 2018-01-29 20:19     ` tboegi
  2018-01-29 20:19     ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Lars Schneider, Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce013..914881cd1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a8..4711429af 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-01-23 20:53   ` Junio C Hamano
                       ` (3 preceding siblings ...)
  2018-01-29 20:19     ` [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
@ 2018-01-29 20:19     ` tboegi
  2018-01-30 19:15       ` Junio C Hamano
  2018-01-29 20:19     ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Lars Schneider, Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1..f033fec1c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af..26b5e9185 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
  2018-01-23 20:53   ` Junio C Hamano
                       ` (4 preceding siblings ...)
  2018-01-29 20:19     ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
@ 2018-01-29 20:19     ` tboegi
  2018-01-30 20:05       ` Junio C Hamano
  2018-01-29 20:19     ` [PATCH v5 6/7] convert: add tracing for " tboegi
  2018-01-29 20:19     ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
  7 siblings, 1 reply; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Lars Schneider, Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 Documentation/gitattributes.txt  |  60 ++++++++++++
 convert.c                        | 190 ++++++++++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 196 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81..a8dbf4be3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt 		working-tree-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968..0c372069b 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as working tree encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as working tree encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	/*
+	 * UTF supports lossless round tripping [1]. UTF to other encoding are
+	 * mostly round trip safe as Unicode aims to be a superset of all other
+	 * character encodings. However, the SHIFT-JIS (Japanese character set)
+	 * is an exception as some codes are not round trip safe [2].
+	 *
+	 * Reverse the transformation of 'dst' and check the result with 'src'
+	 * if content is written to Git. This ensures no information is lost
+	 * during conversion to/from UTF-8.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without iconv error.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			if (conv_flags & CONV_WRITE_OBJECT)
+				die(msg, path, enc->name, default_encoding);
+			else
+				error(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1120,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1200,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1209,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1064,6 +1234,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1315,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1345,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1377,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1849,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.checkout_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e516..1d9539ed0 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada..e2f319d67 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 000000000..4d85b4277
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,196 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no working tree encoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v5 6/7] convert: add tracing for 'working-tree-encoding' attribute
  2018-01-23 20:53   ` Junio C Hamano
                       ` (5 preceding siblings ...)
  2018-01-29 20:19     ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
@ 2018-01-29 20:19     ` tboegi
  2018-01-29 20:19     ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
  7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Lars Schneider, Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_CHECKOUT_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 convert.c                        | 28 ++++++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/convert.c b/convert.c
index 0c372069b..13fad490c 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	/*
 	 * UTF supports lossless round tripping [1]. UTF to other encoding are
@@ -365,6 +390,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 					     enc->name, default_encoding,
 					     &re_src_len);
 
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
 		if (!re_src || src_len != re_src_len ||
 		    memcmp(src, re_src, src_len)) {
 			const char* msg = _("encoding '%s' from %s to %s and "
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 4d85b4277..0f36d4990 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
 test_expect_success 'setup test repo' '
 	git config core.eol lf &&
 
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-23 20:53   ` Junio C Hamano
                       ` (6 preceding siblings ...)
  2018-01-29 20:19     ` [PATCH v5 6/7] convert: add tracing for " tboegi
@ 2018-01-29 20:19     ` tboegi
  2018-01-30 11:23       ` Lars Schneider
  7 siblings, 1 reply; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
  To: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

UTF-16 encoded files are treated as "binary" by Git, and no CRLF
conversion is done.
When the UTF-16 encoded files are converted into UF-8 using the new
"working-tree-encoding", the CRLF are converted if core.autocrlf is true.

This may lead to confusion:
A tool writes an UTF-16 encoded file with CRLF.
The file is commited with core.autocrlf=true, the CLRF are converted into LF.
The repo is pushed somewhere and cloned by a different user, who has
decided to use core.autocrlf=false.
He uses the same tool, and now the CRLF are not there as expected, but LF,
make the file useless for the tool.

Avoid this (possible) confusion by ignoring core.autocrlf for all files
which have "working-tree-encoding" defined.

The user can still use a .gitattributes file and specify the line endings
like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
file travel together with push and clone.

Change convert.c to e more careful, simplify the initialization when
attributes are retrived (and none are specified) and update the documentation.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 Documentation/gitattributes.txt |  9 ++++++---
 convert.c                       | 15 ++++++++++++---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a8dbf4be3..3665c4677 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
 UTF-8 encoding and if you want Git to be able to process the content as
 text.
 
+Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
+Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
+
 Use the following attributes if your '*.txt' files are UTF-16 encoded
-with byte order mark (BOM) and you want Git to perform automatic line
-ending conversion based on your platform.
+with byte order mark (BOM) and you want Git to perform line
+ending conversion based on core.eol.
 
 ------------------------
-*.txt		text working-tree-encoding=UTF-16
+*.txt		working-tree-encoding=UTF-16 text
 ------------------------
 
 Use the following attributes if your '*.txt' files are UTF-16 little
diff --git a/convert.c b/convert.c
index 13fad490c..e7f11d1db 100644
--- a/convert.c
+++ b/convert.c
@@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 		}
 		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
-		ca->drv = NULL;
-		ca->crlf_action = CRLF_UNDEFINED;
-		ca->ident = 0;
+		memset(ca, 0, sizeof(*ca));
 	}
 
 	/* Save attr and make a decision for action */
 	ca->attr_action = ca->crlf_action;
 	if (ca->crlf_action == CRLF_TEXT)
 		ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
+	/*
+	 * Often UTF-16 encoded files are read and written by programs which
+	 * really need CRLF, and it is important to keep the CRLF "as is" when
+	 * files are committed with core.autocrlf=true and the repo is pushed.
+	 * The CRLF would be converted into LF when the repo is cloned to
+	 * a machine with core.autocrlf=false.
+	 * Obey the "text" and "eol" attributes and be independent on the
+	 * local core.autocrlf for all "encoded" files.
+	 */
+	if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
+		ca->crlf_action = CRLF_BINARY;
 	if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
 		ca->crlf_action = CRLF_BINARY;
 	if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-29 20:19     ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
@ 2018-01-30 11:23       ` Lars Schneider
  2018-01-30 14:40         ` Torsten Bögershausen
  0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 11:23 UTC (permalink / raw)
  To: tboegi
  Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin


> On 29 Jan 2018, at 21:19, tboegi@web.de wrote:
> 
> From: Torsten Bögershausen <tboegi@web.de>
> 
> UTF-16 encoded files are treated as "binary" by Git, and no CRLF
> conversion is done.
> When the UTF-16 encoded files are converted into UF-8 using the new
s/UF-8/UTF-8/


> "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
> 
> This may lead to confusion:
> A tool writes an UTF-16 encoded file with CRLF.
> The file is commited with core.autocrlf=true, the CLRF are converted into LF.
> The repo is pushed somewhere and cloned by a different user, who has
> decided to use core.autocrlf=false.
> He uses the same tool, and now the CRLF are not there as expected, but LF,
> make the file useless for the tool.
> 
> Avoid this (possible) confusion by ignoring core.autocrlf for all files
> which have "working-tree-encoding" defined.

Maybe I don't understand your use case but I think this will generate even 
more confusion because that's not what I would expect as a user. I think Git 
should behave consistently independent of the used encoding. Here are my arguments:

  (1) Legacy users are *not* affected. If you don't use the "working-tree-encoding"
      attribute then nothing changes for you.

  (2) If you use the "working-tree-encoding" attribute *and* you want to ensure 
      your file keeps CRLF then you can define that in the attributes too. E.g.:
      
      *.proj textworking-tree-encoding=UTF-16 eol=crlf

- Lars



> The user can still use a .gitattributes file and specify the line endings
> like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
> file travel together with push and clone.
> 
> Change convert.c to e more careful, simplify the initialization when
> attributes are retrived (and none are specified) and update the documentation.
> 
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
> Documentation/gitattributes.txt |  9 ++++++---
> convert.c                       | 15 ++++++++++++---
> 2 files changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index a8dbf4be3..3665c4677 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.
> 
> +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
> +Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
> +
> Use the following attributes if your '*.txt' files are UTF-16 encoded
> -with byte order mark (BOM) and you want Git to perform automatic line
> -ending conversion based on your platform.
> +with byte order mark (BOM) and you want Git to perform line
> +ending conversion based on core.eol.
> 
> ------------------------
> -*.txt		text working-tree-encoding=UTF-16
> +*.txt		working-tree-encoding=UTF-16 text
> ------------------------
> 
> Use the following attributes if your '*.txt' files are UTF-16 little
> diff --git a/convert.c b/convert.c
> index 13fad490c..e7f11d1db 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
> 		}
> 		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
> 	} else {
> -		ca->drv = NULL;
> -		ca->crlf_action = CRLF_UNDEFINED;
> -		ca->ident = 0;
> +		memset(ca, 0, sizeof(*ca));
> 	}
> 
> 	/* Save attr and make a decision for action */
> 	ca->attr_action = ca->crlf_action;
> 	if (ca->crlf_action == CRLF_TEXT)
> 		ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
> +	/*
> +	 * Often UTF-16 encoded files are read and written by programs which
> +	 * really need CRLF, and it is important to keep the CRLF "as is" when
> +	 * files are committed with core.autocrlf=true and the repo is pushed.
> +	 * The CRLF would be converted into LF when the repo is cloned to
> +	 * a machine with core.autocrlf=false.
> +	 * Obey the "text" and "eol" attributes and be independent on the
> +	 * local core.autocrlf for all "encoded" files.
> +	 */
> +	if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
> +		ca->crlf_action = CRLF_BINARY;
> 	if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
> 		ca->crlf_action = CRLF_BINARY;
> 	if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
> -- 
> 2.16.0.rc0.2.g64d3e4d0cc.dirty
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-30 11:23       ` Lars Schneider
@ 2018-01-30 14:40         ` Torsten Bögershausen
  2018-01-30 15:14           ` Lars Schneider
  0 siblings, 1 reply; 43+ messages in thread
From: Torsten Bögershausen @ 2018-01-30 14:40 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin

On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote:
> 
> > On 29 Jan 2018, at 21:19, tboegi@web.de wrote:
> > 
> > From: Torsten Bögershausen <tboegi@web.de>
> > 
> > UTF-16 encoded files are treated as "binary" by Git, and no CRLF
> > conversion is done.
> > When the UTF-16 encoded files are converted into UF-8 using the new
> s/UF-8/UTF-8/
> 
> 
> > "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
> > 
> > This may lead to confusion:
> > A tool writes an UTF-16 encoded file with CRLF.
> > The file is commited with core.autocrlf=true, the CLRF are converted into LF.
> > The repo is pushed somewhere and cloned by a different user, who has
> > decided to use core.autocrlf=false.
> > He uses the same tool, and now the CRLF are not there as expected, but LF,
> > make the file useless for the tool.
> > 
> > Avoid this (possible) confusion by ignoring core.autocrlf for all files
> > which have "working-tree-encoding" defined.
> 
> Maybe I don't understand your use case but I think this will generate even 
> more confusion because that's not what I would expect as a user. I think Git 
> should behave consistently independent of the used encoding. Here are my arguments:

To start with: I have probably seen too many repos with CRLF messed up.

> 
>   (1) Legacy users are *not* affected. If you don't use the "working-tree-encoding"
>       attribute then nothing changes for you.

People who don't use "working-tree-encoding" are not affected,
I never ment to state that.

I am thinking about people who use "working-tree-encoding" without thinking
about line endings.
Or the ones that have in mind that core.autocrlf=true will leave the
line endings for UTF-16 encoded files as is, but that changes as soon as they
are converted into UTF-8 and the "auto" check is now done
-after- the conversion. I would find that confusing.

> 
>   (2) If you use the "working-tree-encoding" attribute *and* you want to ensure 
>       your file keeps CRLF then you can define that in the attributes too. E.g.:
>       
>       *.proj textworking-tree-encoding=UTF-16 eol=crlf

That is a good one.
If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
make much more sense in Documentation/gitattributes that *.tx
There no text files encoded in UTF-16 wich are called xxx.txt, but those
are non-ideal examples. *.proj makes good sense as an example.


> 
> - Lars
> 
> 
> 
> > The user can still use a .gitattributes file and specify the line endings
> > like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
> > file travel together with push and clone.
> > 
> > Change convert.c to e more careful, simplify the initialization when
> > attributes are retrived (and none are specified) and update the documentation.
> > 
> > Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> > ---
> > Documentation/gitattributes.txt |  9 ++++++---
> > convert.c                       | 15 ++++++++++++---
> > 2 files changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> > index a8dbf4be3..3665c4677 100644
> > --- a/Documentation/gitattributes.txt
> > +++ b/Documentation/gitattributes.txt
> > @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
> > UTF-8 encoding and if you want Git to be able to process the content as
> > text.
> > 
> > +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
> > +Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
> > +
> > Use the following attributes if your '*.txt' files are UTF-16 encoded
> > -with byte order mark (BOM) and you want Git to perform automatic line
> > -ending conversion based on your platform.
> > +with byte order mark (BOM) and you want Git to perform line
> > +ending conversion based on core.eol.
> > 
> > ------------------------
> > -*.txt		text working-tree-encoding=UTF-16
> > +*.txt		working-tree-encoding=UTF-16 text
> > ------------------------
> > 
> > Use the following attributes if your '*.txt' files are UTF-16 little
> > diff --git a/convert.c b/convert.c
> > index 13fad490c..e7f11d1db 100644
> > --- a/convert.c
> > +++ b/convert.c
> > @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
> > 		}
> > 		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
> > 	} else {
> > -		ca->drv = NULL;
> > -		ca->crlf_action = CRLF_UNDEFINED;
> > -		ca->ident = 0;
> > +		memset(ca, 0, sizeof(*ca));
> > 	}
> > 
> > 	/* Save attr and make a decision for action */
> > 	ca->attr_action = ca->crlf_action;
> > 	if (ca->crlf_action == CRLF_TEXT)
> > 		ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
> > +	/*
> > +	 * Often UTF-16 encoded files are read and written by programs which
> > +	 * really need CRLF, and it is important to keep the CRLF "as is" when
> > +	 * files are committed with core.autocrlf=true and the repo is pushed.
> > +	 * The CRLF would be converted into LF when the repo is cloned to
> > +	 * a machine with core.autocrlf=false.
> > +	 * Obey the "text" and "eol" attributes and be independent on the
> > +	 * local core.autocrlf for all "encoded" files.
> > +	 */
> > +	if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
> > +		ca->crlf_action = CRLF_BINARY;
> > 	if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
> > 		ca->crlf_action = CRLF_BINARY;
> > 	if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
> > -- 
> > 2.16.0.rc0.2.g64d3e4d0cc.dirty
> > 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-30 14:40         ` Torsten Bögershausen
@ 2018-01-30 15:14           ` Lars Schneider
  2018-01-31 17:28             ` Torsten Bögershausen
  0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 15:14 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin


> On 30 Jan 2018, at 15:40, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote:
>> 
>>> On 29 Jan 2018, at 21:19, tboegi@web.de wrote:
>>> 
>>> From: Torsten Bögershausen <tboegi@web.de>
>>> 
>>> UTF-16 encoded files are treated as "binary" by Git, and no CRLF
>>> conversion is done.
>>> When the UTF-16 encoded files are converted into UF-8 using the new
>> s/UF-8/UTF-8/
>> 
>> 
>>> "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
>>> 
>>> This may lead to confusion:
>>> A tool writes an UTF-16 encoded file with CRLF.
>>> The file is commited with core.autocrlf=true, the CLRF are converted into LF.
>>> The repo is pushed somewhere and cloned by a different user, who has
>>> decided to use core.autocrlf=false.
>>> He uses the same tool, and now the CRLF are not there as expected, but LF,
>>> make the file useless for the tool.
>>> 
>>> Avoid this (possible) confusion by ignoring core.autocrlf for all files
>>> which have "working-tree-encoding" defined.
>> 
>> Maybe I don't understand your use case but I think this will generate even 
>> more confusion because that's not what I would expect as a user. I think Git 
>> should behave consistently independent of the used encoding. Here are my arguments:
> 
> To start with: I have probably seen too many repos with CRLF messed up.
> 
>> 
>>  (1) Legacy users are *not* affected. If you don't use the "working-tree-encoding"
>>      attribute then nothing changes for you.
> 
> People who don't use "working-tree-encoding" are not affected,
> I never ment to state that.
> 
> I am thinking about people who use "working-tree-encoding" without thinking
> about line endings.
> Or the ones that have in mind that core.autocrlf=true will leave the
> line endings for UTF-16 encoded files as is, but that changes as soon as they
> are converted into UTF-8 and the "auto" check is now done
> -after- the conversion. I would find that confusing.
> 
>> 
>>  (2) If you use the "working-tree-encoding" attribute *and* you want to ensure 
>>      your file keeps CRLF then you can define that in the attributes too. E.g.:
>> 
>>      *.proj textworking-tree-encoding=UTF-16 eol=crlf
> 
> That is a good one.
> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
> make much more sense in Documentation/gitattributes that *.tx
> There no text files encoded in UTF-16 wich are called xxx.txt, but those
> are non-ideal examples. *.proj makes good sense as an example.

OK, I'll do that. Would that fix the problem which this patch tries to address for you?
(I would also explicitly add a paragraph to discuss line endings)

- Lars

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-01-29 20:19     ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
@ 2018-01-30 19:15       ` Junio C Hamano
  2018-01-30 20:58         ` Lars Schneider
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 19:15 UTC (permalink / raw)
  To: tboegi
  Cc: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider, Lars Schneider

tboegi@web.de writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> has_missing_utf_bom() function returns true if a required BOM is
> missing.
>
> The Unicode standard instructs to assume big-endian if there in no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.

I do not have strong opinion on encoding such policy-ish behaviour
as our default, but am I alone to find that "has missing X" is a
confusing name for a helper function?  "is missing X" (or "lacks
X") is a bit more understandable, I guess.

> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +	return (
> +	   !strcmp(enc, "UTF-16") &&
> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +	) || (
> +	   !strcmp(enc, "UTF-32") &&
> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +	);
> +}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
  2018-01-29 20:19     ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
@ 2018-01-30 20:05       ` Junio C Hamano
  2018-01-30 20:31         ` Lars Schneider
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 20:05 UTC (permalink / raw)
  To: tboegi
  Cc: lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, --to=larsxschneider, Lars Schneider

tboegi@web.de writes:

> +	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
> +		char *re_src;
> +		int re_src_len;

I think it is a bad idea to 

 (1) not check without CONV_WRITE_OBJECT here.
 (2) hardcode SJIS and do this always and to SJIS alone.

For (1), a fix would be obvious (and that will resurrect the dead
code below).

For (2), perhaps introduce a multi-value configuration variable
core.checkRoundtripEncoding, whose default value consists of just
SJIS, but allow users to add or clear it?

> +		re_src = reencode_string_len(dst, dst_len,
> +					     enc->name, default_encoding,
> +					     &re_src_len);
> +
> +		if (!re_src || src_len != re_src_len ||
> +		    memcmp(src, re_src, src_len)) {
> +			const char* msg = _("encoding '%s' from %s to %s and "
> +					    "back is not the same");
> +			if (conv_flags & CONV_WRITE_OBJECT)
> +				die(msg, path, enc->name, default_encoding);
> +			else
> +				error(msg, path, enc->name, default_encoding);

The "error" side of this inner if() is dead code, I think.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
  2018-01-30 20:05       ` Junio C Hamano
@ 2018-01-30 20:31         ` Lars Schneider
  2018-01-30 21:56           ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 20:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes.Schindelin


> On 30 Jan 2018, at 21:05, Junio C Hamano <gitster@pobox.com> wrote:
> 
> tboegi@web.de writes:
> 
>> +	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
>> +		char *re_src;
>> +		int re_src_len;
> 
> I think it is a bad idea to 
> 
> (1) not check without CONV_WRITE_OBJECT here.

The idea is to perform the roundtrip check *only* if we 
actually write to Git. In all other cases we don't care
if the encoding roundtrips.

"git checkout" is such a case where we don't care as 
noted by Peff here:
https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/

Do you agree?


> (2) hardcode SJIS and do this always and to SJIS alone.
> 
> ...
> 
> For (2), perhaps introduce a multi-value configuration variable
> core.checkRoundtripEncoding, whose default value consists of just
> SJIS, but allow users to add or clear it?

Well, in that case I would make it simpler and make
core.checkRoundtripEncoding a boolean that applies to all encodings
if enabled. We could make even simpler than that by removing the entire 
roundtrip check. The thing is, I was not able to come up with a
sequence that would not generate a iconv error *and* not round trip.
Would that be ok for you to remove all that roundtrip checking code?


>> +		re_src = reencode_string_len(dst, dst_len,
>> +					     enc->name, default_encoding,
>> +					     &re_src_len);
>> +
>> +		if (!re_src || src_len != re_src_len ||
>> +		    memcmp(src, re_src, src_len)) {
>> +			const char* msg = _("encoding '%s' from %s to %s and "
>> +					    "back is not the same");
>> +			if (conv_flags & CONV_WRITE_OBJECT)
>> +				die(msg, path, enc->name, default_encoding);
>> +			else
>> +				error(msg, path, enc->name, default_encoding);
> 
> The "error" side of this inner if() is dead code, I think.

Good catch. I think this code should go away if we keep the roundtrip
code and you agree with my statement above.


Thanks a lot for the review,
Lars

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-01-30 19:15       ` Junio C Hamano
@ 2018-01-30 20:58         ` Lars Schneider
  2018-01-30 21:54           ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 20:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Lars Schneider, Git List,
	Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin


> On 30 Jan 2018, at 20:15, Junio C Hamano <gitster@pobox.com> wrote:
> 
> tboegi@web.de writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> has_missing_utf_bom() function returns true if a required BOM is
>> missing.
>> 
>> The Unicode standard instructs to assume big-endian if there in no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
> 
> I do not have strong opinion on encoding such policy-ish behaviour
> as our default, but am I alone to find that "has missing X" is a
> confusing name for a helper function?  "is missing X" (or "lacks
> X") is a bit more understandable, I guess.

That might be a german/english translation thingy but I think I get
your point. "has" implies there is something and "missing" implies
there is nothing :)

"is_missing_utf_bom()" might be even a bit unspecific as UTF-8
is usually missing a UTF BOM but the function would still return 
"false". Therefore, "is_missing_required_utf_bom()" might be 
lengthy but should fit.

OK for you?

- Lars


> 
>> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +	return (
>> +	   !strcmp(enc, "UTF-16") &&
>> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +	) || (
>> +	   !strcmp(enc, "UTF-32") &&
>> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +	);
>> +}


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-01-30 20:58         ` Lars Schneider
@ 2018-01-30 21:54           ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 21:54 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, Lars Schneider, Git List,
	Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> "false". Therefore, "is_missing_required_utf_bom()" might be 
> lengthy but should fit.

Thanks, sounds understandable a lot better than the original ;-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
  2018-01-30 20:31         ` Lars Schneider
@ 2018-01-30 21:56           ` Junio C Hamano
  2018-01-31 19:12             ` Lars Schneider
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 21:56 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 30 Jan 2018, at 21:05, Junio C Hamano <gitster@pobox.com> wrote:
>> 
>> tboegi@web.de writes:
>> 
>>> +	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
>>> +		char *re_src;
>>> +		int re_src_len;
>> 
>> I think it is a bad idea to 
>> 
>> (1) not check without CONV_WRITE_OBJECT here.
>
> The idea is to perform the roundtrip check *only* if we 
> actually write to Git. In all other cases we don't care
> if the encoding roundtrips.
>
> "git checkout" is such a case where we don't care as 
> noted by Peff here:
> https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/
>
> Do you agree?

I am not sure why this is special cased and other codepaths have "if
WRITE_OBJECT then die, otherwise error" checks, so no, I do not
agree with your reasoning, at least not yet.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-30 15:14           ` Lars Schneider
@ 2018-01-31 17:28             ` Torsten Bögershausen
  2018-01-31 19:37               ` Lars Schneider
  2018-02-02 19:17               ` Junio C Hamano
  0 siblings, 2 replies; 43+ messages in thread
From: Torsten Bögershausen @ 2018-01-31 17:28 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin

[]
> > That is a good one.
> > If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
> > make much more sense in Documentation/gitattributes that *.tx
> > There no text files encoded in UTF-16 wich are called xxx.txt, but those
> > are non-ideal examples. *.proj makes good sense as an example.
> 
> OK, I'll do that. Would that fix the problem which this patch tries to address for you?
> (I would also explicitly add a paragraph to discuss line endings)

Please let me see the patch first, before I can have a comment.

But back to the more general question:

How should Git handle the line endings of UTF-16 files in the woring-tree,
which are UTF-8 in the index?


There are 2 opposite opionions/user expectations here:

a) They are binary in the working tree, so git should leave the line endings
   as is. (Unless specified otherwise in the .attributes file)
b) They are text files in the index. Git will convert line endings
   if core.autocrlf is true (or the .gitattributes file specifies "-text")

My feeling is that both arguments are valid, so let's ask for opinions
and thoughts of others.
Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody:
What do yo think ?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
  2018-01-30 21:56           ` Junio C Hamano
@ 2018-01-31 19:12             ` Lars Schneider
  2018-01-31 22:45               ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-31 19:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes.Schindelin


> On 30 Jan 2018, at 22:56, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> On 30 Jan 2018, at 21:05, Junio C Hamano <gitster@pobox.com> wrote:
>>> 
>>> tboegi@web.de writes:
>>> 
>>>> +	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
>>>> +		char *re_src;
>>>> +		int re_src_len;
>>> 
>>> I think it is a bad idea to 
>>> 
>>> (1) not check without CONV_WRITE_OBJECT here.
>> 
>> The idea is to perform the roundtrip check *only* if we 
>> actually write to Git. In all other cases we don't care
>> if the encoding roundtrips.
>> 
>> "git checkout" is such a case where we don't care as 
>> noted by Peff here:
>> https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/
>> 
>> Do you agree?
> 
> I am not sure why this is special cased and other codepaths have "if
> WRITE_OBJECT then die, otherwise error" checks, so no, I do not
> agree with your reasoning, at least not yet.

The convert_to_git()/encode_to_git() machinery is used in two different
kinds of code paths:

Some code paths actually write to the Git database (indicated by the 
CONV_WRITE_OBJECT flag). I consider these the "critical/important" code 
paths and I don't want to tolerate any encoding errors in these cases as 
the errors would be "forever" in the Git database. That's why I call 
die() on errors for these cases to abort whatever we are doing. 

Other code paths do not write to the Git database (e.g. during "git 
checkout" we use the code to ensure that we are moving away from the 
exact state that we think we are moving away). In these code paths I am 
less concerned about encoding errors. I also don't want to abort the 
operation (e.g. "git checkout") in these cases. That's why I only inform
the user about the problem with an error message.

The encoding round-trip check can be expensive. That's why I decided 
initially to only execute the check in the "critical/important" 
write-to-Git-database situations (CONV_WRITE_OBJECT flag!). I also 
decided to run it only if the "SHIFT-JIS" encoding is used as this was
the only encoding that I could find which reportedly does not round-trip
with UTF-8 (although I was not able to replicate the round-trip 
problems). 

I want to change the current implementation as follows:

I want to check the round-trip encoding only if the the environment 
variable "GIT_WORKING_TREE_ENCODING_ROUNDTRIP_CHECK" is set. This way
a user can check the round-trip if necessary for *any* encoding. I
don't want to make it a git config because that setting should only 
rarely be used for debugging purposes.

Performing the round-trip check every time is not necessary from my 
point of view because it can be expensive and I was not able to generate
a test case which *does not* round-trip without triggering any other
iconv error.

- Lars

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-31 17:28             ` Torsten Bögershausen
@ 2018-01-31 19:37               ` Lars Schneider
  2018-02-02 19:17               ` Junio C Hamano
  1 sibling, 0 replies; 43+ messages in thread
From: Lars Schneider @ 2018-01-31 19:37 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes.Schindelin


> On 31 Jan 2018, at 18:28, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> []
>>> That is a good one.
>>> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
>>> make much more sense in Documentation/gitattributes that *.tx
>>> There no text files encoded in UTF-16 wich are called xxx.txt, but those
>>> are non-ideal examples. *.proj makes good sense as an example.
>> 
>> OK, I'll do that. Would that fix the problem which this patch tries to address for you?
>> (I would also explicitly add a paragraph to discuss line endings)
> 
> Please let me see the patch first, before I can have a comment.

Sure! I'll have it ready tomorrow.


> But back to the more general question:
> 
> How should Git handle the line endings of UTF-16 files in the woring-tree,
> which are UTF-8 in the index?
> 
> 
> There are 2 opposite opionions/user expectations here:
> 
> a) They are binary in the working tree, so git should leave the line endings
>   as is. (Unless specified otherwise in the .attributes file)

Well, if you consider your UTF-16 files binary then you would not change
*anything*. You would not enable the new "working-tree-encoding" attribute.
As a consequence, Git's behavior would not change. Git would leave all line 
endings as they are for the UTF-16 files.


> b) They are text files in the index. Git will convert line endings
>   if core.autocrlf is true (or the .gitattributes file specifies "-text")

This would *only* happen if you enable the new "working-tree-encoding"
attribute. In this case a user has already made the conscious decision to
treat these files as text files. Therefore, the user expects Git to handle
them in the same way other text files are handled.


> My feeling is that both arguments are valid, so let's ask for opinions
> and thoughts of others.
> Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody:
> What do yo think ?

- Lars

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
  2018-01-31 19:12             ` Lars Schneider
@ 2018-01-31 22:45               ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-01-31 22:45 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

>> I am not sure why this is special cased and other codepaths have "if
>> WRITE_OBJECT then die, otherwise error" checks, so no, I do not
>> agree with your reasoning, at least not yet.
>
> The convert_to_git()/encode_to_git() machinery is used in two different
> kinds of code paths:
>
> Some code paths actually write to the Git database (indicated by the 
> CONV_WRITE_OBJECT flag). I consider these the "critical/important" code 
> paths and I don't want to tolerate any encoding errors in these cases as 
> the errors would be "forever" in the Git database. That's why I call 
> die() on errors for these cases to abort whatever we are doing. 
>
> Other code paths do not write to the Git database (e.g. during "git 
> checkout" we use the code to ensure that we are moving away from the 
> exact state that we think we are moving away). In these code paths I am 
> less concerned about encoding errors. I also don't want to abort the 
> operation (e.g. "git checkout") in these cases. That's why I only inform
> the user about the problem with an error message.

Warning the users early while they are doing non-writing
operation to give them chance to adjust the contents, before they
actually need to register the contents as objects by writing, at
which point we need to die.  That's a reasonable distinction and all
of that I already agree with.

What was questionable and left unexplained was why this roundtrip
thing needs to be different.

> The encoding round-trip check can be expensive. That's why I decided 
> initially to only execute the check in the "critical/important" 
> write-to-Git-database situations (CONV_WRITE_OBJECT flag!). I also 
> decided to run it only if the "SHIFT-JIS" encoding is used as this was
> the only encoding that I could find which reportedly does not round-trip
> with UTF-8 (although I was not able to replicate the round-trip 
> problems). 

I still do not see why you have problems with the approach of
maintaining a configurable set of "iffy" encodings (and throw SJIS
into the default list) to achieve all of the above and more.  For
SJIS users, instead of having to set environment variables to obtain
safe behaviour, they automatically get safe behaviour.  When using
encodings that are not problematic, they do not need to spend cycles
checking round-trip.  And when SJIS users know they do not care about
roundtrip checks, they can just configure SJIS away from the list.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-01-31 17:28             ` Torsten Bögershausen
  2018-01-31 19:37               ` Lars Schneider
@ 2018-02-02 19:17               ` Junio C Hamano
  2018-02-07  6:31                 ` Torsten Bögershausen
  1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-02-02 19:17 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Git List, Johannes Sixt, Eric Sunshine, Jeff King,
	Ramsay Jones, Johannes.Schindelin

Torsten Bögershausen <tboegi@web.de> writes:

> There are 2 opposite opionions/user expectations here:
>
> a) They are binary in the working tree, so git should leave the line endings
>    as is. (Unless specified otherwise in the .attributes file)
> ...
> b) They are text files in the index. Git will convert line endings
>    if core.autocrlf is true (or the .gitattributes file specifies "-text")

I sense that you seem to be focusing on the distinction between "in
the working tree" vs "in the index" while contrasting.  The "binary
vs text" in your "binary in wt, text in index" is based on the
default heuristics without any input from end-users or the project
that uses Git that happens to contain such files.  If the users and
the project that uses Git want to treat contents in a path as text,
it is text even when it is (re-)encoded to UTF-16, no?

Such files may be (mis)classified as binary with the default
heuristics when there is no help from what is written in the
.gitattributes file, but here we are talking about the case where
the user explicitly tells us it is in UTF-16, right?  Is there such a
thing as UTF-16 binary?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-02-02 19:17               ` Junio C Hamano
@ 2018-02-07  6:31                 ` Torsten Bögershausen
  2018-02-07 18:12                   ` Junio C Hamano
  0 siblings, 1 reply; 43+ messages in thread
From: Torsten Bögershausen @ 2018-02-07  6:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, Git List, Johannes Sixt, Eric Sunshine, Jeff King,
	Ramsay Jones, Johannes.Schindelin

On Fri, Feb 02, 2018 at 11:17:04AM -0800, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
> 
> > There are 2 opposite opionions/user expectations here:
> >
> > a) They are binary in the working tree, so git should leave the line endings
> >    as is. (Unless specified otherwise in the .attributes file)
> > ...
> > b) They are text files in the index. Git will convert line endings
> >    if core.autocrlf is true (or the .gitattributes file specifies "-text")
> 
> I sense that you seem to be focusing on the distinction between "in
> the working tree" vs "in the index" while contrasting.  The "binary
> vs text" in your "binary in wt, text in index" is based on the
> default heuristics without any input from end-users or the project
> that uses Git that happens to contain such files.  If the users and
> the project that uses Git want to treat contents in a path as text,
> it is text even when it is (re-)encoded to UTF-16, no?
> 
> Such files may be (mis)classified as binary with the default
> heuristics when there is no help from what is written in the
> .gitattributes file, but here we are talking about the case where
> the user explicitly tells us it is in UTF-16, right?  Is there such a
> thing as UTF-16 binary?

I don't think so, by definiton UTF-16 is ment to be text.
(this means that git ls-files --eol needs some update, I can have a look)

Do we agree that UTF-16 is text ?
If yes, could Git assume that the "text" attribute is set when
working-tree-encoding is set ?

I would even go a step further and demand that the user -must- make a decision
about the line endings for working-tree-encoded files:
working-tree-encoding=UTF-16                 # illegal, die()
working-tree-encoding=UTF-16 text=auto       # illegal, die()
working-tree-encoding=UTF-16 -text           # no eol conversion
working-tree-encoding=UTF-16 text            # eol according to core.eol
working-tree-encoding=UTF-16 text eol=lf     # LF
working-tree-encoding=UTF-16 text eol=crlf   # CRLF

What do you think ?




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
  2018-02-07  6:31                 ` Torsten Bögershausen
@ 2018-02-07 18:12                   ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-02-07 18:12 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Git List, Johannes Sixt, Eric Sunshine, Jeff King,
	Ramsay Jones, Johannes.Schindelin

Torsten Bögershausen <tboegi@web.de> writes:

>> the user explicitly tells us it is in UTF-16, right?  Is there such a
>> thing as UTF-16 binary?
>
> I don't think so, by definiton UTF-16 is ment to be text.
> (this means that git ls-files --eol needs some update, I can have a look)
>
> Do we agree that UTF-16 is text ?
> If yes, could Git assume that the "text" attribute is set when
> working-tree-encoding is set ?

These are two different questions.  It seems that between the two of
us, we established that "UTF-16 binary" is a nonsense, and a path
that is given working-tree-encoding=UTF-16 must be treated as
holding a text file.  But that does not have direct relevance to the
other question you are asking: "is a question 'does this path have
text attribute set?' be answered with 'yes' if the path has wte
attribute set to UTF-16?"  I think the answer to that latter
question ought to be "no".

By the way, a related tangent is if it makes sense to give
working-tree-encoding to anything that is binary, regardless of the
value---I am inclined to say it is not, so the issue here is not "by
definition UTF-16 is text", but "any path that has wte set to some
enconding should be treated the same way as if the path also has
text attribute set as far as convert machinery is concerned.".


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2018-02-07 18:12 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-20 15:24 [PATCH v4 0/6] convert: add support for different encodings lars.schneider
2018-01-20 15:24 ` [PATCH v4 1/6] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-01-20 15:24 ` [PATCH v4 2/6] strbuf: add xstrdup_toupper() lars.schneider
2018-01-20 15:24 ` [PATCH v4 3/6] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-01-20 15:24 ` [PATCH v4 4/6] utf8: add function to detect a missing " lars.schneider
2018-01-20 15:24 ` [PATCH v4 5/6] convert: add 'working-tree-encoding' attribute lars.schneider
2018-01-21 14:22   ` Simon Ruderich
2018-01-22 12:35     ` Lars Schneider
2018-01-23  0:54       ` Jeff King
2018-01-23 10:25         ` Simon Ruderich
2018-01-23 16:20           ` Jeff King
2018-01-23 21:12             ` Junio C Hamano
2018-01-23  9:27       ` Simon Ruderich
2018-01-22 18:00   ` SQUASH convert: add tracing for " lars.schneider
2018-01-22 19:37     ` Eric Sunshine
2018-01-23 10:17   ` [PATCH v2] " lars.schneider
2018-01-20 15:24 ` [PATCH v4 6/6] " lars.schneider
2018-01-23 20:19 ` [PATCH v4 0/6] convert: add support for different encodings Torsten Bögershausen
2018-01-23 20:53   ` Junio C Hamano
2018-01-29 20:18     ` [PATCH v5 0/7] " tboegi
2018-01-29 20:18     ` [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() tboegi
2018-01-29 20:19     ` [PATCH v5 2/7] strbuf: add xstrdup_toupper() tboegi
2018-01-29 20:19     ` [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
2018-01-29 20:19     ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
2018-01-30 19:15       ` Junio C Hamano
2018-01-30 20:58         ` Lars Schneider
2018-01-30 21:54           ` Junio C Hamano
2018-01-29 20:19     ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
2018-01-30 20:05       ` Junio C Hamano
2018-01-30 20:31         ` Lars Schneider
2018-01-30 21:56           ` Junio C Hamano
2018-01-31 19:12             ` Lars Schneider
2018-01-31 22:45               ` Junio C Hamano
2018-01-29 20:19     ` [PATCH v5 6/7] convert: add tracing for " tboegi
2018-01-29 20:19     ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
2018-01-30 11:23       ` Lars Schneider
2018-01-30 14:40         ` Torsten Bögershausen
2018-01-30 15:14           ` Lars Schneider
2018-01-31 17:28             ` Torsten Bögershausen
2018-01-31 19:37               ` Lars Schneider
2018-02-02 19:17               ` Junio C Hamano
2018-02-07  6:31                 ` Torsten Bögershausen
2018-02-07 18:12                   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).