[PATCH v9 0/8] convert: add support for different encodings

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v9 0/8] convert: add support for different encodings
@ 2018-03-04 20:14 lars.schneider
  2018-03-04 20:14 ` [PATCH v9 1/8] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-4,7 are preparation and helper functions.
Patch 5,6,8 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v8:

* move UTF BOM error checks in a new dedicated function
  validate_encoding() and into a dedicated commit (6)
* remove unnecessary encoding struct (became a plain char*)
* fail early and do not try to run the reencode content in case of a
  validation error
* return early if roundtrip encoding was not found (avoid undefined
  pointer arithmetic)
* fix wrong argument order in encode_to_worktree() error message
* use test_when_finished to cleanup tests
* move UTF-16/32 BOM test file generation into "setup test"
* reduce code duplication in tests
* improve documentation:
    - use *.rc and *.ps1 as examples as they are usually UTF-16 encoded
* fix comment: /advise/advice/

Thanks a lot Eric for your great review! I think I fixed everything you
objected with one exception. You noticed that the current code only
checks for BOMs corresponding to the declared size (16 or 32 bits) [1].
I understand your point of view and I agree that any BOM in these cases
is *most likely* an error. However, according to the spec it might
still be valid. The comments on my related question on StackOverflow
seem to support that view [2]. Therefore, I would like to leave it as
it is in this series. If it turns out to be a problem in practice, then
I am happy to change it later. OK for you?


Thanks,
Lars

[1] https://public-inbox.org/git/DF6F3855-EFE7-4C13-AA53-819AAE0DEF7A@gmail.com/
[2] https://stackoverflow.com/questions/49038872/is-a-utf-32be-bom-valid-in-an-utf-16le-declared-data-stream


  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@autodesk.com/
   v7: https://public-inbox.org/git/20180215152711.158-1-lars.schneider@autodesk.com/
   v8: https://public-inbox.org/git/20180224162801.98860-1-lars.schneider@autodesk.com/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/fdf0d63188
Checkout: git fetch https://github.com/larsxschneider/git encoding-v9 && git checkout fdf0d63188


### Interdiff (v8..v9):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 11315054f4..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -297,14 +297,16 @@ number of pitfalls:
   in your repository, then it is strongly recommended to ensure that all
   clients working with the repository support it.

-  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
-  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
   stored as UTF-8 internally. A client without `working-tree-encoding`
-  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
   typically cause trouble for the users of this file.

   If a Git client, that does not support the `working-tree-encoding`
-  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
   stored "as-is" internally (in this example probably as UTF-16).
   A client with `working-tree-encoding` support will interpret the
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
@@ -325,22 +327,22 @@ Use the `working-tree-encoding` attribute only if you cannot store a file
 in UTF-8 encoding and if you want Git to be able to process the content
 as text.

-As an example, use the following attributes if your '*.proj' files are
+As an example, use the following attributes if your '*.ps1' files are
 UTF-16 encoded with byte order mark (BOM) and you want Git to perform
 automatic line ending conversion based on your platform.

 ------------------------
-*.proj		text working-tree-encoding=UTF-16
+*.ps1		text working-tree-encoding=UTF-16
 ------------------------

-Use the following attributes if your '*.proj' files are UTF-16 little
+Use the following attributes if your '*.ps1' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
 in the working directory. Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 ------------------------
-*.proj 		text working-tree-encoding=UTF-16LE eol=CRLF
+*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
@@ -354,7 +356,7 @@ If you do not know the encoding of a file, then you can use the `file`
 command to guess the encoding:

 ------------------------
-file foo.proj
+file foo.ps1
 ------------------------


diff --git a/convert.c b/convert.c
index 398cd9cf7b..6cbb2b2618 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,53 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,

 }

+static int validate_encoding(const char *path, const char *enc,
+		      const char *data, size_t len, int die_on_error)
+{
+	if (!memcmp("UTF-", enc, 4)) {
+		/*
+		 * Check for detectable errors in UTF encodings
+		 */
+		if (has_prohibited_utf_bom(enc, data, len)) {
+			const char *error_msg = _(
+				"BOM is prohibited in '%s' if encoded as %s");
+			/*
+			 * This advice is shown for UTF-??BE and UTF-??LE
+			 * encodings. We truncate the encoding name to 6
+			 * chars with %.6s to cut off the last two "byte
+			 * order" characters.
+			 */
+			const char *advise_msg = _(
+				"The file '%s' contains a byte order "
+				"mark (BOM). Please use %.6s as "
+				"working-tree-encoding.");
+			advise(advise_msg, path, enc);
+			if (die_on_error)
+				die(error_msg, path, enc);
+			else {
+				return error(error_msg, path, enc);
+			}
+
+		} else if (is_missing_required_utf_bom(enc, data, len)) {
+			const char *error_msg = _(
+				"BOM is required in '%s' if encoded as %s");
+			const char *advise_msg = _(
+				"The file '%s' is missing a byte order "
+				"mark (BOM). Please use %sBE or %sLE "
+				"(depending on the byte order) as "
+				"working-tree-encoding.");
+			advise(advise_msg, path, enc, enc);
+			if (die_on_error)
+				die(error_msg, path, enc);
+			else {
+				return error(error_msg, path, enc);
+			}
+		}
+
+	}
+	return 0;
+}
+
 static void trace_encoding(const char *context, const char *path,
 			   const char *encoding, const char *buf, size_t len)
 {
@@ -297,8 +344,12 @@ static int check_roundtrip(const char* enc_name)
 	 * Search for the given encoding in that string.
 	 */
 	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
-	const char *next = found + strlen(enc_name);
-	int len = strlen(check_roundtrip_encoding);
+	const char *next;
+	int len;
+	if (!found)
+		return 0;
+	next = found + strlen(enc_name);
+	len = strlen(check_roundtrip_encoding);
 	return (found && (
 			/*
 			 * check that the found encoding is at the
@@ -322,17 +373,14 @@ static int check_roundtrip(const char* enc_name)
 		));
 }

-static struct encoding {
-	const char *name;
-	struct encoding *next;
-} *encoding, **encoding_tail;
 static const char *default_encoding = "UTF-8";

 static int encode_to_git(const char *path, const char *src, size_t src_len,
-			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+			 struct strbuf *buf, const char *enc, int conv_flags)
 {
 	char *dst;
 	int dst_len;
+	int die_on_error = conv_flags & CONV_WRITE_OBJECT;

 	/*
 	 * No encoding is specified or there is nothing to encode.
@@ -350,39 +398,11 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (!buf && !src)
 		return 1;

-	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
-		const char *error_msg = _(
-			"BOM is prohibited in '%s' if encoded as %s");
-		/*
-		 * This advise is shown for UTF-??BE and UTF-??LE encodings.
-		 * We truncate the encoding name to 6 chars with %.6s to cut
-		 * off the last two "byte order" characters.
-		 */
-		const char *advise_msg = _(
-			"The file '%s' contains a byte order mark (BOM). "
-			"Please use %.6s as working-tree-encoding.");
-		advise(advise_msg, path, enc->name);
-		if (conv_flags & CONV_WRITE_OBJECT)
-			die(error_msg, path, enc->name);
-		else
-			error(error_msg, path, enc->name);
-
-	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
-		const char *error_msg = _(
-			"BOM is required in '%s' if encoded as %s");
-		const char *advise_msg = _(
-			"The file '%s' is missing a byte order mark (BOM). "
-			"Please use %sBE or %sLE (depending on the byte order) "
-			"as working-tree-encoding.");
-		advise(advise_msg, path, enc->name, enc->name);
-		if (conv_flags & CONV_WRITE_OBJECT)
-			die(error_msg, path, enc->name);
-		else
-			error(error_msg, path, enc->name);
-	}
+	if (validate_encoding(path, enc, src, src_len, die_on_error))
+		return 0;

-	trace_encoding("source", path, enc->name, src, src_len);
-	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+	trace_encoding("source", path, enc, src, src_len);
+	dst = reencode_string_len(src, src_len, default_encoding, enc,
 				  &dst_len);
 	if (!dst) {
 		/*
@@ -392,10 +412,12 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		 * working tree. Let's try to avoid this by screaming loud.
 		 */
 		const char* msg = _("failed to encode '%s' from %s to %s");
-		if (conv_flags & CONV_WRITE_OBJECT)
-			die(msg, path, enc->name, default_encoding);
-		else
-			error(msg, path, enc->name, default_encoding);
+		if (die_on_error)
+			die(msg, path, enc, default_encoding);
+		else {
+			error(msg, path, enc, default_encoding);
+			return 0;
+		}
 	}
 	trace_encoding("destination", path, default_encoding, dst, dst_len);

@@ -418,23 +440,23 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	 * [1] http://unicode.org/faq/utf_bom.html#gen2
 	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
 	 */
-	if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
+	if (die_on_error && check_roundtrip(enc)) {
 		char *re_src;
 		int re_src_len;

 		re_src = reencode_string_len(dst, dst_len,
-					     enc->name, default_encoding,
+					     enc, default_encoding,
 					     &re_src_len);

-		trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
-		trace_encoding("reencoded source", path, enc->name,
+		trace_printf("Checking roundtrip encoding for %s...\n", enc);
+		trace_encoding("reencoded source", path, enc,
 			       re_src, re_src_len);

 		if (!re_src || src_len != re_src_len ||
 		    memcmp(src, re_src, src_len)) {
 			const char* msg = _("encoding '%s' from %s to %s and "
 					    "back is not the same");
-			die(msg, path, enc->name, default_encoding);
+			die(msg, path, enc, default_encoding);
 		}

 		free(re_src);
@@ -445,7 +467,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 }

 static int encode_to_worktree(const char *path, const char *src, size_t src_len,
-			      struct strbuf *buf, struct encoding *enc)
+			      struct strbuf *buf, const char *enc)
 {
 	char *dst;
 	int dst_len;
@@ -457,11 +479,11 @@ static int encode_to_worktree(const char *path, const char *src, size_t src_len,
 	if (!enc || (src && !src_len))
 		return 0;

-	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+	dst = reencode_string_len(src, src_len, enc, default_encoding,
 				  &dst_len);
 	if (!dst) {
 		error("failed to encode '%s' from %s to %s",
-			path, enc->name, default_encoding);
+			path, default_encoding, enc);
 		return 0;
 	}

@@ -1182,33 +1204,23 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }

-static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+static const char *git_path_check_encoding(struct attr_check_item *check)
 {
 	const char *value = check->value;
-	struct encoding *enc;

 	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
 	    !strlen(value))
 		return NULL;

-	for (enc = encoding; enc; enc = enc->next)
-		if (!strcasecmp(value, enc->name))
-			return enc;
-
 	/* Don't encode to the default encoding */
 	if (!strcasecmp(value, default_encoding))
 		return NULL;

-	enc = xcalloc(1, sizeof(*enc));
 	/*
 	 * Ensure encoding names are always upper case (e.g. UTF-8) to
 	 * simplify subsequent string comparisons.
 	 */
-	enc->name = xstrdup_toupper(value);
-	*encoding_tail = enc;
-	encoding_tail = &(enc->next);
-
-	return enc;
+	return xstrdup_toupper(value);
 }

 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
@@ -1266,7 +1278,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
-	struct encoding *working_tree_encoding; /* Supported encoding or default encoding if NULL */
+	const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };

 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1278,7 +1290,6 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 					 "eol", "text", "working-tree-encoding",
 					 NULL);
 		user_convert_tail = &user_convert;
-		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e34c21eb29..23e89ae623 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,7 +6,7 @@ test_description='working-tree-encoding conversion via gitattributes'

 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING

-test_expect_success 'setup test repo' '
+test_expect_success 'setup test files' '
 	git config core.eol lf &&

 	text="hallo there!\ncan you read me?" &&
@@ -14,164 +14,135 @@ test_expect_success 'setup test repo' '
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
-	cp test.utf16.raw test.utf16 &&
-	cp test.utf32.raw test.utf32 &&
+
+	# Line ending tests
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	# BOM tests
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&

 	# Add only UTF-16 file, we will add the UTF-32 file later
+	cp test.utf16.raw test.utf16 &&
+	cp test.utf32.raw test.utf32 &&
 	git add .gitattributes test.utf16 &&
 	git commit -m initial
 '

 test_expect_success 'ensure UTF-8 is stored in Git' '
-	git cat-file -p :test.utf16 >test.utf16.git &&
-	test_cmp_bin test.utf8.raw test.utf16.git &&
+	test_when_finished "rm -f test.utf16.git" &&

-	# cleanup
-	rm test.utf16.git
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git
 '

 test_expect_success 're-encode to UTF-16 on checkout' '
+	test_when_finished "rm -f test.utf16.raw" &&
+
 	rm test.utf16 &&
 	git checkout test.utf16 &&
-	test_cmp_bin test.utf16.raw test.utf16 &&
-
-	# cleanup
-	rm test.utf16.raw
+	test_cmp_bin test.utf16.raw test.utf16
 '

 test_expect_success 'check $GIT_DIR/info/attributes support' '
-	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+	test_when_finished "rm -f test.utf8.raw test.utf32.raw test.utf32.git" &&
+	test_when_finished "git reset --hard HEAD" &&

+	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
 	git add test.utf32 &&

 	git cat-file -p :test.utf32 >test.utf32.git &&
-	test_cmp_bin test.utf8.raw test.utf32.git &&
-
-	# cleanup
-	git reset --hard HEAD &&
-	rm test.utf8.raw test.utf32.raw test.utf32.git
+	test_cmp_bin test.utf8.raw test.utf32.git
 '

-test_expect_success 'check prohibited UTF BOM' '
-	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
-	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
-	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
-	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+for i in 16 32
+do
+	test_expect_success "check prohibited UTF-${i} BOM" '
+		test_when_finished "git reset --hard HEAD" &&

-	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
-	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
-	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
-	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+		echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
+		echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&

-	echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
-	echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
-	echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
-	echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
-
-	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
-	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
-	# the BOM is prohibited.
-	cp bebom.utf16be.raw bebom.utf16be &&
-	test_must_fail git add bebom.utf16be 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
-
-	cp lebom.utf16le.raw lebom.utf16be &&
-	test_must_fail git add lebom.utf16be 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
-
-	cp bebom.utf16be.raw bebom.utf16le &&
-	test_must_fail git add bebom.utf16le 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
-
-	cp lebom.utf16le.raw lebom.utf16le &&
-	test_must_fail git add lebom.utf16le 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
-
-	# ... and the same for UTF-32
-	cp bebom.utf32be.raw bebom.utf32be &&
-	test_must_fail git add bebom.utf32be 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
-
-	cp lebom.utf32le.raw lebom.utf32be &&
-	test_must_fail git add lebom.utf32be 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
-
-	cp bebom.utf32be.raw bebom.utf32le &&
-	test_must_fail git add bebom.utf32le 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
-
-	cp lebom.utf32le.raw lebom.utf32le &&
-	test_must_fail git add lebom.utf32le 2>err.out &&
-	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
-
-	# cleanup
-	git reset --hard HEAD
-'
+		# Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
+		# but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32).
+		# In these cases the BOM is prohibited.
+		cp bebom.utf${i}be.raw bebom.utf${i}be &&
+		test_must_fail git add bebom.utf${i}be 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+
+		cp lebom.utf${i}le.raw lebom.utf${i}be &&
+		test_must_fail git add lebom.utf${i}be 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&

-test_expect_success 'check required UTF BOM' '
-	echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+		cp bebom.utf${i}be.raw bebom.utf${i}le &&
+		test_must_fail git add bebom.utf${i}le 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out &&

-	cp nobom.utf16be.raw nobom.utf16 &&
-	test_must_fail git add nobom.utf16 2>err.out &&
-	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+		cp lebom.utf${i}le.raw lebom.utf${i}le &&
+		test_must_fail git add lebom.utf${i}le 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out
+	'

-	cp nobom.utf16le.raw nobom.utf16 &&
-	test_must_fail git add nobom.utf16 2>err.out &&
-	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+	test_expect_success "check required UTF-${i} BOM" '
+		test_when_finished "git reset --hard HEAD" &&

-	cp nobom.utf32be.raw nobom.utf32 &&
-	test_must_fail git add nobom.utf32 2>err.out &&
-	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+		echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&

-	cp nobom.utf32le.raw nobom.utf32 &&
-	test_must_fail git add nobom.utf32 2>err.out &&
-	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+		cp nobom.utf${i}be.raw nobom.utf${i} &&
+		test_must_fail git add nobom.utf${i} 2>err.out &&
+		test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out &&

-	# cleanup
-	rm nobom.utf16 nobom.utf32 &&
-	git reset --hard HEAD
+		cp nobom.utf${i}le.raw nobom.utf${i} &&
+		test_must_fail git add nobom.utf${i} 2>err.out &&
+		test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out
 	'

-test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
-	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
-	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+	test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
+		test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
+		test_when_finished "git reset --hard HEAD^" &&

-	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
-	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
-	cp crlf.utf16.raw eol.utf16 &&
+		cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
+		cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+		cp crlf.utf${i}.raw eol.utf${i} &&

-	cat >expectIndexLF <<-\EOF &&
-		i/lf    w/-text attr/text             	eol.utf16
+		cat >expectIndexLF <<-EOF &&
+			i/lf    w/-text attr/text             	eol.utf${i}
 		EOF

-	git add eol.utf16 &&
+		git add eol.utf${i} &&
 		git commit -m eol &&

-	# UTF-16 with CRLF (Windows line endings)
-	rm eol.utf16 &&
-	git -c core.eol=crlf checkout eol.utf16 &&
-	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+		# UTF-${i} with CRLF (Windows line endings)
+		rm eol.utf${i} &&
+		git -c core.eol=crlf checkout eol.utf${i} &&
+		test_cmp_bin crlf.utf${i}.raw eol.utf${i} &&

-	# Although the file has CRLF in the working tree, ensure LF in the index
-	git ls-files --eol eol.utf16 >actual &&
+		# Although the file has CRLF in the working tree,
+		# ensure LF in the index
+		git ls-files --eol eol.utf${i} >actual &&
 		test_cmp expectIndexLF actual &&

-	# UTF-16 with LF (Unix line endings)
-	rm eol.utf16 &&
-	git -c core.eol=lf checkout eol.utf16 &&
-	test_cmp_bin lf.utf16.raw eol.utf16 &&
+		# UTF-${i} with LF (Unix line endings)
+		rm eol.utf${i} &&
+		git -c core.eol=lf checkout eol.utf${i} &&
+		test_cmp_bin lf.utf${i}.raw eol.utf${i} &&

 		# The file LF in the working tree, ensure LF in the index
-	git ls-files --eol eol.utf16 >actual &&
-	test_cmp expectIndexLF actual&&
-
-	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
-
-	# cleanup
-	git reset --hard HEAD^
+		git ls-files --eol eol.utf${i} >actual &&
+		test_cmp expectIndexLF actual
 	'
+done

 test_expect_success 'check unsupported encodings' '
+	test_when_finished "rm -f err.out" &&
+	test_when_finished "git reset --hard HEAD" &&

 	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
 	printf "nothing" >t.nothing &&
@@ -180,34 +151,33 @@ test_expect_success 'check unsupported encodings' '
 	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
 	printf "garbage" >t.garbage &&
 	test_must_fail git add t.garbage 2>err.out &&
-	test_i18ngrep "fatal: failed to encode" err.out &&
-
-	# cleanup
-	rm err.out &&
-	git reset --hard HEAD
+	test_i18ngrep "fatal: failed to encode" err.out
 '

 test_expect_success 'error if encoding round trip is not the same during refresh' '
 	BEFORE_STATE=$(git rev-parse HEAD) &&
-
-	# Skip the UTF-16 filter for the added file
-	# This simulates a Git version that has no working tree encoding support
-	echo "hallo" >nonsense.utf16 &&
-	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
-	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	test_when_finished "rm -f err.out" &&
+	test_when_finished "git reset --hard $BEFORE_STATE" &&
+
+	# Add and commit a UTF-16 file but skip the "working-tree-encoding"
+	# filter. Consequently, the in-repo representation is UTF-16 and not
+	# UTF-8. This simulates a Git version that has no working tree encoding
+	# support.
+	echo "*.utf16le text working-tree-encoding=utf-16le" >.gitattributes &&
+	echo "hallo" >nonsense.utf16le &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16le) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16le &&
 	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
 	git update-ref refs/heads/master $COMMIT &&

 	test_must_fail git checkout HEAD^ 2>err.out &&
-	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
-
-	# cleanup
-	rm err.out &&
-	git reset --hard $BEFORE_STATE
+	test_i18ngrep "error: .* overwritten by checkout:" err.out
 '

 test_expect_success 'error if encoding garbage is already in Git' '
 	BEFORE_STATE=$(git rev-parse HEAD) &&
+	test_when_finished "rm -f err.out" &&
+	test_when_finished "git reset --hard $BEFORE_STATE" &&

 	# Skip the UTF-16 filter for the added file
 	# This simulates a Git version that has no checkoutEncoding support
@@ -218,52 +188,46 @@ test_expect_success 'error if encoding garbage is already in Git' '
 	git update-ref refs/heads/master $COMMIT &&

 	git diff 2>err.out &&
-	test_i18ngrep "error: BOM is required" err.out &&
-
-	# cleanup
-	rm err.out &&
-	git reset --hard $BEFORE_STATE
+	test_i18ngrep "error: BOM is required" err.out
 '

 test_expect_success 'check roundtrip encoding' '
+	test_when_finished "rm -f roundtrip.shift roundtrip.utf16" &&
+	test_when_finished "git reset --hard HEAD" &&
+
 	text="hallo there!\nroundtrip test here!" &&
 	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
 	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&

 	# SHIFT-JIS encoded files are round-trip checked by default...
-	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 |
 		grep "Checking roundtrip encoding for SHIFT-JIS" &&
 	git reset &&

 	# ... unless we overwrite the Git config!
-	test_config core.checkRoundtripEncoding "garbage" &&
-	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+	! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage \
+		add .gitattributes roundtrip.shift 2>&1 |
 		grep "Checking roundtrip encoding for SHIFT-JIS" &&
-	test_unconfig core.checkRoundtripEncoding &&
 	git reset &&

 	# UTF-16 encoded files should not be round-trip checked by default...
-	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&

 	# ... unless we tell Git to check it!
-	test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
-	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-16, UTF-32" \
+		add roundtrip.utf16 2>&1 |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&

 	# ... unless we tell Git to check it!
 	# (here we also check that the casing of the encoding is irrelevant)
-	test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
-	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-32, utf-16" \
+		add roundtrip.utf16 2>&1 |
 		grep "Checking roundtrip encoding for UTF-16" &&
-	git reset &&
-
-	# cleanup
-	rm roundtrip.shift roundtrip.utf16 &&
-	git reset --hard HEAD
+	git reset
 '

 test_done
diff --git a/utf8.h b/utf8.h
index 62f86fba64..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -71,9 +71,9 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
 		       const char *s);

 /*
- * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
- * or UTF-32LE a BOM must not be used [1]. The function returns true if
- * this is the case.
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
  *
  * [1] http://unicode.org/faq/utf_bom.html#bom10
  */
@@ -83,10 +83,13 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
  * If the endianness is not defined in the encoding name, then we
  * require a BOM. The function returns true if a required BOM is missing.
  *
- * The Unicode standard instructs to assume big-endian if there
- * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
- * encoding standard used in HTML5 recommends to assume
- * little-endian to "deal with deployed content" [3].
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
  *
  * [1] http://unicode.org/faq/utf_bom.html#gen6
  * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf


### Patches

Lars Schneider (8):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: check for detectable errors in UTF encodings
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  88 +++++++++++++
 config.c                         |   5 +
 convert.c                        | 267 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 233 ++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  28 ++++
 12 files changed, 680 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 1/8] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-04 20:14 ` [PATCH v9 2/8] strbuf: add xstrdup_toupper() lars.schneider
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 2/8] strbuf: add xstrdup_toupper()
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
  2018-03-04 20:14 ` [PATCH v9 1/8] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-04 20:14 ` [PATCH v9 3/8] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 3/8] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
  2018-03-04 20:14 ` [PATCH v9 1/8] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-03-04 20:14 ` [PATCH v9 2/8] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-04 20:14 ` [PATCH v9 4/8] utf8: add function to detect a missing " lars.schneider
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..0db1db4519 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-03-04 20:14 ` [PATCH v9 3/8] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-06 20:50   ` Junio C Hamano
  2018-03-04 20:14 ` [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute lars.schneider
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-03-04 20:14 ` [PATCH v9 4/8] utf8: add function to detect a missing " lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-06 20:42   ` Eric Sunshine
  2018-03-04 20:14 ` [PATCH v9 6/8] convert: check for detectable errors in UTF encodings lars.schneider
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt  |  80 +++++++++++++++++++++++
 convert.c                        | 114 ++++++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 135 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 330 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+------------------------
+*.ps1		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+------------------------
+*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.ps1
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..2f864df258 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, const char *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+	int die_on_error = conv_flags & CONV_WRITE_OBJECT;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (die_on_error)
+			die(msg, path, enc, default_encoding);
+		else {
+			error(msg, path, enc, default_encoding);
+			return 0;
+		}
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, const char *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, default_encoding, enc);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static const char *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	/*
+	 * Ensure encoding names are always upper case (e.g. UTF-8) to
+	 * simplify subsequent string comparisons.
+	 */
+	return xstrdup_toupper(value);
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1125,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,7 +1134,8 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
 		git_config(read_convert_config, NULL);
 	}
@@ -1064,6 +1158,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1239,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1269,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1301,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1773,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.working_tree_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada9..e2f319d677 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 0000000000..71e8e3700b
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,135 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test files' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+
+	# Line ending tests
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	# BOM tests
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	# Add only UTF-16 file, we will add the UTF-32 file later
+	cp test.utf16.raw test.utf16 &&
+	cp test.utf32.raw test.utf32 &&
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	test_when_finished "rm -f test.utf16.git" &&
+
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	test_when_finished "rm -f test.utf16.raw" &&
+
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16
+'
+
+test_expect_success 'check $GIT_DIR/info/attributes support' '
+	test_when_finished "rm -f test.utf8.raw test.utf32.raw test.utf32.git" &&
+	test_when_finished "git reset --hard HEAD" &&
+
+	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+	git add test.utf32 &&
+
+	git cat-file -p :test.utf32 >test.utf32.git &&
+	test_cmp_bin test.utf8.raw test.utf32.git
+'
+
+for i in 16 32
+do
+	test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
+		test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
+		test_when_finished "git reset --hard HEAD^" &&
+
+		cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
+		cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+		cp crlf.utf${i}.raw eol.utf${i} &&
+
+		cat >expectIndexLF <<-EOF &&
+			i/lf    w/-text attr/text             	eol.utf${i}
+		EOF
+
+		git add eol.utf${i} &&
+		git commit -m eol &&
+
+		# UTF-${i} with CRLF (Windows line endings)
+		rm eol.utf${i} &&
+		git -c core.eol=crlf checkout eol.utf${i} &&
+		test_cmp_bin crlf.utf${i}.raw eol.utf${i} &&
+
+		# Although the file has CRLF in the working tree,
+		# ensure LF in the index
+		git ls-files --eol eol.utf${i} >actual &&
+		test_cmp expectIndexLF actual &&
+
+		# UTF-${i} with LF (Unix line endings)
+		rm eol.utf${i} &&
+		git -c core.eol=lf checkout eol.utf${i} &&
+		test_cmp_bin lf.utf${i}.raw eol.utf${i} &&
+
+		# The file LF in the working tree, ensure LF in the index
+		git ls-files --eol eol.utf${i} >actual &&
+		test_cmp expectIndexLF actual
+	'
+done
+
+test_expect_success 'check unsupported encodings' '
+	test_when_finished "rm -f err.out" &&
+	test_when_finished "git reset --hard HEAD" &&
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+	test_when_finished "rm -f err.out" &&
+	test_when_finished "git reset --hard $BEFORE_STATE" &&
+
+	# Add and commit a UTF-16 file but skip the "working-tree-encoding"
+	# filter. Consequently, the in-repo representation is UTF-16 and not
+	# UTF-8. This simulates a Git version that has no working tree encoding
+	# support.
+	echo "*.utf16le text working-tree-encoding=utf-16le" >.gitattributes &&
+	echo "hallo" >nonsense.utf16le &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16le) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16le &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out
+'
+
+test_done
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 6/8] convert: check for detectable errors in UTF encodings
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-03-04 20:14 ` [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-05 21:50   ` Junio C Hamano
  2018-03-06 20:55   ` Eric Sunshine
  2018-03-04 20:14 ` [PATCH v9 7/8] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 50 +++++++++++++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh | 57 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)

diff --git a/convert.c b/convert.c
index 2f864df258..9647b06679 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,53 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+		      const char *data, size_t len, int die_on_error)
+{
+	if (!memcmp("UTF-", enc, 4)) {
+		/*
+		 * Check for detectable errors in UTF encodings
+		 */
+		if (has_prohibited_utf_bom(enc, data, len)) {
+			const char *error_msg = _(
+				"BOM is prohibited in '%s' if encoded as %s");
+			/*
+			 * This advice is shown for UTF-??BE and UTF-??LE
+			 * encodings. We truncate the encoding name to 6
+			 * chars with %.6s to cut off the last two "byte
+			 * order" characters.
+			 */
+			const char *advise_msg = _(
+				"The file '%s' contains a byte order "
+				"mark (BOM). Please use %.6s as "
+				"working-tree-encoding.");
+			advise(advise_msg, path, enc);
+			if (die_on_error)
+				die(error_msg, path, enc);
+			else {
+				return error(error_msg, path, enc);
+			}
+
+		} else if (is_missing_required_utf_bom(enc, data, len)) {
+			const char *error_msg = _(
+				"BOM is required in '%s' if encoded as %s");
+			const char *advise_msg = _(
+				"The file '%s' is missing a byte order "
+				"mark (BOM). Please use %sBE or %sLE "
+				"(depending on the byte order) as "
+				"working-tree-encoding.");
+			advise(advise_msg, path, enc, enc);
+			if (die_on_error)
+				die(error_msg, path, enc);
+			else {
+				return error(error_msg, path, enc);
+			}
+		}
+
+	}
+	return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +338,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (!buf && !src)
 		return 1;
 
+	if (validate_encoding(path, enc, src, src_len, die_on_error))
+		return 0;
+
 	dst = reencode_string_len(src, src_len, default_encoding, enc,
 				  &dst_len);
 	if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 71e8e3700b..5c7e36a164 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
 
 for i in 16 32
 do
+	test_expect_success "check prohibited UTF-${i} BOM" '
+		test_when_finished "git reset --hard HEAD" &&
+
+		echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
+		echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&
+
+		# Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
+		# but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32).
+		# In these cases the BOM is prohibited.
+		cp bebom.utf${i}be.raw bebom.utf${i}be &&
+		test_must_fail git add bebom.utf${i}be 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+
+		cp lebom.utf${i}le.raw lebom.utf${i}be &&
+		test_must_fail git add lebom.utf${i}be 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+
+		cp bebom.utf${i}be.raw bebom.utf${i}le &&
+		test_must_fail git add bebom.utf${i}le 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out &&
+
+		cp lebom.utf${i}le.raw lebom.utf${i}le &&
+		test_must_fail git add lebom.utf${i}le 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out
+	'
+
+	test_expect_success "check required UTF-${i} BOM" '
+		test_when_finished "git reset --hard HEAD" &&
+
+		echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&
+
+		cp nobom.utf${i}be.raw nobom.utf${i} &&
+		test_must_fail git add nobom.utf${i} 2>err.out &&
+		test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out &&
+
+		cp nobom.utf${i}le.raw nobom.utf${i} &&
+		test_must_fail git add nobom.utf${i} 2>err.out &&
+		test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out
+	'
+
 	test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
 		test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
 		test_when_finished "git reset --hard HEAD^" &&
@@ -132,4 +172,21 @@ test_expect_success 'error if encoding round trip is not the same during refresh
 	test_i18ngrep "error: .* overwritten by checkout:" err.out
 '
 
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+	test_when_finished "rm -f err.out" &&
+	test_when_finished "git reset --hard $BEFORE_STATE" &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out
+'
+
 test_done
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 7/8] convert: add tracing for 'working-tree-encoding' attribute
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-03-04 20:14 ` [PATCH v9 6/8] convert: check for detectable errors in UTF encodings lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-04 20:14 ` [PATCH v9 8/8] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  2018-03-06 21:05 ` [PATCH v9 0/8] convert: add support for different encodings Eric Sunshine
  8 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 25 +++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index 9647b06679..eec34a94b9 100644
--- a/convert.c
+++ b/convert.c
@@ -313,6 +313,29 @@ static int validate_encoding(const char *path, const char *enc,
 	return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -341,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (validate_encoding(path, enc, src, src_len, die_on_error))
 		return 0;
 
+	trace_encoding("source", path, enc, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc,
 				  &dst_len);
 	if (!dst) {
@@ -358,6 +382,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			return 0;
 		}
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5c7e36a164..bdc487b44f 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v9 8/8] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
                   ` (6 preceding siblings ...)
  2018-03-04 20:14 ` [PATCH v9 7/8] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
@ 2018-03-04 20:14 ` lars.schneider
  2018-03-06 21:05 ` [PATCH v9 0/8] convert: add support for different encodings Eric Sunshine
  8 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2018-03-04 20:14 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/config.txt         |  6 ++++
 Documentation/gitattributes.txt  |  8 +++++
 config.c                         |  5 +++
 convert.c                        | 78 ++++++++++++++++++++++++++++++++++++++++
 convert.h                        |  1 +
 environment.c                    |  1 +
 t/t0028-working-tree-encoding.sh | 39 ++++++++++++++++++++
 7 files changed, 138 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
 	This variable can be set to 'input',
 	in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+	A comma separated list of encodings that Git performs UTF-8 round
+	trip checks on if they are used in an `working-tree-encoding`
+	attribute (see linkgit:gitattributes[5]). The default value is
+	`SHIFT-JIS`.
+
 core.symlinks::
 	If false, symbolic links are checked out as small plain files that
 	contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkroundtripencoding")) {
+		check_roundtrip_encoding = xstrdup(value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.notesref")) {
 		notes_ref_name = xstrdup(value);
 		return 0;
diff --git a/convert.c b/convert.c
index eec34a94b9..6cbb2b2618 100644
--- a/convert.c
+++ b/convert.c
@@ -336,6 +336,43 @@ static void trace_encoding(const char *context, const char *path,
 	strbuf_release(&trace);
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+	/*
+	 * check_roundtrip_encoding contains a string of space and/or
+	 * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+	 * Search for the given encoding in that string.
+	 */
+	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+	const char *next;
+	int len;
+	if (!found)
+		return 0;
+	next = found + strlen(enc_name);
+	len = strlen(check_roundtrip_encoding);
+	return (found && (
+			/*
+			 * check that the found encoding is at the
+			 * beginning of check_roundtrip_encoding or
+			 * that it is prefixed with a space or comma
+			 */
+			found == check_roundtrip_encoding || (
+				found > check_roundtrip_encoding &&
+				(*(found-1) == ' ' || *(found-1) == ',')
+			)
+		) && (
+			/*
+			 * check that the found encoding is at the
+			 * end of check_roundtrip_encoding or
+			 * that it is suffixed with a space or comma
+			 */
+			next == check_roundtrip_encoding + len || (
+				next < check_roundtrip_encoding + len &&
+				(*next == ' ' || *next == ',')
+			)
+		));
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -384,6 +421,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	}
 	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
+	/*
+	 * UTF supports lossless conversion round tripping [1] and conversions
+	 * between UTF and other encodings are mostly round trip safe as
+	 * Unicode aims to be a superset of all other character encodings.
+	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+	 * trip issues [2]. Check the round trip conversion for all encodings
+	 * listed in core.checkRoundtripEncoding.
+	 *
+	 * The round trip check is only performed if content is written to Git.
+	 * This ensures that no information is lost during conversion to/from
+	 * the internal UTF-8 representation.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without an iconv error. Iconv errors
+	 * are already caught above.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if (die_on_error && check_roundtrip(enc)) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc, default_encoding,
+					     &re_src_len);
+
+		trace_printf("Checking roundtrip encoding for %s...\n", enc);
+		trace_encoding("reencoded source", path, enc,
+			       re_src, re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			die(msg, path, enc, default_encoding);
+		}
+
+		free(re_src);
+	}
+
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
 }
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };
 
 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
 						  const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index bdc487b44f..23e89ae623 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -191,4 +191,43 @@ test_expect_success 'error if encoding garbage is already in Git' '
 	test_i18ngrep "error: BOM is required" err.out
 '
 
+test_expect_success 'check roundtrip encoding' '
+	test_when_finished "rm -f roundtrip.shift roundtrip.utf16" &&
+	test_when_finished "git reset --hard HEAD" &&
+
+	text="hallo there!\nroundtrip test here!" &&
+	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+	# SHIFT-JIS encoded files are round-trip checked by default...
+	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# ... unless we overwrite the Git config!
+	! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage \
+		add .gitattributes roundtrip.shift 2>&1 |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# UTF-16 encoded files should not be round-trip checked by default...
+	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-16, UTF-32" \
+		add roundtrip.utf16 2>&1 |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	# (here we also check that the casing of the encoding is irrelevant)
+	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-32, utf-16" \
+		add roundtrip.utf16 2>&1 |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset
+'
+
 test_done
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings
  2018-03-04 20:14 ` [PATCH v9 6/8] convert: check for detectable errors in UTF encodings lars.schneider
@ 2018-03-05 21:50   ` Junio C Hamano
  2018-03-05 23:45     ` Lars Schneider
  2018-03-06 20:55   ` Eric Sunshine
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2018-03-05 21:50 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> +static int validate_encoding(const char *path, const char *enc,
> +		      const char *data, size_t len, int die_on_error)
> +{
> +	if (!memcmp("UTF-", enc, 4)) {

Does the caller already know that enc is sufficiently long that
using memcmp is safe?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings
  2018-03-05 21:50   ` Junio C Hamano
@ 2018-03-05 23:45     ` Lars Schneider
  2018-03-06  1:23       ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Schneider @ 2018-03-05 23:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	Eric Sunshine, Jeff King, ramsay, Johannes.Schindelin


> On 05 Mar 2018, at 22:50, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +		      const char *data, size_t len, int die_on_error)
>> +{
>> +	if (!memcmp("UTF-", enc, 4)) {
> 
> Does the caller already know that enc is sufficiently long that
> using memcmp is safe?

No :-(

Would you be willing to squash that in?

    if (strlen(enc) > 4 && !memcmp("UTF-", enc, 4)) {

I deliberately used "> 4" as plain "UTF-" is not even valid.


Thanks for spotting this,
Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings
  2018-03-05 23:45     ` Lars Schneider
@ 2018-03-06  1:23       ` Junio C Hamano
  2018-03-06 17:03         ` Lars Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2018-03-06  1:23 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	Eric Sunshine, Jeff King, ramsay, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 05 Mar 2018, at 22:50, Junio C Hamano <gitster@pobox.com> wrote:
>> 
>> lars.schneider@autodesk.com writes:
>> 
>>> +static int validate_encoding(const char *path, const char *enc,
>>> +		      const char *data, size_t len, int die_on_error)
>>> +{
>>> +	if (!memcmp("UTF-", enc, 4)) {
>> 
>> Does the caller already know that enc is sufficiently long that
>> using memcmp is safe?
>
> No :-(
>
> Would you be willing to squash that in?
>
>     if (strlen(enc) > 4 && !memcmp("UTF-", enc, 4)) {
>
> I deliberately used "> 4" as plain "UTF-" is not even valid.

I'd rather not.  The code does not have to even look at 6th and
later bytes in the enc[] even if it wanted to reject "UTF-" followed
by nothing, but use of strlen() forces it to look at everything.

Stepping back, shouldn't

	if (starts_with(enc, "UTF-") 

be sufficient?  If you really care about the case where "UTF-" alone
comes here, you could write

	if (starts_with(enc, "UTF-") && enc[4])

but I do not think "&& enc[4]" is even needed.  The functions called
from this block would not consider "UTF-" alone as something valid
anyway, so with that "&& enf[4]" we would be piling more code only
for invalid/rare case.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings
  2018-03-06  1:23       ` Junio C Hamano
@ 2018-03-06 17:03         ` Lars Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Schneider @ 2018-03-06 17:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	Eric Sunshine, Jeff King, ramsay, Johannes.Schindelin


> On 06 Mar 2018, at 02:23, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> On 05 Mar 2018, at 22:50, Junio C Hamano <gitster@pobox.com> wrote:
>>> 
>>> lars.schneider@autodesk.com writes:
>>> 
>>>> +static int validate_encoding(const char *path, const char *enc,
>>>> +		      const char *data, size_t len, int die_on_error)
>>>> +{
>>>> +	if (!memcmp("UTF-", enc, 4)) {
>>> 
>>> Does the caller already know that enc is sufficiently long that
>>> using memcmp is safe?
>> 
>> No :-(
>> 
>> Would you be willing to squash that in?
>> 
>>    if (strlen(enc) > 4 && !memcmp("UTF-", enc, 4)) {
>> 
>> I deliberately used "> 4" as plain "UTF-" is not even valid.
> 
> I'd rather not.  The code does not have to even look at 6th and
> later bytes in the enc[] even if it wanted to reject "UTF-" followed
> by nothing, but use of strlen() forces it to look at everything.
> 
> Stepping back, shouldn't
> 
> 	if (starts_with(enc, "UTF-") 
> 
> be sufficient?  If you really care about the case where "UTF-" alone
> comes here, you could write
> 
> 	if (starts_with(enc, "UTF-") && enc[4])
> 
> but I do not think "&& enc[4]" is even needed.  The functions called
> from this block would not consider "UTF-" alone as something valid
> anyway, so with that "&& enf[4]" we would be piling more code only
> for invalid/rare case.

Agreed, "if (starts_with(enc, "UTF-"))" is sufficient. Can you squash
that in? Thanks for pointing me to starts_with() as I forgot about this
function!

- Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute
  2018-03-04 20:14 ` [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-03-06 20:42   ` Eric Sunshine
  2018-03-06 22:13     ` Lars Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Sunshine @ 2018-03-06 20:42 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sun, Mar 4, 2018 at 3:14 PM,  <lars.schneider@autodesk.com> wrote:
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
> [...]
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> +       [...]
> +       /*
> +        * Ensure encoding names are always upper case (e.g. UTF-8) to
> +        * simplify subsequent string comparisons.
> +        */
> +       return xstrdup_toupper(value);

xstrdup_toupper() allocates memory...

> +}
> @@ -1033,6 +1125,7 @@ struct conv_attrs {
>         enum crlf_action attr_action; /* What attr says */
>         enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
>         int ident;
> +       const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */

...which is assigned to 'const char *'...

>  };
> @@ -1064,6 +1158,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>                         else if (eol_attr == EOL_CRLF)
>                                 ca->crlf_action = CRLF_TEXT_CRLF;
>                 }
> +               ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);

...by this code, and eventually leaked.

It's too bad it isn't cleaned up (freed), but looking at the callers,
fixing this leak would be mildly noisy (though not particularly
invasive). How much do we care about this leak?

>         } else {
>                 ca->drv = NULL;
>                 ca->crlf_action = CRLF_UNDEFINED;
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -0,0 +1,135 @@
> +test_expect_success 'check $GIT_DIR/info/attributes support' '
> +       test_when_finished "rm -f test.utf8.raw test.utf32.raw test.utf32.git" &&

It seems weird to be cleaning up files this test didn't create
(test.utf8.raw and test.utf32.raw).

> +       test_when_finished "git reset --hard HEAD" &&
> +
> +       echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
> +       git add test.utf32 &&
> +
> +       git cat-file -p :test.utf32 >test.utf32.git &&
> +       test_cmp_bin test.utf8.raw test.utf32.git
> +'
> +
> +test_expect_success 'check unsupported encodings' '
> +       test_when_finished "rm -f err.out" &&
> +       test_when_finished "git reset --hard HEAD" &&

Resetting to HEAD here is an important cleanup action, but tests don't
usually clean up files such as 'err.out' since such detritus doesn't
usually impact subsequent tests negatively. (Just an observation; no
re-roll needed.)

> +       echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
> +       printf "nothing" >t.nothing &&
> +       git add t.nothing &&
> +
> +       echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
> +       printf "garbage" >t.garbage &&
> +       test_must_fail git add t.garbage 2>err.out &&
> +       test_i18ngrep "fatal: failed to encode" err.out
> +'

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-04 20:14 ` [PATCH v9 4/8] utf8: add function to detect a missing " lars.schneider
@ 2018-03-06 20:50   ` Junio C Hamano
  2018-03-06 22:39     ` Lars Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2018-03-06 20:50 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
> +{
> +	return (
> +	   !strcmp(enc, "UTF-16") &&
> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> +	) || (
> +	   !strcmp(enc, "UTF-32") &&
> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> +	);
> +}

These strcmp() calls seem inconsistent with the principle embodied
by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
and make the interface uneven.  I am wondering if we also want to
complain when the user gave us "utf16" and there is no byte order
mark in the contents, for example?  Also "UTF16" or other spelling
the platform may support but this code fails to recognise will go
unchecked.

Which actually may be a feature, not a bug, to be able to bypass
this check---I dunno.

The same comment applies to the previous step.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings
  2018-03-04 20:14 ` [PATCH v9 6/8] convert: check for detectable errors in UTF encodings lars.schneider
  2018-03-05 21:50   ` Junio C Hamano
@ 2018-03-06 20:55   ` Eric Sunshine
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2018-03-06 20:55 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sun, Mar 4, 2018 at 3:14 PM,  <lars.schneider@autodesk.com> wrote:
> Check that new content is valid with respect to the user defined
> 'working-tree-encoding' attribute.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -266,6 +266,53 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static int validate_encoding(const char *path, const char *enc,
> +                     const char *data, size_t len, int die_on_error)
> +{
> +       if (!memcmp("UTF-", enc, 4)) {
> +               [...]
> +               if (has_prohibited_utf_bom(enc, data, len)) {
> +                       [...]
> +                       if (die_on_error)
> +                               die(error_msg, path, enc);
> +                       else {
> +                               return error(error_msg, path, enc);
> +                       }
> +               } [...]
> +       return 0;
> +}
> @@ -291,6 +338,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
> +       if (validate_encoding(path, enc, src, src_len, die_on_error))
> +               return 0;

There could be a cleaner separation of responsibilities here which, as
a nice side-effect, would eliminate the repeated "if (die_on_error)
die(...); else return error(...);" pattern. Rather than passing a
'die_on_error' flag to validate_encoding(), it could accept a 'strbuf
*err' to populate in case of error:

    static int validate_encoding(..., struct strbuf *err)
    {
        if validation error:
            populate 'err'
            return -1;
        return 0
    }

and let the caller be responsible for deciding how to handle failure:

    struct strbuf err = STRBUF_INIT;
    ...
    if (validate_encoding(..., &err)) {
        if (die_on_error)
            die(err.buf);
        else {
            error(err.buf);
            strbuf_release(&err);
            return 0;
        }
    }

Not necessarily worth a re-roll, but perhaps a cleanup someone could
submit at some point if interested.

>         dst = reencode_string_len(src, src_len, default_encoding, enc,
>                                   &dst_len);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 0/8] convert: add support for different encodings
  2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
                   ` (7 preceding siblings ...)
  2018-03-04 20:14 ` [PATCH v9 8/8] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
@ 2018-03-06 21:05 ` Eric Sunshine
  8 siblings, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2018-03-06 21:05 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Sun, Mar 4, 2018 at 3:14 PM,  <lars.schneider@autodesk.com> wrote:
> Changes since v8: [...]
>
> Thanks a lot Eric for your great review! I think I fixed everything you
> objected with one exception. You noticed that the current code only
> checks for BOMs corresponding to the declared size (16 or 32 bits) [1].
> I understand your point of view and I agree that any BOM in these cases
> is *most likely* an error. However, according to the spec it might
> still be valid. The comments on my related question on StackOverflow
> seem to support that view [2]. Therefore, I would like to leave it as
> it is in this series. If it turns out to be a problem in practice, then
> I am happy to change it later. OK for you?

Fine. As this is not a correctness issue with the conversion itself,
but rather an attempt to detect misconfiguration by the user, it can
always be revisited later if someone comes up with a more solid
argument that one interpretation is more correct than the other.

I did leave a few review comments on v9, but I don't think any are
necessarily worth a re-roll. If anything, and if they seem actionable,
then perhaps a patch or two atop the series could address them (at
some point).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute
  2018-03-06 20:42   ` Eric Sunshine
@ 2018-03-06 22:13     ` Lars Schneider
  2018-03-06 22:22       ` Eric Sunshine
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Schneider @ 2018-03-06 22:13 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 06 Mar 2018, at 21:42, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sun, Mar 4, 2018 at 3:14 PM,  <lars.schneider@autodesk.com> wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> [...]
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +       [...]
>> +       /*
>> +        * Ensure encoding names are always upper case (e.g. UTF-8) to
>> +        * simplify subsequent string comparisons.
>> +        */
>> +       return xstrdup_toupper(value);
> 
> xstrdup_toupper() allocates memory...
> 
>> +}
>> @@ -1033,6 +1125,7 @@ struct conv_attrs {
>>        enum crlf_action attr_action; /* What attr says */
>>        enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
>>        int ident;
>> +       const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */
> 
> ...which is assigned to 'const char *'...
> 
>> };
>> @@ -1064,6 +1158,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>>                        else if (eol_attr == EOL_CRLF)
>>                                ca->crlf_action = CRLF_TEXT_CRLF;
>>                }
>> +               ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
> 
> ...by this code, and eventually leaked.
> 
> It's too bad it isn't cleaned up (freed), but looking at the callers,
> fixing this leak would be mildly noisy (though not particularly
> invasive). How much do we care about this leak?

Hmm. You are right. That was previously handled by the encoding struct 
linked list that I removed in this iteration. I forgot about that aspect :/
I don't like it leaking. I think I would like to reintroduce the linked
list. This way every encoding is only once in memory. What do you think?


>>        } else {
>>                ca->drv = NULL;
>>                ca->crlf_action = CRLF_UNDEFINED;
>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -0,0 +1,135 @@
>> +test_expect_success 'check $GIT_DIR/info/attributes support' '
>> +       test_when_finished "rm -f test.utf8.raw test.utf32.raw test.utf32.git" &&
> 
> It seems weird to be cleaning up files this test didn't create
> (test.utf8.raw and test.utf32.raw).

Agreed.


>> +       test_when_finished "git reset --hard HEAD" &&
>> +
>> +       echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
>> +       git add test.utf32 &&
>> +
>> +       git cat-file -p :test.utf32 >test.utf32.git &&
>> +       test_cmp_bin test.utf8.raw test.utf32.git
>> +'
>> +
>> +test_expect_success 'check unsupported encodings' '
>> +       test_when_finished "rm -f err.out" &&
>> +       test_when_finished "git reset --hard HEAD" &&
> 
> Resetting to HEAD here is an important cleanup action, but tests don't
> usually clean up files such as 'err.out' since such detritus doesn't
> usually impact subsequent tests negatively. (Just an observation; no
> re-roll needed.)

OK. I'll fix it if I reroll.

- Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute
  2018-03-06 22:13     ` Lars Schneider
@ 2018-03-06 22:22       ` Eric Sunshine
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2018-03-06 22:22 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin

On Tue, Mar 6, 2018 at 5:13 PM, Lars Schneider <larsxschneider@gmail.com> wrote:
>> On 06 Mar 2018, at 21:42, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> On Sun, Mar 4, 2018 at 3:14 PM,  <lars.schneider@autodesk.com> wrote:
>>> +       return xstrdup_toupper(value);
>>
>> xstrdup_toupper() allocates memory...
>>
>>> +       const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */
>>
>> ...which is assigned to 'const char *'...
>>
>>> +               ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
>>
>> ...by this code, and eventually leaked.
>>
>> It's too bad it isn't cleaned up (freed), but looking at the callers,
>> fixing this leak would be mildly noisy (though not particularly
>> invasive). How much do we care about this leak?
>
> Hmm. You are right. That was previously handled by the encoding struct
> linked list that I removed in this iteration. I forgot about that aspect :/
> I don't like it leaking. I think I would like to reintroduce the linked
> list. This way every encoding is only once in memory. What do you think?

It's subjective, but I find the use of a linked-list just for the
purpose of not leaking these strings unnecessarily confusing.

If I was doing it, I'd probably add a conv_attrs_free() function and
call it from each function allocates a 'struct conv_attrs' (including
calling it before early returns -- which prompted my earlier comment
about it being a "mildly noisy" fix).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 20:50   ` Junio C Hamano
@ 2018-03-06 22:39     ` Lars Schneider
  2018-03-06 22:53       ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Schneider @ 2018-03-06 22:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 06 Mar 2018, at 21:50, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> +	return (
>> +	   !strcmp(enc, "UTF-16") &&
>> +	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +	) || (
>> +	   !strcmp(enc, "UTF-32") &&
>> +	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> +	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +	);
>> +}
> 
> These strcmp() calls seem inconsistent with the principle embodied
> by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
> and make the interface uneven. I am wondering if we also want to
> complain when the user gave us "utf16" and there is no byte order
> mark in the contents, for example?

Well, if I use stricmp() then I don't need to call and cleanup
xstrdup_toupper() as discussed with Eric [1]. Is there a case
insensitive starts_with() method?

[1] https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jPpDM6K4jJTe-74FecQ@mail.gmail.com/


>  Also "UTF16" or other spelling
> the platform may support but this code fails to recognise will go
> unchecked.

That is true. However, I would assume all iconv implementations use the
same encoding names for UTF encodings, no? That means UTF16 would never be
valid. Would you agree?

- Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 22:39     ` Lars Schneider
@ 2018-03-06 22:53       ` Junio C Hamano
  2018-03-06 23:00         ` Lars Schneider
  2018-03-06 23:07         ` Junio C Hamano
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2018-03-06 22:53 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

>>  Also "UTF16" or other spelling
>> the platform may support but this code fails to recognise will go
>> unchecked.
>
> That is true. However, I would assume all iconv implementations use the
> same encoding names for UTF encodings, no? That means UTF16 would never be
> valid. Would you agree?

After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
I am not sure what you mean by "Would you agree?"  People can say in
their .gitattributes file that this path is in "UTF16" without dash
and that is what will be fed to this coe, no?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 22:53       ` Junio C Hamano
@ 2018-03-06 23:00         ` Lars Schneider
  2018-03-06 23:07         ` Junio C Hamano
  1 sibling, 0 replies; 26+ messages in thread
From: Lars Schneider @ 2018-03-06 23:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 06 Mar 2018, at 23:53, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>> 
>> That is true. However, I would assume all iconv implementations use the
>> same encoding names for UTF encodings, no? That means UTF16 would never be
>> valid. Would you agree?
> 
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this coe, no?

On macOS I don't see UTF16... but I just checked on my Linux box and
there it is. Consequently, we need to check both versions: with and
without dash.

I'll reroll ASAP (I try to do it first thing in the morning).

Thanks,
Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 22:53       ` Junio C Hamano
  2018-03-06 23:00         ` Lars Schneider
@ 2018-03-06 23:07         ` Junio C Hamano
  2018-03-06 23:12           ` Lars Schneider
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2018-03-06 23:07 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

Junio C Hamano <gitster@pobox.com> writes:

> Lars Schneider <larsxschneider@gmail.com> writes:
>
>>>  Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>>
>> That is true. However, I would assume all iconv implementations use the
>> same encoding names for UTF encodings, no? That means UTF16 would never be
>> valid. Would you agree?
>
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this code, no?

FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
reasonably sure that people expect downcased ones to work as well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 23:07         ` Junio C Hamano
@ 2018-03-06 23:12           ` Lars Schneider
  2018-03-06 23:37             ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Lars Schneider @ 2018-03-06 23:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 07 Mar 2018, at 00:07, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Lars Schneider <larsxschneider@gmail.com> writes:
>> 
>>>> Also "UTF16" or other spelling
>>>> the platform may support but this code fails to recognise will go
>>>> unchecked.
>>> 
>>> That is true. However, I would assume all iconv implementations use the
>>> same encoding names for UTF encodings, no? That means UTF16 would never be
>>> valid. Would you agree?
>> 
>> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
>> I am not sure what you mean by "Would you agree?"  People can say in
>> their .gitattributes file that this path is in "UTF16" without dash
>> and that is what will be fed to this code, no?
> 
> FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
> reasonably sure that people expect downcased ones to work as well.

Sure! That's why I normalized it to upper case:
https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jPpDM6K4jJTe-74FecQ@mail.gmail.com/

After thinking about it I wonder if we should barf on "utf16" without
dash. Your Linux iconv would handle this correctly. My macOS iconv would not.
That means the repo would checkout correctly on your machine but not on mine.

What do you think?

- Lars


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 23:12           ` Lars Schneider
@ 2018-03-06 23:37             ` Junio C Hamano
  2018-03-07 17:39               ` Torsten Bögershausen
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2018-03-06 23:37 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> After thinking about it I wonder if we should barf on "utf16" without
> dash. Your Linux iconv would handle this correctly. My macOS iconv would not.
> That means the repo would checkout correctly on your machine but not on mine.
>
> What do you think?

To be bluntly honest, I prefer not to have excess "sanity checks";
there is no need to barf on utf16 in a project run by those who are
without macOS friends, for example.  For that matter, while I do not
hate it so much to reject it, I am not all that supportive of this
"The consortium says without -LE or -BE suffix there must be BOM,
and this lacks one, so barf loudly" step in this topic myself.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-06 23:37             ` Junio C Hamano
@ 2018-03-07 17:39               ` Torsten Bögershausen
  0 siblings, 0 replies; 26+ messages in thread
From: Torsten Bögershausen @ 2018-03-07 17:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

On Tue, Mar 06, 2018 at 03:37:16PM -0800, Junio C Hamano wrote:
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
> > After thinking about it I wonder if we should barf on "utf16" without
> > dash. Your Linux iconv would handle this correctly. My macOS iconv would not.
> > That means the repo would checkout correctly on your machine but not on mine.
> >
> > What do you think?
> 
> To be bluntly honest, I prefer not to have excess "sanity checks";
> there is no need to barf on utf16 in a project run by those who are
> without macOS friends, for example.  For that matter, while I do not
> hate it so much to reject it, I am not all that supportive of this
> "The consortium says without -LE or -BE suffix there must be BOM,
> and this lacks one, so barf loudly" step in this topic myself.

Loosing or adding a BOM couild render a file useless, depending
on the SW that reads and processes it.

The main reason for being critacal is to make sure that the
material in the working-tree does a safe roundtrip when commited,
pushed, pulled, checked out...

The best thing we can do is probably to follow the specification as
good as possible.

Having a clear specification (UTF-16LE/BE has no BOM, UTF-16 has none,
UTF-16 has a BOM) would even open the chance to work around a buggy
iconv library, because Git can check the BOM.
If, and only if, a platform messes it up, and we want to
make a (compile time) workaround in Git for this very platform.

A consistant behavior across all platforms is a good thing,
sometimes I have a network share mounted to Linux, MacOS and
Windows and it doesn't matter under which Git on which
machine I checkout or commit.

Oh, I see, there is a new patch coming...

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-03-07 17:39 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-04 20:14 [PATCH v9 0/8] convert: add support for different encodings lars.schneider
2018-03-04 20:14 ` [PATCH v9 1/8] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-03-04 20:14 ` [PATCH v9 2/8] strbuf: add xstrdup_toupper() lars.schneider
2018-03-04 20:14 ` [PATCH v9 3/8] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-03-04 20:14 ` [PATCH v9 4/8] utf8: add function to detect a missing " lars.schneider
2018-03-06 20:50   ` Junio C Hamano
2018-03-06 22:39     ` Lars Schneider
2018-03-06 22:53       ` Junio C Hamano
2018-03-06 23:00         ` Lars Schneider
2018-03-06 23:07         ` Junio C Hamano
2018-03-06 23:12           ` Lars Schneider
2018-03-06 23:37             ` Junio C Hamano
2018-03-07 17:39               ` Torsten Bögershausen
2018-03-04 20:14 ` [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute lars.schneider
2018-03-06 20:42   ` Eric Sunshine
2018-03-06 22:13     ` Lars Schneider
2018-03-06 22:22       ` Eric Sunshine
2018-03-04 20:14 ` [PATCH v9 6/8] convert: check for detectable errors in UTF encodings lars.schneider
2018-03-05 21:50   ` Junio C Hamano
2018-03-05 23:45     ` Lars Schneider
2018-03-06  1:23       ` Junio C Hamano
2018-03-06 17:03         ` Lars Schneider
2018-03-06 20:55   ` Eric Sunshine
2018-03-04 20:14 ` [PATCH v9 7/8] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
2018-03-04 20:14 ` [PATCH v9 8/8] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-03-06 21:05 ` [PATCH v9 0/8] convert: add support for different encodings Eric Sunshine

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).