git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v10 0/9] convert: add support for different encodings
@ 2018-03-07 17:30 lars.schneider
  2018-03-07 17:30 ` [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-5,8 are preparation and helper functions. Patch 3 is new.
Patch 6,7,9 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v9:

* make has_bom_prefix() / is_missing_required_utf_bom() more lenient in
  what they accept (ignore casing, accept UTF?? and UTF-?? , Junio)
* replace memcmp() which does not check the length of the strings with
  a case insensitive variant of starts_with() (Junio)
* do not convert encoding names to uppercase
  (this fixes a leak introduced in the last iteration, Eric)
* do not cleanup test files that the test did not create (Eric)
* do not cleanup err.out files in tests (Eric)

I did not address Eric's feedback to make validate_encoding()
cleaner [1] as I want to stabilize the series and Eric wrote
that we can clean this up later:
http://public-inbox.org/git/CAPig+cSoka-yBTYBz42JGQTyCH7LDWnToeOvdZfG0_64o9QnBQ@mail.gmail.com

Thanks,
Lars


  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@autodesk.com/
   v7: https://public-inbox.org/git/20180215152711.158-1-lars.schneider@autodesk.com/
   v8: https://public-inbox.org/git/20180224162801.98860-1-lars.schneider@autodesk.com/
   v9: https://public-inbox.org/git/20180304201418.60958-1-lars.schneider@autodesk.com/



Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/a602b8dcef
Checkout: git fetch https://github.com/larsxschneider/git encoding-v10 && git checkout a602b8dcef


### Interdiff (v9..v10):

diff --git a/convert.c b/convert.c
index 6cbb2b2618..e861f1abbc 100644
--- a/convert.c
+++ b/convert.c
@@ -269,7 +269,8 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static int validate_encoding(const char *path, const char *enc,
 		      const char *data, size_t len, int die_on_error)
 {
-	if (!memcmp("UTF-", enc, 4)) {
+	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+	if (startscase_with(enc, "UTF")) {
 		/*
 		 * Check for detectable errors in UTF encodings
 		 */
@@ -277,16 +278,18 @@ static int validate_encoding(const char *path, const char *enc,
 			const char *error_msg = _(
 				"BOM is prohibited in '%s' if encoded as %s");
 			/*
-			 * This advice is shown for UTF-??BE and UTF-??LE
-			 * encodings. We truncate the encoding name to 6
-			 * chars with %.6s to cut off the last two "byte
-			 * order" characters.
+			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
+			 * We cut off the last two characters of the encoding name
+			 # to generate the encoding name suitable for BOMs.
 			 */
 			const char *advise_msg = _(
 				"The file '%s' contains a byte order "
-				"mark (BOM). Please use %.6s as "
+				"mark (BOM). Please use %s as "
 				"working-tree-encoding.");
-			advise(advise_msg, path, enc);
+			char *upper_enc = xstrdup_toupper(enc);
+			upper_enc[strlen(upper_enc)-2] = '\0';
+			advise(advise_msg, path, upper_enc);
+			free(upper_enc);
 			if (die_on_error)
 				die(error_msg, path, enc);
 			else {
@@ -301,7 +304,9 @@ static int validate_encoding(const char *path, const char *enc,
 				"mark (BOM). Please use %sBE or %sLE "
 				"(depending on the byte order) as "
 				"working-tree-encoding.");
-			advise(advise_msg, path, enc, enc);
+			char *upper_enc = xstrdup_toupper(enc);
+			advise(advise_msg, path, upper_enc, upper_enc);
+			free(upper_enc);
 			if (die_on_error)
 				die(error_msg, path, enc);
 			else {
@@ -1216,11 +1221,7 @@ static const char *git_path_check_encoding(struct attr_check_item *check)
 	if (!strcasecmp(value, default_encoding))
 		return NULL;

-	/*
-	 * Ensure encoding names are always upper case (e.g. UTF-8) to
-	 * simplify subsequent string comparisons.
-	 */
-	return xstrdup_toupper(value);
+	return value;
 }

 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..f648da0c11 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, va_list params);
 extern void set_die_is_recursing_routine(int (*routine)(void));

 extern int starts_with(const char *str, const char *prefix);
+extern int startscase_with(const char *str, const char *prefix);

 /*
  * If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..5779a2d591 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
 			return 0;
 }

+int startscase_with(const char *str, const char *prefix)
+{
+	for (; ; str++, prefix++)
+		if (!*prefix)
+			return 1;
+		else if (tolower(*str) != tolower(*prefix))
+			return 0;
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 				 const char **arg, const char *def)
 {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 23e89ae623..7cff41a350 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -52,7 +52,7 @@ test_expect_success 're-encode to UTF-16 on checkout' '
 '

 test_expect_success 'check $GIT_DIR/info/attributes support' '
-	test_when_finished "rm -f test.utf8.raw test.utf32.raw test.utf32.git" &&
+	test_when_finished "rm -f test.utf32.git" &&
 	test_when_finished "git reset --hard HEAD" &&

 	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
@@ -75,19 +75,19 @@ do
 		# In these cases the BOM is prohibited.
 		cp bebom.utf${i}be.raw bebom.utf${i}be &&
 		test_must_fail git add bebom.utf${i}be 2>err.out &&
-		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&

 		cp lebom.utf${i}le.raw lebom.utf${i}be &&
 		test_must_fail git add lebom.utf${i}be 2>err.out &&
-		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&

 		cp bebom.utf${i}be.raw bebom.utf${i}le &&
 		test_must_fail git add bebom.utf${i}le 2>err.out &&
-		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out &&

 		cp lebom.utf${i}le.raw lebom.utf${i}le &&
 		test_must_fail git add lebom.utf${i}le 2>err.out &&
-		test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
 	'

 	test_expect_success "check required UTF-${i} BOM" '
@@ -97,11 +97,11 @@ do

 		cp nobom.utf${i}be.raw nobom.utf${i} &&
 		test_must_fail git add nobom.utf${i} 2>err.out &&
-		test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out &&
+		test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&

 		cp nobom.utf${i}le.raw nobom.utf${i} &&
 		test_must_fail git add nobom.utf${i} 2>err.out &&
-		test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out
+		test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out
 	'

 	test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
@@ -141,7 +141,6 @@ do
 done

 test_expect_success 'check unsupported encodings' '
-	test_when_finished "rm -f err.out" &&
 	test_when_finished "git reset --hard HEAD" &&

 	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
@@ -156,7 +155,6 @@ test_expect_success 'check unsupported encodings' '

 test_expect_success 'error if encoding round trip is not the same during refresh' '
 	BEFORE_STATE=$(git rev-parse HEAD) &&
-	test_when_finished "rm -f err.out" &&
 	test_when_finished "git reset --hard $BEFORE_STATE" &&

 	# Add and commit a UTF-16 file but skip the "working-tree-encoding"
@@ -176,7 +174,6 @@ test_expect_success 'error if encoding round trip is not the same during refresh

 test_expect_success 'error if encoding garbage is already in Git' '
 	BEFORE_STATE=$(git rev-parse HEAD) &&
-	test_when_finished "rm -f err.out" &&
 	test_when_finished "git reset --hard $BEFORE_STATE" &&

 	# Skip the UTF-16 filter for the added file
@@ -219,14 +216,14 @@ test_expect_success 'check roundtrip encoding' '
 	# ... unless we tell Git to check it!
 	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-16, UTF-32" \
 		add roundtrip.utf16 2>&1 |
-		grep "Checking roundtrip encoding for UTF-16" &&
+		grep "Checking roundtrip encoding for utf-16" &&
 	git reset &&

 	# ... unless we tell Git to check it!
 	# (here we also check that the casing of the encoding is irrelevant)
 	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-32, utf-16" \
 		add roundtrip.utf16 2>&1 |
-		grep "Checking roundtrip encoding for UTF-16" &&
+		grep "Checking roundtrip encoding for utf-16" &&
 	git reset
 '

diff --git a/utf8.c b/utf8.c
index 5113d26e56..81c6678df1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -552,11 +552,13 @@ static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
-	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE") ||
+	   !strcasecmp(enc, "UTF16BE") || !strcasecmp(enc, "UTF16LE")) &&
 	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
 	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
 	) || (
-	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE") ||
+	   !strcasecmp(enc, "UTF32BE") || !strcasecmp(enc, "UTF32LE")) &&
 	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
 	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
 	);
@@ -565,11 +567,11 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
-	   !strcmp(enc, "UTF-16") &&
+	   (!strcasecmp(enc, "UTF-16") || !strcasecmp(enc, "UTF16")) &&
 	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
 	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
 	) || (
-	   !strcmp(enc, "UTF-32") &&
+	   (!strcasecmp(enc, "UTF-32") || !strcasecmp(enc, "UTF32")) &&
 	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
 	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
 	);


### Patches

Lars Schneider (9):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  strbuf: add a case insensitive starts_with()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: check for detectable errors in UTF encodings
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  88 +++++++++++++
 config.c                         |   5 +
 convert.c                        | 268 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 git-compat-util.h                |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  22 +++-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 230 +++++++++++++++++++++++++++++++++
 utf8.c                           |  39 ++++++
 utf8.h                           |  28 ++++
 13 files changed, 690 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 17:30 ` [PATCH v10 2/9] strbuf: add xstrdup_toupper() lars.schneider
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 2/9] strbuf: add xstrdup_toupper()
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
  2018-03-07 17:30 ` [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 17:30 ` [PATCH v10 3/9] strbuf: add a case insensitive starts_with() lars.schneider
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
  2018-03-07 17:30 ` [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-03-07 17:30 ` [PATCH v10 2/9] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-08  0:31   ` Duy Nguyen
  2018-03-07 17:30 ` [PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Check in a case insensitive manner if one string is a prefix of another
string.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 git-compat-util.h | 1 +
 strbuf.c          | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..f648da0c11 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, va_list params);
 extern void set_die_is_recursing_routine(int (*routine)(void));
 
 extern int starts_with(const char *str, const char *prefix);
+extern int startscase_with(const char *str, const char *prefix);
 
 /*
  * If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..5779a2d591 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
 			return 0;
 }
 
+int startscase_with(const char *str, const char *prefix)
+{
+	for (; ; str++, prefix++)
+		if (!*prefix)
+			return 1;
+		else if (tolower(*str) != tolower(*prefix))
+			return 0;
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 				 const char **arg, const char *def)
 {
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-03-07 17:30 ` [PATCH v10 3/9] strbuf: add a case insensitive starts_with() lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 17:30 ` [PATCH v10 5/9] utf8: add function to detect a missing " lars.schneider
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 26 ++++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 35 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..e4b99580f0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,32 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE") ||
+	   !strcasecmp(enc, "UTF16BE") || !strcasecmp(enc, "UTF16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE") ||
+	   !strcasecmp(enc, "UTF32BE") || !strcasecmp(enc, "UTF32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..0db1db4519 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 5/9] utf8: add function to detect a missing UTF-16/32 BOM
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-03-07 17:30 ` [PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index e4b99580f0..81c6678df1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -564,6 +564,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   (!strcasecmp(enc, "UTF-16") || !strcasecmp(enc, "UTF16")) &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   (!strcasecmp(enc, "UTF-32") || !strcasecmp(enc, "UTF32")) &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-03-07 17:30 ` [PATCH v10 5/9] utf8: add function to detect a missing " lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 17:54   ` Eric Sunshine
  2018-03-07 19:35   ` Junio C Hamano
  2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt  |  80 +++++++++++++++++++++++
 convert.c                        | 110 +++++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 133 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 324 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+------------------------
+*.ps1		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+------------------------
+*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.ps1
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..80549ff2b5 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, const char *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+	int die_on_error = conv_flags & CONV_WRITE_OBJECT;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (die_on_error)
+			die(msg, path, enc, default_encoding);
+		else {
+			error(msg, path, enc, default_encoding);
+			return 0;
+		}
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, const char *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, default_encoding, enc);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1051,21 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static const char *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	return value;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1121,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	const char *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,7 +1130,8 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
 		git_config(read_convert_config, NULL);
 	}
@@ -1064,6 +1154,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1235,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1265,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1297,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1769,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.working_tree_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada9..e2f319d677 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 0000000000..89b4dbee1d
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,133 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test files' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+
+	# Line ending tests
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	# BOM tests
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	# Add only UTF-16 file, we will add the UTF-32 file later
+	cp test.utf16.raw test.utf16 &&
+	cp test.utf32.raw test.utf32 &&
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	test_when_finished "rm -f test.utf16.git" &&
+
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	test_when_finished "rm -f test.utf16.raw" &&
+
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16
+'
+
+test_expect_success 'check $GIT_DIR/info/attributes support' '
+	test_when_finished "rm -f test.utf32.git" &&
+	test_when_finished "git reset --hard HEAD" &&
+
+	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+	git add test.utf32 &&
+
+	git cat-file -p :test.utf32 >test.utf32.git &&
+	test_cmp_bin test.utf8.raw test.utf32.git
+'
+
+for i in 16 32
+do
+	test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
+		test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
+		test_when_finished "git reset --hard HEAD^" &&
+
+		cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
+		cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+		cp crlf.utf${i}.raw eol.utf${i} &&
+
+		cat >expectIndexLF <<-EOF &&
+			i/lf    w/-text attr/text             	eol.utf${i}
+		EOF
+
+		git add eol.utf${i} &&
+		git commit -m eol &&
+
+		# UTF-${i} with CRLF (Windows line endings)
+		rm eol.utf${i} &&
+		git -c core.eol=crlf checkout eol.utf${i} &&
+		test_cmp_bin crlf.utf${i}.raw eol.utf${i} &&
+
+		# Although the file has CRLF in the working tree,
+		# ensure LF in the index
+		git ls-files --eol eol.utf${i} >actual &&
+		test_cmp expectIndexLF actual &&
+
+		# UTF-${i} with LF (Unix line endings)
+		rm eol.utf${i} &&
+		git -c core.eol=lf checkout eol.utf${i} &&
+		test_cmp_bin lf.utf${i}.raw eol.utf${i} &&
+
+		# The file LF in the working tree, ensure LF in the index
+		git ls-files --eol eol.utf${i} >actual &&
+		test_cmp expectIndexLF actual
+	'
+done
+
+test_expect_success 'check unsupported encodings' '
+	test_when_finished "git reset --hard HEAD" &&
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+	test_when_finished "git reset --hard $BEFORE_STATE" &&
+
+	# Add and commit a UTF-16 file but skip the "working-tree-encoding"
+	# filter. Consequently, the in-repo representation is UTF-16 and not
+	# UTF-8. This simulates a Git version that has no working tree encoding
+	# support.
+	echo "*.utf16le text working-tree-encoding=utf-16le" >.gitattributes &&
+	echo "hallo" >nonsense.utf16le &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16le) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16le &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out
+'
+
+test_done
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 18:04   ` Eric Sunshine
  2018-03-07 19:49   ` Junio C Hamano
  2018-03-07 17:30 ` [PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
  2018-03-07 17:30 ` [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  8 siblings, 2 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 55 +++++++++++++++++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh | 56 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+)

diff --git a/convert.c b/convert.c
index 80549ff2b5..f19a15dd02 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+		      const char *data, size_t len, int die_on_error)
+{
+	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+	if (startscase_with(enc, "UTF")) {
+		/*
+		 * Check for detectable errors in UTF encodings
+		 */
+		if (has_prohibited_utf_bom(enc, data, len)) {
+			const char *error_msg = _(
+				"BOM is prohibited in '%s' if encoded as %s");
+			/*
+			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
+			 * We cut off the last two characters of the encoding name
+			 # to generate the encoding name suitable for BOMs.
+			 */
+			const char *advise_msg = _(
+				"The file '%s' contains a byte order "
+				"mark (BOM). Please use %s as "
+				"working-tree-encoding.");
+			char *upper_enc = xstrdup_toupper(enc);
+			upper_enc[strlen(upper_enc)-2] = '\0';
+			advise(advise_msg, path, upper_enc);
+			free(upper_enc);
+			if (die_on_error)
+				die(error_msg, path, enc);
+			else {
+				return error(error_msg, path, enc);
+			}
+
+		} else if (is_missing_required_utf_bom(enc, data, len)) {
+			const char *error_msg = _(
+				"BOM is required in '%s' if encoded as %s");
+			const char *advise_msg = _(
+				"The file '%s' is missing a byte order "
+				"mark (BOM). Please use %sBE or %sLE "
+				"(depending on the byte order) as "
+				"working-tree-encoding.");
+			char *upper_enc = xstrdup_toupper(enc);
+			advise(advise_msg, path, upper_enc, upper_enc);
+			free(upper_enc);
+			if (die_on_error)
+				die(error_msg, path, enc);
+			else {
+				return error(error_msg, path, enc);
+			}
+		}
+
+	}
+	return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +343,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (!buf && !src)
 		return 1;
 
+	if (validate_encoding(path, enc, src, src_len, die_on_error))
+		return 0;
+
 	dst = reencode_string_len(src, src_len, default_encoding, enc,
 				  &dst_len);
 	if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 89b4dbee1d..aa05f82166 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
 
 for i in 16 32
 do
+	test_expect_success "check prohibited UTF-${i} BOM" '
+		test_when_finished "git reset --hard HEAD" &&
+
+		echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
+		echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&
+
+		# Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
+		# but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32).
+		# In these cases the BOM is prohibited.
+		cp bebom.utf${i}be.raw bebom.utf${i}be &&
+		test_must_fail git add bebom.utf${i}be 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
+
+		cp lebom.utf${i}le.raw lebom.utf${i}be &&
+		test_must_fail git add lebom.utf${i}be 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
+
+		cp bebom.utf${i}be.raw bebom.utf${i}le &&
+		test_must_fail git add bebom.utf${i}le 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out &&
+
+		cp lebom.utf${i}le.raw lebom.utf${i}le &&
+		test_must_fail git add lebom.utf${i}le 2>err.out &&
+		test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
+	'
+
+	test_expect_success "check required UTF-${i} BOM" '
+		test_when_finished "git reset --hard HEAD" &&
+
+		echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&
+
+		cp nobom.utf${i}be.raw nobom.utf${i} &&
+		test_must_fail git add nobom.utf${i} 2>err.out &&
+		test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
+
+		cp nobom.utf${i}le.raw nobom.utf${i} &&
+		test_must_fail git add nobom.utf${i} 2>err.out &&
+		test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out
+	'
+
 	test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
 		test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
 		test_when_finished "git reset --hard HEAD^" &&
@@ -130,4 +170,20 @@ test_expect_success 'error if encoding round trip is not the same during refresh
 	test_i18ngrep "error: .* overwritten by checkout:" err.out
 '
 
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+	test_when_finished "git reset --hard $BEFORE_STATE" &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out
+'
+
 test_done
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
                   ` (6 preceding siblings ...)
  2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 17:30 ` [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  8 siblings, 0 replies; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 25 +++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index f19a15dd02..5776dfbc99 100644
--- a/convert.c
+++ b/convert.c
@@ -318,6 +318,29 @@ static int validate_encoding(const char *path, const char *enc,
 	return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -346,6 +369,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (validate_encoding(path, enc, src, src_len, die_on_error))
 		return 0;
 
+	trace_encoding("source", path, enc, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc,
 				  &dst_len);
 	if (!dst) {
@@ -363,6 +387,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			return 0;
 		}
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index aa05f82166..6f3a82f61b 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
                   ` (7 preceding siblings ...)
  2018-03-07 17:30 ` [PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
@ 2018-03-07 17:30 ` lars.schneider
  2018-03-07 19:59   ` Junio C Hamano
  8 siblings, 1 reply; 32+ messages in thread
From: lars.schneider @ 2018-03-07 17:30 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/config.txt         |  6 ++++
 Documentation/gitattributes.txt  |  8 +++++
 config.c                         |  5 +++
 convert.c                        | 78 ++++++++++++++++++++++++++++++++++++++++
 convert.h                        |  1 +
 environment.c                    |  1 +
 t/t0028-working-tree-encoding.sh | 39 ++++++++++++++++++++
 7 files changed, 138 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
 	This variable can be set to 'input',
 	in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+	A comma separated list of encodings that Git performs UTF-8 round
+	trip checks on if they are used in an `working-tree-encoding`
+	attribute (see linkgit:gitattributes[5]). The default value is
+	`SHIFT-JIS`.
+
 core.symlinks::
 	If false, symbolic links are checked out as small plain files that
 	contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkroundtripencoding")) {
+		check_roundtrip_encoding = xstrdup(value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.notesref")) {
 		notes_ref_name = xstrdup(value);
 		return 0;
diff --git a/convert.c b/convert.c
index 5776dfbc99..e861f1abbc 100644
--- a/convert.c
+++ b/convert.c
@@ -341,6 +341,43 @@ static void trace_encoding(const char *context, const char *path,
 	strbuf_release(&trace);
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+	/*
+	 * check_roundtrip_encoding contains a string of space and/or
+	 * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+	 * Search for the given encoding in that string.
+	 */
+	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+	const char *next;
+	int len;
+	if (!found)
+		return 0;
+	next = found + strlen(enc_name);
+	len = strlen(check_roundtrip_encoding);
+	return (found && (
+			/*
+			 * check that the found encoding is at the
+			 * beginning of check_roundtrip_encoding or
+			 * that it is prefixed with a space or comma
+			 */
+			found == check_roundtrip_encoding || (
+				found > check_roundtrip_encoding &&
+				(*(found-1) == ' ' || *(found-1) == ',')
+			)
+		) && (
+			/*
+			 * check that the found encoding is at the
+			 * end of check_roundtrip_encoding or
+			 * that it is suffixed with a space or comma
+			 */
+			next == check_roundtrip_encoding + len || (
+				next < check_roundtrip_encoding + len &&
+				(*next == ' ' || *next == ',')
+			)
+		));
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -389,6 +426,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	}
 	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
+	/*
+	 * UTF supports lossless conversion round tripping [1] and conversions
+	 * between UTF and other encodings are mostly round trip safe as
+	 * Unicode aims to be a superset of all other character encodings.
+	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+	 * trip issues [2]. Check the round trip conversion for all encodings
+	 * listed in core.checkRoundtripEncoding.
+	 *
+	 * The round trip check is only performed if content is written to Git.
+	 * This ensures that no information is lost during conversion to/from
+	 * the internal UTF-8 representation.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without an iconv error. Iconv errors
+	 * are already caught above.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if (die_on_error && check_roundtrip(enc)) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc, default_encoding,
+					     &re_src_len);
+
+		trace_printf("Checking roundtrip encoding for %s...\n", enc);
+		trace_encoding("reencoded source", path, enc,
+			       re_src, re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			die(msg, path, enc, default_encoding);
+		}
+
+		free(re_src);
+	}
+
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
 }
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };
 
 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
 						  const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 6f3a82f61b..7cff41a350 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -188,4 +188,43 @@ test_expect_success 'error if encoding garbage is already in Git' '
 	test_i18ngrep "error: BOM is required" err.out
 '
 
+test_expect_success 'check roundtrip encoding' '
+	test_when_finished "rm -f roundtrip.shift roundtrip.utf16" &&
+	test_when_finished "git reset --hard HEAD" &&
+
+	text="hallo there!\nroundtrip test here!" &&
+	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+	# SHIFT-JIS encoded files are round-trip checked by default...
+	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# ... unless we overwrite the Git config!
+	! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage \
+		add .gitattributes roundtrip.shift 2>&1 |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# UTF-16 encoded files should not be round-trip checked by default...
+	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-16, UTF-32" \
+		add roundtrip.utf16 2>&1 |
+		grep "Checking roundtrip encoding for utf-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	# (here we also check that the casing of the encoding is irrelevant)
+	GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-32, utf-16" \
+		add roundtrip.utf16 2>&1 |
+		grep "Checking roundtrip encoding for utf-16" &&
+	git reset
+'
+
 test_done
-- 
2.16.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute
  2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-03-07 17:54   ` Eric Sunshine
  2018-03-07 22:56     ` Lars Schneider
  2018-03-07 19:35   ` Junio C Hamano
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Sunshine @ 2018-03-07 17:54 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@autodesk.com> wrote:
> [...]
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/gitattributes.txt  |  80 +++++++++++++++++++++++
> diff --git a/convert.c b/convert.c
> @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static const char *default_encoding = "UTF-8";
> @@ -978,6 +1051,21 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> +       const char *value = check->value;
> +
> +       if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +           !strlen(value))
> +               return NULL;
> +
> +       /* Don't encode to the default encoding */
> +       if (!strcasecmp(value, default_encoding))
> +               return NULL;

As of v10, the rest of the code accepts encoding names "UTF-xx" and
"UTFxx" (case insensitive), however, this check recognizes only
"UTF-8" (case insensitive). For consistency, one would expect this
also to recognize "UTF8" (case insensitive).

> +
> +       return value;
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
@ 2018-03-07 18:04   ` Eric Sunshine
  2018-03-09 17:02     ` Lars Schneider
  2018-03-07 19:49   ` Junio C Hamano
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Sunshine @ 2018-03-07 18:04 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Git List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Jeff King, Ramsay Jones, Johannes Schindelin,
	Lars Schneider

On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@autodesk.com> wrote:
> Check that new content is valid with respect to the user defined
> 'working-tree-encoding' attribute.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static int validate_encoding(const char *path, const char *enc,
> +                     const char *data, size_t len, int die_on_error)
> +{
> +       /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +       if (startscase_with(enc, "UTF")) {
> +               /*
> +                * Check for detectable errors in UTF encodings
> +                */
> +               if (has_prohibited_utf_bom(enc, data, len)) {
> +                       const char *error_msg = _(
> +                               "BOM is prohibited in '%s' if encoded as %s");
> +                       /*
> +                        * This advice is shown for UTF-??BE and UTF-??LE encodings.
> +                        * We cut off the last two characters of the encoding name
> +                        # to generate the encoding name suitable for BOMs.

s/#/*/

> +                        */
> +                       const char *advise_msg = _(
> +                               "The file '%s' contains a byte order "
> +                               "mark (BOM). Please use %s as "
> +                               "working-tree-encoding.");
> +                       char *upper_enc = xstrdup_toupper(enc);
> +                       upper_enc[strlen(upper_enc)-2] = '\0';

Due to startscase_with(...,"UTF"), we know at this point that the
string is at least 3 characters long, thus it's safe to back up by 2.
Good.

> +                       advise(advise_msg, path, upper_enc);
> +                       free(upper_enc);
> +                       if (die_on_error)
> +                               die(error_msg, path, enc);
> +                       else {
> +                               return error(error_msg, path, enc);
> +                       }
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
>  for i in 16 32
>  do
> +       test_expect_success "check prohibited UTF-${i} BOM" '
> +               test_when_finished "git reset --hard HEAD" &&
> +
> +               echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
> +               echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&

v10 is checking only hyphenated lowercase encoding name; earlier
versions checked uppercase. For better coverage, it would be nice to
check several combinations: all uppercase, all lowercase, mixed case,
hyphenated, not hyphenated.

I'm not suggesting running all the tests repeatedly but rather just
varying the format of the encoding name in these tests you're adding.
For instance, the above could instead be:

    echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes &&
    echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&

or something.

> +               # Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian)
> +               # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32).
> +               # In these cases the BOM is prohibited.
> +               cp bebom.utf${i}be.raw bebom.utf${i}be &&
> +               test_must_fail git add bebom.utf${i}be 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
> +
> +               cp lebom.utf${i}le.raw lebom.utf${i}be &&
> +               test_must_fail git add lebom.utf${i}be 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
> +
> +               cp bebom.utf${i}be.raw bebom.utf${i}le &&
> +               test_must_fail git add bebom.utf${i}le 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out &&
> +
> +               cp lebom.utf${i}le.raw lebom.utf${i}le &&
> +               test_must_fail git add lebom.utf${i}le 2>err.out &&
> +               test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
> +       '
> +
> +       test_expect_success "check required UTF-${i} BOM" '
> +               test_when_finished "git reset --hard HEAD" &&
> +
> +               echo "*.utf${i} text working-tree-encoding=utf-${i}" >>.gitattributes &&

This is another opportunity for checking a variation (case,
hyphenation) of the encoding name rather than using only hyphenated
lowercase.

> +
> +               cp nobom.utf${i}be.raw nobom.utf${i} &&
> +               test_must_fail git add nobom.utf${i} 2>err.out &&
> +               test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
> +
> +               cp nobom.utf${i}le.raw nobom.utf${i} &&
> +               test_must_fail git add nobom.utf${i} 2>err.out &&
> +               test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out
> +       '

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute
  2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
  2018-03-07 17:54   ` Eric Sunshine
@ 2018-03-07 19:35   ` Junio C Hamano
  1 sibling, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 19:35 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> +static const char *git_path_check_encoding(struct attr_check_item *check)
> +{
> +	const char *value = check->value;
> +
> +	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +	    !strlen(value))
> +		return NULL;

This means that having "* working-tree-encoding" (without "=value")
in your .gitattributes file is silently ignored.  

Thinking aloud.  I wonder if that is something we would want to warn
about so that the project can fix it, perhaps?  Or would that become
too noisy to use an older version of Git that includes this series
when a newer version that defines new meanings to boolean
(set/unset) values for w-t-e attribute becomes available and
projects adopt it?  On the other hand, with this code as-is or with
an additional warning, such an "older version" of Git by definition
behaves differently from such a "newer version" that does something
different when the attribute is not a non-empty string, so it is
quite likely that we won't be able to redefine or extend the meaning
of w-t-e in a way like that.

Which means that the only sensible way to make sure we _could_ later
extend the meaning of w-t-e and make it behave differently when set
to a non-empty string is to make it an error in _this_ "older"
version.  That way, of course people cannot use the "older" version
on a newer project that depends on the way how "newer" Git treats
w-t-e that is set to, say, "true", but we won't silently (or loudly)
do wrong things to the project's data.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
  2018-03-07 18:04   ` Eric Sunshine
@ 2018-03-07 19:49   ` Junio C Hamano
  2018-03-07 22:12     ` Lars Schneider
  1 sibling, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 19:49 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> +static int validate_encoding(const char *path, const char *enc,
> +		      const char *data, size_t len, int die_on_error)
> +{
> +	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +	if (startscase_with(enc, "UTF")) {
> +		/*
> +		 * Check for detectable errors in UTF encodings
> +		 */
> +		if (has_prohibited_utf_bom(enc, data, len)) {
> +			const char *error_msg = _(
> +				"BOM is prohibited in '%s' if encoded as %s");
> +			/*
> +			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
> +			 * We cut off the last two characters of the encoding name
> +			 # to generate the encoding name suitable for BOMs.
> +			 */

Yuck.  The code pretends to abstract away the details in a helper
has_prohibited_x() yet the caller still knows quite a lot.

> +			const char *advise_msg = _(
> +				"The file '%s' contains a byte order "
> +				"mark (BOM). Please use %s as "
> +				"working-tree-encoding.");
> +			char *upper_enc = xstrdup_toupper(enc);
> +			upper_enc[strlen(upper_enc)-2] = '\0';
> +			advise(advise_msg, path, upper_enc);
> +			free(upper_enc);

I think this up-casing is more problematic than without, not from
the point of view of the internal code, but from the point of view
of the end user experience.  When the user writes utf16le or
utf-16le and the data does not trigger the BOM check, we are likely
to successfully convert it.  I do not see the merit of suggesting
UTF16 or UTF-16 in such a case, over telling them to just drop the
byte-order suffix from the encoding names (i.e. utf16 or utf-16).

If you are trying to force/nudge people in the direction of
canonical way of spelling things (which may not be a bad idea), then
"utf16le" as the original input would want to result in "UTF-16"
with dash in the advise, no?

On the other hand, if we are not enforcing such a policy decision
but merely explaining a way to work around this check, then it may
be better to give a variant with the smaller difference from the
original (i.e. without up-casing).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-07 17:30 ` [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
@ 2018-03-07 19:59   ` Junio C Hamano
  2018-03-07 22:44     ` Lars Schneider
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 19:59 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> +static int check_roundtrip(const char* enc_name)

The asterisk sticks to the variable, not type.

> +{
> +	/*
> +	 * check_roundtrip_encoding contains a string of space and/or
> +	 * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
> +	 * Search for the given encoding in that string.
> +	 */
> +	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
> +	const char *next;
> +	int len;
> +	if (!found)
> +		return 0;
> +	next = found + strlen(enc_name);
> +	len = strlen(check_roundtrip_encoding);
> +	return (found && (
> +			/*
> +			 * check that the found encoding is at the
> +			 * beginning of check_roundtrip_encoding or
> +			 * that it is prefixed with a space or comma
> +			 */
> +			found == check_roundtrip_encoding || (
> +				found > check_roundtrip_encoding &&
> +				(*(found-1) == ' ' || *(found-1) == ',')
> +			)

The second line is unneeded, as we know a non-NULL found either
points at check_roundtrip_encoding or somewhere to the right, and
the first test already checked the "points exactly at" case.

This is defined to be a comma separated list, so it is unnecessary
to accept <cre,en> == <"FOO, SHIFT-JIS, BAR", "SHIFT-JIS">; if you
allow SP, perhaps "isspace(found[-1]) || found[-1] == ','" to also
allow HT may also be appropriate.  I think "comma or whitespace
separated list" is fine; in any case, the comment near the beginning
of this function does not match new text in Documentation/config.txt
added by this patch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 19:49   ` Junio C Hamano
@ 2018-03-07 22:12     ` Lars Schneider
  2018-03-07 22:32       ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Lars Schneider @ 2018-03-07 22:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	sunshine, peff, ramsay, Johannes.Schindelin


> On 07 Mar 2018, at 20:49, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +		      const char *data, size_t len, int die_on_error)
>> +{
>> +	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +	if (startscase_with(enc, "UTF")) {
>> +		/*
>> +		 * Check for detectable errors in UTF encodings
>> +		 */
>> +		if (has_prohibited_utf_bom(enc, data, len)) {
>> +			const char *error_msg = _(
>> +				"BOM is prohibited in '%s' if encoded as %s");
>> +			/*
>> +			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
>> +			 * We cut off the last two characters of the encoding name
>> +			 # to generate the encoding name suitable for BOMs.
>> +			 */
> 
> Yuck.  The code pretends to abstract away the details in a helper
> has_prohibited_x() yet the caller still knows quite a lot.

True, but has_prohibited_x() cannot create a proper error/advise
message unless we give it more parameters (e.g. path name).
Therefore, I don't see a better way right now.


>> +			const char *advise_msg = _(
>> +				"The file '%s' contains a byte order "
>> +				"mark (BOM). Please use %s as "
>> +				"working-tree-encoding.");
>> +			char *upper_enc = xstrdup_toupper(enc);
>> +			upper_enc[strlen(upper_enc)-2] = '\0';
>> +			advise(advise_msg, path, upper_enc);
>> +			free(upper_enc);
> 
> I think this up-casing is more problematic than without, not from
> the point of view of the internal code, but from the point of view
> of the end user experience.  When the user writes utf16le or
> utf-16le and the data does not trigger the BOM check, we are likely
> to successfully convert it.  I do not see the merit of suggesting
> UTF16 or UTF-16 in such a case, over telling them to just drop the
> byte-order suffix from the encoding names (i.e. utf16 or utf-16).
> 
> If you are trying to force/nudge people in the direction of
> canonical way of spelling things (which may not be a bad idea), then
> "utf16le" as the original input would want to result in "UTF-16"
> with dash in the advise, no?

Correct. In the error messages I kept the encoding name "as-is" and
only in the advise message I used the uppercase variant to steer
people into the canonical direction. My initial reason for this was
that in is_missing_required_utf_bom() we add "BE/LE" to the encoding
in the advise message. Let's say the user used "Utf-16" as encoding.
 Should "BE/LE" be upper case or lower case? To avoid that question 
I made it always upper case.

I also would have liked to advise "UTF-16" instead of "UTF16" as
you suggested. However, that required a few more lines and I wanted
to keep the change to a minimum. I feel this could be added in a
follow up patch.


> On the other hand, if we are not enforcing such a policy decision
> but merely explaining a way to work around this check, then it may
> be better to give a variant with the smaller difference from the
> original (i.e. without up-casing).

See example mentioned above: "Utf-16". How would you handle that?


Thanks,
Lars


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 22:12     ` Lars Schneider
@ 2018-03-07 22:32       ` Junio C Hamano
  2018-03-07 22:49         ` Lars Schneider
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 22:32 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	sunshine, peff, ramsay, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> I also would have liked to advise "UTF-16" instead of "UTF16" as
> you suggested. However, that required a few more lines and I wanted
> to keep the change to a minimum. I feel this could be added in a
> follow up patch.

I'd say the whole upcase thing belongs to such a follow-up patch if
that is the case.

>> On the other hand, if we are not enforcing such a policy decision
>> but merely explaining a way to work around this check, then it may
>> be better to give a variant with the smaller difference from the
>> original (i.e. without up-casing).
>
> See example mentioned above: "Utf-16". How would you handle that?

Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16"
or "Utf16" if the advise message does not force policy, or "UTF-16"
in the canoical form if it does.  Is there a problem?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-07 19:59   ` Junio C Hamano
@ 2018-03-07 22:44     ` Lars Schneider
  2018-03-07 22:52       ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Lars Schneider @ 2018-03-07 22:44 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 07 Mar 2018, at 20:59, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> +static int check_roundtrip(const char* enc_name)
> 
> The asterisk sticks to the variable, not type.

Argh. I need to put this check into Travis CI ;-)


>> +{
>> +	/*
>> +	 * check_roundtrip_encoding contains a string of space and/or
>> +	 * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
>> +	 * Search for the given encoding in that string.
>> +	 */
>> +	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
>> +	const char *next;
>> +	int len;
>> +	if (!found)
>> +		return 0;
>> +	next = found + strlen(enc_name);
>> +	len = strlen(check_roundtrip_encoding);
>> +	return (found && (
>> +			/*
>> +			 * check that the found encoding is at the
>> +			 * beginning of check_roundtrip_encoding or
>> +			 * that it is prefixed with a space or comma
>> +			 */
>> +			found == check_roundtrip_encoding || (
>> +				found > check_roundtrip_encoding &&
>> +				(*(found-1) == ' ' || *(found-1) == ',')
>> +			)
> 
> The second line is unneeded, as we know a non-NULL found either
> points at check_roundtrip_encoding or somewhere to the right, and
> the first test already checked the "points exactly at" case.

Eric objected that too [1], but noted the documentation value:

    Although the 'found > check_roundtrip_encoding' expression is
    effectively dead code in this case, it does document that the
    programmer didn't just blindly use '*(found-1)' without taking
    boundary conditions into consideration.

    (It's dead code because, at this point, we know that strcasestr()
    didn't return NULL and we know that 'found' doesn't equal
    'check_roundtrip_encoding', therefore it _must_ be greater than
    'check_roundtrip_encoding'.)

[1] https://public-inbox.org/git/CAPig+cR81J3fTOtrgAumAs=RC5hqYFfSmeb-ru-Yf_ahFuBiew@mail.gmail.com/

Although the line is unnecessary, I felt it is safer/easier to 
understand and maintain. Since both of you tripped over it, I will
remove it though.


> This is defined to be a comma separated list, so it is unnecessary
> to accept <cre,en> == <"FOO, SHIFT-JIS, BAR", "SHIFT-JIS">; if you
> allow SP, perhaps "isspace(found[-1]) || found[-1] == ','" to also
> allow HT may also be appropriate.

I don't think HT makes too much sense. However, isspace() is nice
and I will use it. Being more permissive on the inputs should hurt.


>  I think "comma or whitespace
> separated list" is fine; in any case, the comment near the beginning
> of this function does not match new text in Documentation/config.txt
> added by this patch.

Nice catch. Will fix.


Thanks,
Lars


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 22:32       ` Junio C Hamano
@ 2018-03-07 22:49         ` Lars Schneider
  2018-03-07 22:57           ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Lars Schneider @ 2018-03-07 22:49 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	sunshine, peff, ramsay, Johannes.Schindelin


> On 07 Mar 2018, at 23:32, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> I also would have liked to advise "UTF-16" instead of "UTF16" as
>> you suggested. However, that required a few more lines and I wanted
>> to keep the change to a minimum. I feel this could be added in a
>> follow up patch.
> 
> I'd say the whole upcase thing belongs to such a follow-up patch if
> that is the case.
> 
>>> On the other hand, if we are not enforcing such a policy decision
>>> but merely explaining a way to work around this check, then it may
>>> be better to give a variant with the smaller difference from the
>>> original (i.e. without up-casing).
>> 
>> See example mentioned above: "Utf-16". How would you handle that?
> 
> Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16"
> or "Utf16" if the advise message does not force policy, or "UTF-16"
> in the canoical form if it does.  Is there a problem?

In the case of has_prohibited_utf_bom() you are right as we are 
dropping the BE/LE suffix in the advise. However, look at the 
is_missing_required_utf_bom() advise. Here we *add* BE/LE.

At this point I thought it would make sense to make the advised
encoding name uppercase in both situations. OK with you?

- Lars





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-07 22:44     ` Lars Schneider
@ 2018-03-07 22:52       ` Junio C Hamano
  2018-03-07 22:58         ` Lars Schneider
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 22:52 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> Although the line is unnecessary, I felt it is safer/easier to 
> understand and maintain. Since both of you tripped over it, I will
> remove it though.

I didn't actually trip over it.  It made it look as if the coder
didn't understand what the code is doing to have that extra line.

> I don't think HT makes too much sense. However, isspace() is nice
> and I will use it. Being more permissive on the inputs should hurt.

You are being incoherent in these three sentences.  If you want to
be more strict and only allow SP, then isspace() is already too
broad, as it does allow HT (and even CR and LF).

I do think HT makes just as much sense as SP, though, so if you use
isspace(), that would be perfectly fine.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute
  2018-03-07 17:54   ` Eric Sunshine
@ 2018-03-07 22:56     ` Lars Schneider
  2018-03-07 22:57       ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Lars Schneider @ 2018-03-07 22:56 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 07 Mar 2018, at 18:54, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@autodesk.com> wrote:
>> [...]
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> Documentation/gitattributes.txt  |  80 +++++++++++++++++++++++
>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static const char *default_encoding = "UTF-8";
>> @@ -978,6 +1051,21 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +       const char *value = check->value;
>> +
>> +       if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +           !strlen(value))
>> +               return NULL;
>> +
>> +       /* Don't encode to the default encoding */
>> +       if (!strcasecmp(value, default_encoding))
>> +               return NULL;
> 
> As of v10, the rest of the code accepts encoding names "UTF-xx" and
> "UTFxx" (case insensitive), however, this check recognizes only
> "UTF-8" (case insensitive). For consistency, one would expect this
> also to recognize "UTF8" (case insensitive).

Nice catch. What do you think about this solution using is_encoding_utf8()
from utf.c?
	
	if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding)) 

- Lars

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 22:49         ` Lars Schneider
@ 2018-03-07 22:57           ` Junio C Hamano
  2018-03-07 23:19             ` Lars Schneider
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 22:57 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	sunshine, peff, ramsay, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> In the case of has_prohibited_utf_bom() you are right as we are 
> dropping the BE/LE suffix in the advise. However, look at the 
> is_missing_required_utf_bom() advise. Here we *add* BE/LE.

So?  Then add BE/LE like "Utf-16BE" or "utf16BE".  You do not have
enough information to decide what is best for the user anyway (which
is why I am not convinced that it is even a good idea to always
upcase in the first place).

> At this point I thought it would make sense to make the advised
> encoding name uppercase in both situations. OK with you?

In the endgame, if upcased and properly dashed form is always used,
that would be good (if we are enforcing the policy, which I am not
onboard 100% but it's your code and I do not care too strongly about
it).  I do not see much point in an interim step that only upcases
without doing the dash insertion.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute
  2018-03-07 22:56     ` Lars Schneider
@ 2018-03-07 22:57       ` Junio C Hamano
  0 siblings, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 22:57 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Eric Sunshine, Lars Schneider, Git List,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> Nice catch. What do you think about this solution using is_encoding_utf8()
> from utf.c?

That helper was invented exactly for a case like this, I would
think.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-03-07 22:52       ` Junio C Hamano
@ 2018-03-07 22:58         ` Lars Schneider
  0 siblings, 0 replies; 32+ messages in thread
From: Lars Schneider @ 2018-03-07 22:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 07 Mar 2018, at 23:52, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> I don't think HT makes too much sense. However, isspace() is nice
>> and I will use it. Being more permissive on the inputs should hurt.
> 
> You are being incoherent in these three sentences.  If you want to
> be more strict and only allow SP, then isspace() is already too
> broad, as it does allow HT (and even CR and LF).

I meant "Being more permissive on the inputs shouldN'T hurt." :-)


> I do think HT makes just as much sense as SP, though, so if you use
> isspace(), that would be perfectly fine.

OK!

- Lars


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 22:57           ` Junio C Hamano
@ 2018-03-07 23:19             ` Lars Schneider
  2018-03-07 23:34               ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Lars Schneider @ 2018-03-07 23:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	sunshine, peff, ramsay, Johannes.Schindelin


> On 07 Mar 2018, at 23:57, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>> At this point I thought it would make sense to make the advised
>> encoding name uppercase in both situations. OK with you?
> 
> In the endgame, if upcased and properly dashed form is always used,
> that would be good (if we are enforcing the policy, which I am not
> onboard 100% but it's your code and I do not care too strongly about
> it).  I do not see much point in an interim step that only upcases
> without doing the dash insertion.

I would like to advise the dashed form as this seems to be the
canonical form and it avoids cross platform issues. My macOS
iconv does not support the form without dashes.

Would this approach work for you?

			const char *advise_msg = _(
				"The file '%s' contains a byte order "
				"mark (BOM). Please use UTF-%s as "
				"working-tree-encoding.");
			const char *stripped;
			char *upper = xstrdup_toupper(enc);
			upper[strlen(upper)-2] = '\0';
			skip_prefix(upper, "UTF-", &stripped) ||
			skip_prefix(stripped, "UTF", &stripped);
			advise(advise_msg, path, stripped);

Thanks,
Lars

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 23:19             ` Lars Schneider
@ 2018-03-07 23:34               ` Junio C Hamano
  0 siblings, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2018-03-07 23:34 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	sunshine, peff, ramsay, Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> I would like to advise the dashed form as this seems to be the
> canonical form and it avoids cross platform issues. My macOS
> iconv does not support the form without dashes.

Sure, that is why I said canonicalization without inserting dash
does not make much sense, hence an interim step with only upcasing
is not a good idea.  A possible interim solution would be to do
nothing (no dash insertion, no upcasing) and fixing both in a later
follow-up patch, but as I said, I do not care too strongly either
way.

> Would this approach work for you?
>
> 			const char *advise_msg = _(
> 				"The file '%s' contains a byte order "
> 				"mark (BOM). Please use UTF-%s as "
> 				"working-tree-encoding.");
> 			const char *stripped;
> 			char *upper = xstrdup_toupper(enc);
> 			upper[strlen(upper)-2] = '\0';
> 			skip_prefix(upper, "UTF-", &stripped) ||
> 			skip_prefix(stripped, "UTF", &stripped);
> 			advise(advise_msg, path, stripped);

Are you now interested in not having any interim step and jump
directly to the endgame solution?  If so, that is fine by me, too,
but as I already said earlier (i.e. not doing this BOM check for an
encoding that is not spelled in your canonical upcase-with-dash form
might be a feature that leaves an escape hatch), I am not all that
interested in enforcing policy at this point in the codepath to
begin with, so...


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
  2018-03-07 17:30 ` [PATCH v10 3/9] strbuf: add a case insensitive starts_with() lars.schneider
@ 2018-03-08  0:31   ` Duy Nguyen
  2018-03-08 23:12     ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Duy Nguyen @ 2018-03-08  0:31 UTC (permalink / raw)
  To: lars.schneider
  Cc: Git Mailing List, Junio C Hamano, Torsten Bögershausen,
	Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes Schindelin, Lars Schneider

On Thu, Mar 8, 2018 at 12:30 AM,  <lars.schneider@autodesk.com> wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Check in a case insensitive manner if one string is a prefix of another
> string.
>
> This function is used in a subsequent commit.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  git-compat-util.h | 1 +
>  strbuf.c          | 9 +++++++++
>  2 files changed, 10 insertions(+)
>
> diff --git a/git-compat-util.h b/git-compat-util.h
> index 68b2ad531e..f648da0c11 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, va_list params);
>  extern void set_die_is_recursing_routine(int (*routine)(void));
>
>  extern int starts_with(const char *str, const char *prefix);
> +extern int startscase_with(const char *str, const char *prefix);

This name is a bit hard to read. Boost [1] goes with istarts_with. I
wonder if it's better. If not I guess either starts_with_case or
starts_case_with will improve readability.

[1] http://www.boost.org/doc/libs/1_41_0/doc/html/boost/algorithm/istarts_with.html

>
>  /*
>   * If the string "str" begins with the string found in "prefix", return 1.
> diff --git a/strbuf.c b/strbuf.c
> index b635f0bdc4..5779a2d591 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
>                         return 0;
>  }
>
> +int startscase_with(const char *str, const char *prefix)
> +{
> +       for (; ; str++, prefix++)
> +               if (!*prefix)
> +                       return 1;
> +               else if (tolower(*str) != tolower(*prefix))
> +                       return 0;
> +}
> +
>  int skip_to_optional_arg_default(const char *str, const char *prefix,
>                                  const char **arg, const char *def)
>  {
> --
> 2.16.2
>



-- 
Duy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
  2018-03-08  0:31   ` Duy Nguyen
@ 2018-03-08 23:12     ` Junio C Hamano
  2018-03-09 15:54       ` Lars Schneider
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-08 23:12 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: lars.schneider, Git Mailing List, Torsten Bögershausen,
	Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
	Johannes Schindelin, Lars Schneider

Duy Nguyen <pclouds@gmail.com> writes:

>>  extern int starts_with(const char *str, const char *prefix);
>> +extern int startscase_with(const char *str, const char *prefix);
>
> This name is a bit hard to read. Boost [1] goes with istarts_with. I
> wonder if it's better. If not I guess either starts_with_case or
> starts_case_with will improve readability.

starts_with_case() sounds quite strange even though
starts_with_icase() may make it clear that it is "a variant of
starts_with() function that ignores case".  I dunno.
dir.c::cmp_icase() takes the _icase suffix for not quite the way
that is consistent with that understanding, though.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
  2018-03-08 23:12     ` Junio C Hamano
@ 2018-03-09 15:54       ` Lars Schneider
  2018-03-09 17:20         ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Lars Schneider @ 2018-03-09 15:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Duy Nguyen, Lars Schneider, Git Mailing List,
	Torsten Bögershausen, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes Schindelin


> On 09 Mar 2018, at 00:12, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Duy Nguyen <pclouds@gmail.com> writes:
> 
>>> extern int starts_with(const char *str, const char *prefix);
>>> +extern int startscase_with(const char *str, const char *prefix);
>> 
>> This name is a bit hard to read. Boost [1] goes with istarts_with. I
>> wonder if it's better. If not I guess either starts_with_case or
>> starts_case_with will improve readability.
> 
> starts_with_case() sounds quite strange even though
> starts_with_icase() may make it clear that it is "a variant of
> starts_with() function that ignores case".  I dunno.
> dir.c::cmp_icase() takes the _icase suffix for not quite the way
> that is consistent with that understanding, though.

I think following the boost lib makes most sense. Therefore,
I would like to go with "istarts_with". OK with you?

- Lars

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
  2018-03-07 18:04   ` Eric Sunshine
@ 2018-03-09 17:02     ` Lars Schneider
  0 siblings, 0 replies; 32+ messages in thread
From: Lars Schneider @ 2018-03-09 17:02 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Lars Schneider, Git List, Junio C Hamano,
	Torsten Bögershausen, Johannes Sixt, Jeff King, Ramsay Jones,
	Johannes Schindelin


> On 07 Mar 2018, at 19:04, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@autodesk.com> wrote:
>> Check that new content is valid with respect to the user defined
>> 'working-tree-encoding' attribute.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static int validate_encoding(const char *path, const char *enc,
>> +                     const char *data, size_t len, int die_on_error)
>> +{
>> +       /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +       if (startscase_with(enc, "UTF")) {
>> +               /*
>> +                * Check for detectable errors in UTF encodings
>> +                */
>> +               if (has_prohibited_utf_bom(enc, data, len)) {
>> +                       const char *error_msg = _(
>> +                               "BOM is prohibited in '%s' if encoded as %s");
>> +                       /*
>> +                        * This advice is shown for UTF-??BE and UTF-??LE encodings.
>> +                        * We cut off the last two characters of the encoding name
>> +                        # to generate the encoding name suitable for BOMs.
> 
> s/#/*/

Of course!


>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
>> for i in 16 32
>> do
>> +       test_expect_success "check prohibited UTF-${i} BOM" '
>> +               test_when_finished "git reset --hard HEAD" &&
>> +
>> +               echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
>> +               echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&
> 
> v10 is checking only hyphenated lowercase encoding name; earlier
> versions checked uppercase. For better coverage, it would be nice to
> check several combinations: all uppercase, all lowercase, mixed case,
> hyphenated, not hyphenated.
> 
> I'm not suggesting running all the tests repeatedly but rather just
> varying the format of the encoding name in these tests you're adding.
> For instance, the above could instead be:
> 
>    echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes &&
>    echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&
> 
> or something.

The casing is a good idea - I will do that. I don't want to do "hyphenated, not 
hyphenated" as this would make the tests fail on macOS (and I believe on Windows).


Thanks,
Lars

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
  2018-03-09 15:54       ` Lars Schneider
@ 2018-03-09 17:20         ` Junio C Hamano
  2018-03-09 19:06           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-03-09 17:20 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Duy Nguyen, Lars Schneider, Git Mailing List,
	Torsten Bögershausen, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> I think following the boost lib makes most sense. Therefore,
> I would like to go with "istarts_with". OK with you?

I don't care too deeply; if we took starts_with() from there, where
what we now want is defined as istarts_with(), then that sounds like
a good thing to do.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
  2018-03-09 17:20         ` Junio C Hamano
@ 2018-03-09 19:06           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-03-09 19:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, Duy Nguyen, Lars Schneider, Git Mailing List,
	Torsten Bögershausen, Johannes Sixt, Eric Sunshine,
	Jeff King, Ramsay Jones, Johannes Schindelin


On Fri, Mar 09 2018, Junio C. Hamano jotted:

> Lars Schneider <larsxschneider@gmail.com> writes:
>
>> I think following the boost lib makes most sense. Therefore,
>> I would like to go with "istarts_with". OK with you?
>
> I don't care too deeply; if we took starts_with() from there, where
> what we now want is defined as istarts_with(), then that sounds like
> a good thing to do.

I don't care either, but just a note that we had this exact discussion
around this time last year when I added a starts_with_icase() [1][2]
which eventually got dropped.

1. https://public-inbox.org/git/xmqqpohao2hw.fsf@gitster.mtv.corp.google.com/
2. https://public-inbox.org/git/20170326121654.22035-4-avarab@gmail.com/

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-03-09 19:06 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-07 17:30 [PATCH v10 0/9] convert: add support for different encodings lars.schneider
2018-03-07 17:30 ` [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-03-07 17:30 ` [PATCH v10 2/9] strbuf: add xstrdup_toupper() lars.schneider
2018-03-07 17:30 ` [PATCH v10 3/9] strbuf: add a case insensitive starts_with() lars.schneider
2018-03-08  0:31   ` Duy Nguyen
2018-03-08 23:12     ` Junio C Hamano
2018-03-09 15:54       ` Lars Schneider
2018-03-09 17:20         ` Junio C Hamano
2018-03-09 19:06           ` Ævar Arnfjörð Bjarmason
2018-03-07 17:30 ` [PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-03-07 17:30 ` [PATCH v10 5/9] utf8: add function to detect a missing " lars.schneider
2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
2018-03-07 17:54   ` Eric Sunshine
2018-03-07 22:56     ` Lars Schneider
2018-03-07 22:57       ` Junio C Hamano
2018-03-07 19:35   ` Junio C Hamano
2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
2018-03-07 18:04   ` Eric Sunshine
2018-03-09 17:02     ` Lars Schneider
2018-03-07 19:49   ` Junio C Hamano
2018-03-07 22:12     ` Lars Schneider
2018-03-07 22:32       ` Junio C Hamano
2018-03-07 22:49         ` Lars Schneider
2018-03-07 22:57           ` Junio C Hamano
2018-03-07 23:19             ` Lars Schneider
2018-03-07 23:34               ` Junio C Hamano
2018-03-07 17:30 ` [PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
2018-03-07 17:30 ` [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-03-07 19:59   ` Junio C Hamano
2018-03-07 22:44     ` Lars Schneider
2018-03-07 22:52       ` Junio C Hamano
2018-03-07 22:58         ` Lars Schneider

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).