[PATCH v2 0/5] convert: add support for different encodings

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v2 0/5] convert: add support for different encodings
@ 2017-12-29 15:22 lars.schneider
  2017-12-29 15:22 ` [PATCH v2 1/5] strbuf: add xstrdup_toupper() lars.schneider
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: lars.schneider @ 2017-12-29 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, j6t, sunshine, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

notable changes since v1:

* In [1] Peff described a situation where you "couldn't checkout _away_
  from" problems because of "die() in convert_to_git()". I fixed this and
  tried to replicate the situation with the test "error if encoding
  garbage is already in Git". To achieve that I had to pass the
  'write_obj' variable through all the functions to let the encoding
  code know if Git actually writes content. If Git actually writes
  content then we die. Otherwise we just print an error. This ensures
  no garbage is added to Git.

* I renamed the attribute to "checkout-encoding" to avoid trouble with
  the existing "encoding" attribute. I also considered
  "working-tree-encoding". We also discussed "worktree-encoding" as
  attribute name but I am was worried that people could (wrongly)
  associated that with the 'git worktree' feature. I am happy to hear
  arguments for/against any of the different options.

* I read a bit on the Internet and as far as I understand it to/from
  UTF-8 reencodings are mostly round trip safe. Notable exceptions
  are some characters in the SHIFT-JIS character set:
  https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

  However, I was not able to replicate the issue. Consider this:
      X="\x87\x90"
      printf "$X" | od -h
      printf "$X" | iconv -f SHIFT-JIS -t UTF-8 | iconv -f UTF-8 -t SHIFT-JIS | od -h

  ... the reencoding from SHIFT-JIS to UTF-8 would always fail.
  Therefore, I was not able to generate a test case that produces a
  different result after the round trip. Anyone an idea how to do that?

* Patch 1, 2, 3, and 5 are helper functions. Patch 4 is the real change.

Thanks,
Lars

     RFC: ttps://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.comh
      v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/

Base Ref: master
Web-Diff: https://github.com/larsxschneider/git/commit/e11e375fdb
Checkout: git fetch https://github.com/larsxschneider/git encoding-v2 && git checkout e11e375fdb


### Interdiff (v1..v2):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 84de2fa49c..0039bd38c3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -146,33 +146,6 @@ Unspecified::
 Any other value causes Git to act as if `text` has been left
 unspecified.

-`encoding`
-^^^^^^^^^^
-
-By default Git assumes UTF-8 encoding for text files.  The `encoding`
-attribute sets the encoding to be used in the working directory.
-If the path is added to the index, then Git encodes the content to
-UTF-8.  On checkout the content is encoded back to the original
-encoding.  Consequently, you can use all built-in Git text processing
-tools (e.g. git diff, line ending conversions, etc.) with your
-non-UTF-8 encoded file.
-
-Please note that re-encoding content can cause errors and requires
-resources. Use the `encoding` attribute only if you cannot store
-a file in UTF-8 encoding and if you want Git to be able to process
-the text.
-
-------------------------
-*.txt		text encoding=UTF-16
-------------------------
-
-All `iconv` encodings with a stable round-trip conversion to and from
-UTF-8 are supported.  You can see a full list with the following command:
-
-------------------------
-iconv --list
-------------------------
-
 `eol`
 ^^^^^

@@ -299,6 +272,65 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.


+`checkout-encoding`
+^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can teach Git the encoding of a file in the working
+directory with the `checkout-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure. On checkout the content is encoded back to the specified
+encoding.
+
+Please note that using the `checkout-encoding` attribute has a number
+of drawbacks:
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+- Git clients that do not support the `checkout-encoding` attribute or
+  the used encoding will checkout the respective files as UTF-8 encoded.
+  That means the content appears to be different which could cause
+  trouble. Affected clients are older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of January 2018).
+
+Use the `checkout-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt		text checkout-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
 `ident`
 ^^^^^^^

diff --git a/apply.c b/apply.c
index 321a9fa68d..c4bd5cf1f2 100644
--- a/apply.c
+++ b/apply.c
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);
 		return 0;
 	default:
 		return -1;
diff --git a/blame.c b/blame.c
index 2893f3c103..388b66897b 100644
--- a/blame.c
+++ b/blame.c
@@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
 		if (strbuf_read(&buf, 0, 0) < 0)
 			die_errno("failed to read from stdin");
 	}
-	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0);
+	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0, 0);
 	origin->file.ptr = buf.buf;
 	origin->file.size = buf.len;
 	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119a..4555e49b5f 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 			if (is_file) {
 				struct strbuf buf = STRBUF_INIT;

-				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
+				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf, 0)) {
 					free(result);
 					result = strbuf_detach(&buf, &len);
 					result_size = len;
diff --git a/convert.c b/convert.c
index ee19c17104..ca7b2f3e5c 100644
--- a/convert.c
+++ b/convert.c
@@ -257,28 +257,40 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,

 }

-#ifdef NO_ICONV
-#ifndef _ICONV_T
-/* The type is just a placeholder and not actually used. */
-typedef void* iconv_t;
-#endif
-#endif
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}

 static struct encoding {
 	const char *name;
-	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
-	iconv_t to_worktree;  /* conversion to user-defined encoding */
 	struct encoding *next;
 } *encoding, **encoding_tail;
-static const char *default_encoding = "utf-8";
-static iconv_t invalid_conversion = (iconv_t)-1;
+static const char *default_encoding = "UTF-8";

 static int encode_to_git(const char *path, const char *src, size_t src_len,
-			 struct strbuf *buf, struct encoding *enc)
+			 struct strbuf *buf, struct encoding *enc, int write_obj)
 {
-#ifndef NO_ICONV
-	char *dst, *re_src;
-	int dst_len, re_src_len;
+	char *dst;
+	int dst_len;

 	/*
 	 * No encoding is specified or there is nothing to encode.
@@ -296,70 +308,102 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (!buf && !src)
 		return 1;

-	if (enc->to_git == invalid_conversion) {
-		enc->to_git = iconv_open(default_encoding, encoding->name);
-		if (enc->to_git == invalid_conversion)
-			warning(_("unsupported encoding %s"), encoding->name);
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as checkout encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (write_obj)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as checkout encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (write_obj)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
 	}

-	if (enc->to_worktree == invalid_conversion)
-		enc->to_worktree = iconv_open(encoding->name, default_encoding);
-
-	/*
-	 * Do nothing if the encoding is not supported. This could happen in
-	 * two cases:
-	 *   (1) The encoding is garbage. That is no problem as we would not
-	 *       encode the content to UTF-8 on "add" and we would not encode
-	 *       it back on "checkout".
-	 *   (2) Git users use different iconv versions that support different
-	 *       encodings. This could lead to problems, as the content might
-	 *       not be encoded on "add" but encoded back on "checkout" (or
-	 *       the other way around).
-	 * We print a one-time warning to the user in both cases above.
-	 */
-	if (enc->to_git == invalid_conversion || enc->to_worktree == invalid_conversion)
-		return 0;
-
-	dst = reencode_string_iconv(src, src_len, enc->to_git, &dst_len);
-	if (!dst)
+	trace_encoding("source", path, enc->name, src, src_len);
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
 		/*
 		 * We could add the blob "as-is" to Git. However, on checkout
 		 * we would try to reencode to the original encoding. This
 		 * would fail and we would leave the user with a messed-up
 		 * working tree. Let's try to avoid this by screaming loud.
 		 */
-		die(_("failed to encode '%s' from %s to %s"),
-			path, enc->name, default_encoding);
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (write_obj)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);

 	/*
-	 * Encode dst back to ensure no information is lost. This wastes
-	 * a few cycles as most conversions are round trip conversion
-	 * safe. However, content that has an invalid encoding might not
-	 * match its original byte sequence after the UTF-8 conversion
-	 * round trip. Let's play safe here and check the round trip
-	 * conversion.
+	 * UTF supports lossless round tripping [1]. UTF to other encoding are
+	 * mostly round trip safe as Unicode aims to be a superset of all other
+	 * character encodings. However, the SHIFT-JIS (Japanese character set)
+	 * is an exception as some codes are not round trip safe [2].
+	 *
+	 * Reverse the transformation of 'dst' and check the result with 'src'
+	 * if content is written to Git. This ensures no information is lost
+	 * during conversion to/from UTF-8.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without iconv error.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
 	 */
-	re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
-	if (!re_src || strcmp(src, re_src)) {
-		die(_("encoding '%s' from %s to %s and back is not the same"),
-			path, enc->name, default_encoding);
+	if (write_obj && !strcmp(enc->name, "SHIFT-JIS")) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			if (write_obj)
+				die(msg, path, enc->name, default_encoding);
+			else
+				error(msg, path, enc->name, default_encoding);
 		}
+
 		free(re_src);
+	}

 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
-#else
-	warning(_("cannot encode '%s' from %s to %s because "
-		"your Git was not compiled with encoding support"),
-		path, enc->name, default_encoding);
-	return 0;
-#endif
 }

 static int encode_to_worktree(const char *path, const char *src, size_t src_len,
 			      struct strbuf *buf, struct encoding *enc)
 {
-#ifndef NO_ICONV
 	char *dst;
 	int dst_len;

@@ -370,34 +414,16 @@ static int encode_to_worktree(const char *path, const char *src, size_t src_len,
 	if (!enc || (src && !src_len))
 		return 0;

-	if (enc->to_worktree == invalid_conversion) {
-		enc->to_worktree = iconv_open(encoding->name, default_encoding);
-		if (enc->to_worktree == invalid_conversion)
-			warning("unsupported encoding %s", encoding->name);
-	}
-
-	/*
-	 * Do nothing if the encoding is not supported.
-	 * See detailed explanation in encode_to_git().
-	 */
-	if (enc->to_worktree == invalid_conversion)
-		return 0;
-
-	dst = reencode_string_iconv(src, src_len, enc->to_worktree, &dst_len);
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
 	if (!dst) {
-		warning("failed to encode '%s' from %s to %s",
-			path, default_encoding, encoding->name);
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
 		return 0;
 	}

 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
-#else
-	warning(_("cannot encode '%s' from %s to %s because "
-		"your Git was not compiled with encoding support"),
-		path, default_encoding, encoding->name);
-	return 0;
-#endif
 }

 static int crlf_to_git(const struct index_state *istate,
@@ -1118,11 +1144,12 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
 	const char *value = check->value;
 	struct encoding *enc;

-	if (ATTR_UNSET(value))
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
 		return NULL;

 	for (enc = encoding; enc; enc = enc->next)
-		if (!strcmp(value, enc->name))
+		if (!strcasecmp(value, enc->name))
 			return enc;

 	/* Don't encode to the default encoding */
@@ -1130,9 +1157,7 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
 		return NULL;

 	enc = xcalloc(1, sizeof(struct convert_driver));
-	enc->name = xstrdup(value);
-	enc->to_git = invalid_conversion;
-	enc->to_worktree = invalid_conversion;
+	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
 	*encoding_tail = enc;
 	encoding_tail = &(enc->next);

@@ -1194,7 +1219,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
-	struct encoding *encoding; /* Supported encoding or default encoding if NULL */
+	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
 };

 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1203,7 +1228,8 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)

 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", "encoding", NULL);
+					 "eol", "text", "checkout-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
 		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
@@ -1227,7 +1253,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
-		ca->encoding = git_path_check_encoding(ccheck + 5);
+		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1293,7 +1319,7 @@ const char *get_convert_attr_ascii(const char *path)

 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-                   struct strbuf *dst, enum safe_crlf checksafe)
+		   struct strbuf *dst, enum safe_crlf checksafe, int write_obj)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1309,7 +1335,7 @@ int convert_to_git(const struct index_state *istate,
 		len = dst->len;
 	}

-	ret |= encode_to_git(path, src, len, dst, ca.encoding);
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, write_obj);
 	if (ret && dst) {
 		src = dst->buf;
 		len = dst->len;
@@ -1327,7 +1353,7 @@ int convert_to_git(const struct index_state *istate,

 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe)
+			      enum safe_crlf checksafe, int write_obj)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
@@ -1338,7 +1364,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

-	encode_to_git(path, dst->buf, dst->len, dst, ca.encoding);
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, write_obj);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1370,7 +1396,7 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}

-	ret |= encode_to_worktree(path, src, len, dst, ca.encoding);
+	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
 	if (ret) {
 		src = dst->buf;
 		len = dst->len;
@@ -1404,7 +1430,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
 		src = dst->buf;
 		len = dst->len;
 	}
-	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE);
+	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE, 0);
 }

 /*****************************************************************
@@ -1842,7 +1868,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;

-	if (ca.encoding)
+	if (ca.checkout_encoding)
 		return NULL;

 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/convert.h b/convert.h
index 4f2da225a8..9e4e884ec1 100644
--- a/convert.h
+++ b/convert.h
@@ -66,7 +66,8 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe);
+			  struct strbuf *dst, enum safe_crlf checksafe,
+			  int write_obj);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -79,13 +80,14 @@ extern int renormalize_buffer(const struct index_state *istate,
 static inline int would_convert_to_git(const struct index_state *istate,
 				       const char *path)
 {
-	return convert_to_git(istate, path, NULL, 0, NULL, 0);
+	return convert_to_git(istate, path, NULL, 0, NULL, 0, 0);
 }
 /* Precondition: would_convert_to_git_filter_fd(path) == true */
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe);
+				     enum safe_crlf checksafe,
+				     int write_obj);
 extern int would_convert_to_git_filter_fd(const char *path);

 /*****************************************************************
diff --git a/diff.c b/diff.c
index 2ebe2227b4..16ca0bf0df 100644
--- a/diff.c
+++ b/diff.c
@@ -3599,7 +3599,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn, 0)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/sha1_file.c b/sha1_file.c
index afe4b90f6e..75800248d2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1694,7 +1694,8 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags))) {
+				   get_safe_crlf(flags),
+				   write_object)) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1728,7 +1729,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));

 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags));
+				 get_safe_crlf(flags), write_object);

 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
diff --git a/strbuf.c b/strbuf.c
index 323c49ceb3..54276e96e7 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -760,6 +760,19 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }

+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	result[i] = '\0';
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 0a74acb236..2bc148526f 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -616,6 +616,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);

 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);

 /**
  * Create a newly allocated string using printf format. You can do this easily
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
new file mode 100755
index 0000000000..df3cc91269
--- /dev/null
+++ b/t/t0028-checkout-encoding.sh
@@ -0,0 +1,199 @@
+#!/bin/sh
+
+test_description='checkout-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
+test_expect_success 'setup test repo' '
+
+	text="hallo there!\ncan you read me?" &&
+
+	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
+
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text checkout-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
deleted file mode 100755
index 24f24f9e0d..0000000000
--- a/t/t0028-encoding.sh
+++ /dev/null
@@ -1,60 +0,0 @@
-#!/bin/sh
-
-test_description='encoding conversion via gitattributes'
-
-. ./test-lib.sh
-
-test_expect_success 'setup test repo' '
-
-	text="hallo there!\ncan you read me?" &&
-
-	echo "*.utf16 text encoding=utf-16" >.gitattributes &&
-	printf "$text" >t.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >t.utf16.raw &&
-	cp t.utf16.raw t.utf16 &&
-
-	git add .gitattributes t.utf16.raw t.utf16 &&
-	git commit -m initial
-'
-
-test_expect_success 'ensure UTF-8 is stored in Git' '
-	git cat-file -p :t.utf16 >t.utf16.git &&
-	test_cmp_bin t.utf8.raw t.utf16.git
-'
-
-test_expect_success 're-encode to UTF-16 on checkout' '
-	rm t.utf16 &&
-	git checkout t.utf16 &&
-	test_cmp_bin t.utf16.raw t.utf16
-'
-
-test_expect_success 'warn if an unsupported encoding is used' '
-	echo "*.garbage text encoding=garbage" >>.gitattributes &&
-	printf "garbage" >t.garbage &&
-	git add t.garbage 2>error.out &&
-	test_i18ngrep "warning: unsupported encoding" error.out &&
-
-	# cleanup
-	git reset --hard HEAD
-'
-
-test_expect_success 'fail if files with invalid encoding are added' '
-	printf "\0\0h\0a" >error.utf16 &&
-	# The test string encoding would fail
-	# test_must_fail iconv -f utf-16 -t utf-8 error.utf16 &&
-	test_must_fail git add error.utf16
-'
-
-# Some sequences might trigger errno == E2BIG in reencode_string_iconv, utf.8.
-# This would cause no error on "git add" and, consequently, the Git internal
-# UTF-8 encoded blob would contain garbage. Hence, the worktree file after a
-# checkout would contain garbage, too. This garbage would not match the file
-# that was initially added.
-test_expect_success 'fail if encoding from X to UTF-8 and back to X is not the same' '
-	printf "\xc3\x28" >error.utf16 &&
-	# The test string re-encoding would fail
-	# iconv -f utf-16 -t utf-8 error.utf16 | iconv -f utf-8 -t utf-16 &&
-	test_must_fail git add error.utf16
-'
-
-test_done
diff --git a/utf8.c b/utf8.c
index 2c27ce0137..1978d6c42a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,43 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif

+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+const char utf16_be_bom[] = {0xFE, 0xFF};
+const char utf16_le_bom[] = {0xFF, 0xFE};
+const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..26b5e91852 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,29 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);

+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif


### Patches

Lars Schneider (5):
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add support for 'checkout-encoding' attribute
  convert: add tracing for checkout-encoding

 Documentation/gitattributes.txt |  59 +++++++++++
 apply.c                         |   2 +-
 blame.c                         |   2 +-
 combine-diff.c                  |   2 +-
 convert.c                       | 224 +++++++++++++++++++++++++++++++++++++++-
 convert.h                       |   8 +-
 diff.c                          |   2 +-
 sha1_file.c                     |   5 +-
 strbuf.c                        |  13 +++
 strbuf.h                        |   1 +
 t/t0028-checkout-encoding.sh    | 199 +++++++++++++++++++++++++++++++++++
 utf8.c                          |  37 +++++++
 utf8.h                          |  25 +++++
 13 files changed, 566 insertions(+), 13 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh


base-commit: 95ec6b1b3393eb6e26da40c565520a8db9796e9f
--
2.15.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 1/5] strbuf: add xstrdup_toupper()
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
@ 2017-12-29 15:22 ` lars.schneider
  2017-12-29 15:56   ` Torsten Bögershausen
  2017-12-29 15:22 ` [PATCH v2 2/5] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: lars.schneider @ 2017-12-29 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, j6t, sunshine, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 13 +++++++++++++
 strbuf.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 323c49ceb3..54276e96e7 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -760,6 +760,19 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	result[i] = '\0';
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 0a74acb236..2bc148526f 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -616,6 +616,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/5] utf8: add function to detect prohibited UTF-16/32 BOM
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
  2017-12-29 15:22 ` [PATCH v2 1/5] strbuf: add xstrdup_toupper() lars.schneider
@ 2017-12-29 15:22 ` lars.schneider
  2017-12-29 15:22 ` [PATCH v2 3/5] utf8: add function to detect a missing " lars.schneider
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2017-12-29 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, j6t, sunshine, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..776660ee12 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+const char utf16_be_bom[] = {0xFE, 0xFF};
+const char utf16_le_bom[] = {0xFF, 0xFE};
+const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/5] utf8: add function to detect a missing UTF-16/32 BOM
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
  2017-12-29 15:22 ` [PATCH v2 1/5] strbuf: add xstrdup_toupper() lars.schneider
  2017-12-29 15:22 ` [PATCH v2 2/5] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2017-12-29 15:22 ` lars.schneider
  2017-12-29 15:22 ` [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute lars.schneider
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2017-12-29 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, j6t, sunshine, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 776660ee12..1978d6c42a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..26b5e91852 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2017-12-29 15:22 ` [PATCH v2 3/5] utf8: add function to detect a missing " lars.schneider
@ 2017-12-29 15:22 ` lars.schneider
  2017-12-29 17:28   ` Torsten Bögershausen
  2017-12-29 15:22 ` [PATCH v2 5/5] convert: add tracing for checkout-encoding lars.schneider
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: lars.schneider @ 2017-12-29 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, j6t, sunshine, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt |  59 ++++++++++++
 apply.c                         |   2 +-
 blame.c                         |   2 +-
 combine-diff.c                  |   2 +-
 convert.c                       | 196 ++++++++++++++++++++++++++++++++++++++-
 convert.h                       |   8 +-
 diff.c                          |   2 +-
 sha1_file.c                     |   5 +-
 t/t0028-checkout-encoding.sh    | 197 ++++++++++++++++++++++++++++++++++++++++
 9 files changed, 460 insertions(+), 13 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..0039bd38c3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,65 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`checkout-encoding`
+^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can teach Git the encoding of a file in the working
+directory with the `checkout-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure. On checkout the content is encoded back to the specified
+encoding.
+
+Please note that using the `checkout-encoding` attribute has a number
+of drawbacks:
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+- Git clients that do not support the `checkout-encoding` attribute or
+  the used encoding will checkout the respective files as UTF-8 encoded.
+  That means the content appears to be different which could cause
+  trouble. Affected clients are older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of January 2018).
+
+Use the `checkout-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt		text checkout-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/apply.c b/apply.c
index 321a9fa68d..c4bd5cf1f2 100644
--- a/apply.c
+++ b/apply.c
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);
 		return 0;
 	default:
 		return -1;
diff --git a/blame.c b/blame.c
index 2893f3c103..388b66897b 100644
--- a/blame.c
+++ b/blame.c
@@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
 		if (strbuf_read(&buf, 0, 0) < 0)
 			die_errno("failed to read from stdin");
 	}
-	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0);
+	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0, 0);
 	origin->file.ptr = buf.buf;
 	origin->file.size = buf.len;
 	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119a..4555e49b5f 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 			if (is_file) {
 				struct strbuf buf = STRBUF_INIT;
 
-				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
+				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf, 0)) {
 					free(result);
 					result = strbuf_detach(&buf, &len);
 					result_size = len;
diff --git a/convert.c b/convert.c
index 20d7ab67bd..fc8c96b670 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -256,6 +257,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int write_obj)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as checkout encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (write_obj)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as checkout encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (write_obj)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (write_obj)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	/*
+	 * UTF supports lossless round tripping [1]. UTF to other encoding are
+	 * mostly round trip safe as Unicode aims to be a superset of all other
+	 * character encodings. However, the SHIFT-JIS (Japanese character set)
+	 * is an exception as some codes are not round trip safe [2].
+	 *
+	 * Reverse the transformation of 'dst' and check the result with 'src'
+	 * if content is written to Git. This ensures no information is lost
+	 * during conversion to/from UTF-8.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without iconv error.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if (write_obj && !strcmp(enc->name, "SHIFT-JIS")) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			if (write_obj)
+				die(msg, path, enc->name, default_encoding);
+			else
+				error(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -969,6 +1111,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1024,6 +1191,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1032,8 +1200,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "checkout-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1055,6 +1225,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1120,7 +1291,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-                   struct strbuf *dst, enum safe_crlf checksafe)
+		   struct strbuf *dst, enum safe_crlf checksafe, int write_obj)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1135,6 +1306,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, write_obj);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
 		if (ret && dst) {
@@ -1147,7 +1325,7 @@ int convert_to_git(const struct index_state *istate,
 
 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe)
+			      enum safe_crlf checksafe, int write_obj)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
@@ -1158,6 +1336,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, write_obj);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1189,6 +1368,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1217,7 +1402,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
 		src = dst->buf;
 		len = dst->len;
 	}
-	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE);
+	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE, 0);
 }
 
 /*****************************************************************
@@ -1655,6 +1840,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.checkout_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 4f2da225a8..9e4e884ec1 100644
--- a/convert.h
+++ b/convert.h
@@ -66,7 +66,8 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe);
+			  struct strbuf *dst, enum safe_crlf checksafe,
+			  int write_obj);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -79,13 +80,14 @@ extern int renormalize_buffer(const struct index_state *istate,
 static inline int would_convert_to_git(const struct index_state *istate,
 				       const char *path)
 {
-	return convert_to_git(istate, path, NULL, 0, NULL, 0);
+	return convert_to_git(istate, path, NULL, 0, NULL, 0, 0);
 }
 /* Precondition: would_convert_to_git_filter_fd(path) == true */
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe);
+				     enum safe_crlf checksafe,
+				     int write_obj);
 extern int would_convert_to_git_filter_fd(const char *path);
 
 /*****************************************************************
diff --git a/diff.c b/diff.c
index 2ebe2227b4..16ca0bf0df 100644
--- a/diff.c
+++ b/diff.c
@@ -3599,7 +3599,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn, 0)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/sha1_file.c b/sha1_file.c
index afe4b90f6e..75800248d2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1694,7 +1694,8 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags))) {
+				   get_safe_crlf(flags),
+				   write_object)) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1728,7 +1729,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));
 
 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags));
+				 get_safe_crlf(flags), write_object);
 
 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
new file mode 100755
index 0000000000..1a329ab933
--- /dev/null
+++ b/t/t0028-checkout-encoding.sh
@@ -0,0 +1,197 @@
+#!/bin/sh
+
+test_description='checkout-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+
+	text="hallo there!\ncan you read me?" &&
+
+	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
+
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text checkout-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 5/5] convert: add tracing for checkout-encoding
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2017-12-29 15:22 ` [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute lars.schneider
@ 2017-12-29 15:22 ` lars.schneider
  2017-12-31  8:05 ` [PATCH 0/5] V2B: simplify convert.c/h tboegi
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: lars.schneider @ 2017-12-29 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, tboegi, j6t, sunshine, peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_CHECKOUT_ENCODING environment variable to enable
tracing for content that is reencoded with the checkout-encoding
attribute.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                    | 28 ++++++++++++++++++++++++++++
 t/t0028-checkout-encoding.sh |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/convert.c b/convert.c
index fc8c96b670..ca7b2f3e5c 100644
--- a/convert.c
+++ b/convert.c
@@ -257,6 +257,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -316,6 +339,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -331,6 +355,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	/*
 	 * UTF supports lossless round tripping [1]. UTF to other encoding are
@@ -356,6 +381,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 					     enc->name, default_encoding,
 					     &re_src_len);
 
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
 		if (!re_src || src_len != re_src_len ||
 		    memcmp(src, re_src, src_len)) {
 			const char* msg = _("encoding '%s' from %s to %s and "
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
index 1a329ab933..df3cc91269 100755
--- a/t/t0028-checkout-encoding.sh
+++ b/t/t0028-checkout-encoding.sh
@@ -4,6 +4,8 @@ test_description='checkout-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
 test_expect_success 'setup test repo' '
 
 	text="hallo there!\ncan you read me?" &&
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/5] strbuf: add xstrdup_toupper()
  2017-12-29 15:22 ` [PATCH v2 1/5] strbuf: add xstrdup_toupper() lars.schneider
@ 2017-12-29 15:56   ` Torsten Bögershausen
  2017-12-29 16:55     ` Lars Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Torsten Bögershausen @ 2017-12-29 15:56 UTC (permalink / raw)
  To: lars.schneider; +Cc: git, gitster, j6t, sunshine, peff, Lars Schneider

On Fri, Dec 29, 2017 at 04:22:18PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Create a copy of an existing string and make all characters upper case.
> Similar xstrdup_tolower().
> 
> This function is used in a subsequent commit.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  strbuf.c | 13 +++++++++++++
>  strbuf.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/strbuf.c b/strbuf.c
> index 323c49ceb3..54276e96e7 100644
> --- a/strbuf.c
> +++ b/strbuf.c
> @@ -760,6 +760,19 @@ char *xstrdup_tolower(const char *string)
>  	return result;
>  }
>  
> +char *xstrdup_toupper(const char *string)
> +{
> +	char *result;
> +	size_t len, i;
> +
> +	len = strlen(string);
> +	result = xmallocz(len);
> +	for (i = 0; i < len; i++)
> +		result[i] = toupper(string[i]);
> +	result[i] = '\0';
        ^^^^^^^^^^^^^^^^
	Isn't this already done by xmallocz()

> +	return result;
> +}
> +
>  char *xstrvfmt(const char *fmt, va_list ap)
>  {
>  	struct strbuf buf = STRBUF_INIT;
> diff --git a/strbuf.h b/strbuf.h
> index 0a74acb236..2bc148526f 100644
> --- a/strbuf.h
> +++ b/strbuf.h
> @@ -616,6 +616,7 @@ __attribute__((format (printf,2,3)))
>  extern int fprintf_ln(FILE *fp, const char *fmt, ...);
>  
>  char *xstrdup_tolower(const char *);
> +char *xstrdup_toupper(const char *);
>  
>  /**
>   * Create a newly allocated string using printf format. You can do this easily
> -- 
> 2.15.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/5] strbuf: add xstrdup_toupper()
  2017-12-29 15:56   ` Torsten Bögershausen
@ 2017-12-29 16:55     ` Lars Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Schneider @ 2017-12-29 16:55 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: lars.schneider, git, gitster, j6t, sunshine, peff


> On 29 Dec 2017, at 16:56, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Fri, Dec 29, 2017 at 04:22:18PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Create a copy of an existing string and make all characters upper case.
>> Similar xstrdup_tolower().
>> 
>> This function is used in a subsequent commit.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> strbuf.c | 13 +++++++++++++
>> strbuf.h |  1 +
>> 2 files changed, 14 insertions(+)
>> 
>> diff --git a/strbuf.c b/strbuf.c
>> index 323c49ceb3..54276e96e7 100644
>> --- a/strbuf.c
>> +++ b/strbuf.c
>> @@ -760,6 +760,19 @@ char *xstrdup_tolower(const char *string)
>> 	return result;
>> }
>> 
>> +char *xstrdup_toupper(const char *string)
>> +{
>> +	char *result;
>> +	size_t len, i;
>> +
>> +	len = strlen(string);
>> +	result = xmallocz(len);
>> +	for (i = 0; i < len; i++)
>> +		result[i] = toupper(string[i]);
>> +	result[i] = '\0';
>        ^^^^^^^^^^^^^^^^
> 	Isn't this already done by xmallocz()

I copied that code from xstrdup_tolower().

The original implementation [1] and its refactored version [2]
used xmalloc(). Later on xmallocz [3] was introduced.

[3] states "we can stop manually placing NUL at the end of the
allocated buffer. But that's only safe if it's clear that
the contents will always fill the buffer."

As far as I understand it, the content should always fill the
buffer in the upper/lower case conversion. Therefore, I agree 
with you that the assignment is not necessary.

- Lars


[1] d4770964d5 (config: "git config --get-urlmatch" parses section.<url>.key, 2013-07-31)
[2] 88d5a6f6cd (daemon/config: factor out duplicate xstrdup_tolower, 2014-05-22)
[3] 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute
  2017-12-29 15:22 ` [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute lars.schneider
@ 2017-12-29 17:28   ` Torsten Bögershausen
  2017-12-30 19:58     ` Lars Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Torsten Bögershausen @ 2017-12-29 17:28 UTC (permalink / raw)
  To: lars.schneider; +Cc: git, gitster, j6t, sunshine, peff, Lars Schneider

I probably need to look at convert.c more closer, some other comments inline.


On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).

UTF-8 is too strict, the text from below is more correct:
 +Git recognizes files encoded with ASCII or one of its supersets (e.g.
 +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
 +interpreted as binary and consequently built-in Git text processing
 +tools (e.g. 'git diff') as well as most Git web front ends do not
 +visualize the content.


> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will

Minor question about "canonical":
Would this mean the same ?
 ...then Git converts the content into  UTF-8.


> reverse the conversion.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/gitattributes.txt |  59 ++++++++++++
>  apply.c                         |   2 +-
>  blame.c                         |   2 +-
>  combine-diff.c                  |   2 +-
>  convert.c                       | 196 ++++++++++++++++++++++++++++++++++++++-
>  convert.h                       |   8 +-
>  diff.c                          |   2 +-
>  sha1_file.c                     |   5 +-
>  t/t0028-checkout-encoding.sh    | 197 ++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 460 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0028-checkout-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..0039bd38c3 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,65 @@ few exceptions.  Even though...
>    catch potential problems early, safety triggers.
>  
>  
> +`checkout-encoding`
> +^^^^^^^^^^^^^^^^^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +

> +In these cases you can teach Git the encoding of a file in the working
                          teach ? tell ? 
> +directory with the `checkout-encoding` attribute. If a file with this
> +attributes is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8 and stores the result in its internal data
> +structure.

Minor Q:
> its internal data structure.
Should we simply write "stores the result in the index" ?

> On checkout the content is encoded back to the specified
> +encoding.

> +
> +Please note that using the `checkout-encoding` attribute has a number
> +of drawbacks:
> +
> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
> +  errors as the conversion might not be round trip safe.
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +- Git clients that do not support the `checkout-encoding` attribute or
> +  the used encoding will checkout the respective files as UTF-8 encoded.
> +  That means the content appears to be different which could cause
> +  trouble. Affected clients are older Git versions and alternative Git
> +  implementations such as JGit or libgit2 (as of January 2018).
> +
> +Use the `checkout-encoding` attribute only if you cannot store a file in
> +UTF-8 encoding and if you want Git to be able to process the content as
> +text.
> +

I would maybe rephrase a little bit (first things first):

Please note that using the `checkout-encoding` attribute may have a number
of pitfalls:

- Git clients that do not support the `checkout-encoding` attribute
  will checkout the respective files as UTF-8 encoded,  which typically
  causes trouble.
  Escpecialy when older Git versions are used or alternative Git
  implementations such as JGit or libgit2 (as of January 2018).

- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
  errors as the conversion might not be round trip safe.

- Reencoding content requires resources that might slow down certain
  Git operations (e.g 'git checkout' or 'git add').

Use the `checkout-encoding` attribute only if you cannot store a file in
UTF-8 encoding and if you want Git to be able to process the content as
text.

-----
Side question: What happens if "the used encoding" is not supported by
the underlying iconv lib ?
Will Git fail, delete the file from the working tree ?
That may be worth to mention. (And I need to do the code-review)

> +Use the following attributes if your '*.txt' files are UTF-16 encoded
> +with byte order mark (BOM) and you want Git to perform automatic line
> +ending conversion based on your platform.
> +
> +------------------------
> +*.txt		text checkout-encoding=UTF-16
> +------------------------
> +
> +Use the following attributes if your '*.txt' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory.
> +
> +------------------------
> +*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
> +------------------------
> +
> +You can get a list of all available encodings on your platform with the
> +following command:
> +
> +------------------------
> +iconv --list
> +------------------------
> +
> +
>  `ident`
>  ^^^^^^^
>  
> diff --git a/apply.c b/apply.c
> index 321a9fa68d..c4bd5cf1f2 100644
> --- a/apply.c
> +++ b/apply.c
> @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
>  		 * should never look at the index when explicit crlf option
>  		 * is given.
>  		 */
> -		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
> +		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);

Hm. Do we really need another parameter here?
The only caller that needs it (and sets it to 1) is sha1_file.c.
When an invalid encoding/content is used, it should be safe to die() always?
Just loud thinking..

If really needed, it may need less changes in the code base, if a new function
convert_to_git_or_die() is defined, which is called by convert_to_git()

(and the other alternative would be to convert safe_crlf into a bitmap,
 see https://public-inbox.org/git/20171229132828.17594-1-tboegi@web.de/T/#u
 what do you think ?)


>  		return 0;
>  	default:
>  		return -1;
> diff --git a/blame.c b/blame.c
> index 2893f3c103..388b66897b 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
>  		if (strbuf_read(&buf, 0, 0) < 0)
>  			die_errno("failed to read from stdin");
>  	}
> -	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0);
> +	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0, 0);
>  	origin->file.ptr = buf.buf;
>  	origin->file.size = buf.len;
>  	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
> diff --git a/combine-diff.c b/combine-diff.c
> index 2505de119a..4555e49b5f 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
>  			if (is_file) {
>  				struct strbuf buf = STRBUF_INIT;
>  
> -				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
> +				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf, 0)) {
>  					free(result);
>  					result = strbuf_detach(&buf, &len);
>  					result_size = len;
> diff --git a/convert.c b/convert.c
> index 20d7ab67bd..fc8c96b670 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -256,6 +257,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>  
>  }
>  
> +static struct encoding {
> +	const char *name;
> +	struct encoding *next;
> +} *encoding, **encoding_tail;
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +			 struct strbuf *buf, struct encoding *enc, int write_obj)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	/*
> +	 * Looks like we got called from "would_convert_to_git()".
> +	 * This means Git wants to know if it would encode (= modify!)
> +	 * the content. Let's answer with "yes", since an encoding was
> +	 * specified.
> +	 */
> +	if (!buf && !src)
> +		return 1;
> +
> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is prohibited for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is prohibited with this encoding. Either use "
> +			"%.6s as checkout encoding or remove the BOM from the "
> +			"file.");
> +
> +		advise(advise_msg, path, enc->name, enc->name, enc->name);
> +		if (write_obj)
> +			die(error_msg, path, enc->name);
> +		else
> +			error(error_msg, path, enc->name);

As said before, I would just die(). Or do I miss something ?
Being strict with BOMs seams like a good idea.

> +
> +
> +	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is required for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is required with this encoding. Either use "
> +			"%sBE/%sLE as checkout encoding or add a BOM to the "
> +			"file.");
> +		advise(advise_msg, path, enc->name, enc->name, enc->name);
> +		if (write_obj)
> +			die(error_msg, path, enc->name);
> +		else
> +			error(error_msg, path, enc->name);
> +	}
> +
> +	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
> +				  &dst_len);
> +	if (!dst) {
> +		/*
> +		 * We could add the blob "as-is" to Git. However, on checkout
> +		 * we would try to reencode to the original encoding. This
> +		 * would fail and we would leave the user with a messed-up
> +		 * working tree. Let's try to avoid this by screaming loud.
> +		 */
> +		const char* msg = _("failed to encode '%s' from %s to %s");
> +		if (write_obj)
> +			die(msg, path, enc->name, default_encoding);
> +		else
> +			error(msg, path, enc->name, default_encoding);
> +	}
> +
> +	/*
> +	 * UTF supports lossless round tripping [1]. UTF to other encoding are
> +	 * mostly round trip safe as Unicode aims to be a superset of all other
> +	 * character encodings. However, the SHIFT-JIS (Japanese character set)
> +	 * is an exception as some codes are not round trip safe [2].
> +	 *
> +	 * Reverse the transformation of 'dst' and check the result with 'src'
> +	 * if content is written to Git. This ensures no information is lost
> +	 * during conversion to/from UTF-8.
> +	 *
> +	 * Please note, the code below is not tested because I was not able to
> +	 * generate a faulty round trip without iconv error.
> +	 *
> +	 * [1] http://unicode.org/faq/utf_bom.html#gen2
> +	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
> +	 */
> +	if (write_obj && !strcmp(enc->name, "SHIFT-JIS")) {
> +		char *re_src;
> +		int re_src_len;
> +
> +		re_src = reencode_string_len(dst, dst_len,
> +					     enc->name, default_encoding,
> +					     &re_src_len);
> +
> +		if (!re_src || src_len != re_src_len ||
> +		    memcmp(src, re_src, src_len)) {
> +			const char* msg = _("encoding '%s' from %s to %s and "
> +					    "back is not the same");
> +			if (write_obj)
> +				die(msg, path, enc->name, default_encoding);
> +			else
> +				error(msg, path, enc->name, default_encoding);
> +		}
> +
> +		free(re_src);
> +	}
> +
> +	strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +	return 1;
> +}
> +
> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
> +			      struct strbuf *buf, struct encoding *enc)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
> +				  &dst_len);
> +	if (!dst) {
> +		error("failed to encode '%s' from %s to %s",
> +			path, enc->name, default_encoding);
> +		return 0;
> +	}
> +
> +	strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +	return 1;
> +}
> +
>  static int crlf_to_git(const struct index_state *istate,
>  		       const char *path, const char *src, size_t len,
>  		       struct strbuf *buf,
> @@ -969,6 +1111,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>  	return 1;
>  }
>  
> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> +{
> +	const char *value = check->value;
> +	struct encoding *enc;
> +
> +	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +	    !strlen(value))
> +		return NULL;
> +
> +	for (enc = encoding; enc; enc = enc->next)
> +		if (!strcasecmp(value, enc->name))
> +			return enc;
> +
> +	/* Don't encode to the default encoding */
> +	if (!strcasecmp(value, default_encoding))
> +		return NULL;
> +
> +	enc = xcalloc(1, sizeof(struct convert_driver));
> +	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
> +	*encoding_tail = enc;
> +	encoding_tail = &(enc->next);
> +
> +	return enc;
> +}
> +
>  static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
>  {
>  	const char *value = check->value;
> @@ -1024,6 +1191,7 @@ struct conv_attrs {
>  	enum crlf_action attr_action; /* What attr says */
>  	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
>  	int ident;
> +	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
>  };
>  
>  static void convert_attrs(struct conv_attrs *ca, const char *path)
> @@ -1032,8 +1200,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  
>  	if (!check) {
>  		check = attr_check_initl("crlf", "ident", "filter",
> -					 "eol", "text", NULL);
> +					 "eol", "text", "checkout-encoding",
> +					 NULL);
>  		user_convert_tail = &user_convert;
> +		encoding_tail = &encoding;
>  		git_config(read_convert_config, NULL);
>  	}
>  
> @@ -1055,6 +1225,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  			else if (eol_attr == EOL_CRLF)
>  				ca->crlf_action = CRLF_TEXT_CRLF;
>  		}
> +		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
>  	} else {
>  		ca->drv = NULL;
>  		ca->crlf_action = CRLF_UNDEFINED;
> @@ -1120,7 +1291,7 @@ const char *get_convert_attr_ascii(const char *path)
>  
>  int convert_to_git(const struct index_state *istate,
>  		   const char *path, const char *src, size_t len,
> -                   struct strbuf *dst, enum safe_crlf checksafe)
> +		   struct strbuf *dst, enum safe_crlf checksafe, int write_obj)
>  {
>  	int ret = 0;
>  	struct conv_attrs ca;
> @@ -1135,6 +1306,13 @@ int convert_to_git(const struct index_state *istate,
>  		src = dst->buf;
>  		len = dst->len;
>  	}
> +
> +	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, write_obj);
> +	if (ret && dst) {
> +		src = dst->buf;
> +		len = dst->len;
> +	}
> +
>  	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
>  		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
>  		if (ret && dst) {
> @@ -1147,7 +1325,7 @@ int convert_to_git(const struct index_state *istate,
>  
>  void convert_to_git_filter_fd(const struct index_state *istate,
>  			      const char *path, int fd, struct strbuf *dst,
> -			      enum safe_crlf checksafe)
> +			      enum safe_crlf checksafe, int write_obj)
>  {
>  	struct conv_attrs ca;
>  	convert_attrs(&ca, path);
> @@ -1158,6 +1336,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
>  	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>  
> +	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, write_obj);
>  	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
>  	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
>  }
> @@ -1189,6 +1368,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  		}
>  	}
>  
> +	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
> +	if (ret) {
> +		src = dst->buf;
> +		len = dst->len;
> +	}
> +
>  	ret_filter = apply_filter(
>  		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
>  	if (!ret_filter && ca.drv && ca.drv->required)
> @@ -1217,7 +1402,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
>  		src = dst->buf;
>  		len = dst->len;
>  	}
> -	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE);
> +	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE, 0);
>  }
>  
>  /*****************************************************************
> @@ -1655,6 +1840,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
>  	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
>  		return NULL;
>  
> +	if (ca.checkout_encoding)
> +		return NULL;
> +
>  	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
>  		return NULL;
>  
> diff --git a/convert.h b/convert.h
> index 4f2da225a8..9e4e884ec1 100644
> --- a/convert.h
> +++ b/convert.h
> @@ -66,7 +66,8 @@ extern const char *get_convert_attr_ascii(const char *path);
>  /* returns 1 if *dst was used */
>  extern int convert_to_git(const struct index_state *istate,
>  			  const char *path, const char *src, size_t len,
> -			  struct strbuf *dst, enum safe_crlf checksafe);
> +			  struct strbuf *dst, enum safe_crlf checksafe,
> +			  int write_obj);
>  extern int convert_to_working_tree(const char *path, const char *src,
>  				   size_t len, struct strbuf *dst);
>  extern int async_convert_to_working_tree(const char *path, const char *src,
> @@ -79,13 +80,14 @@ extern int renormalize_buffer(const struct index_state *istate,
>  static inline int would_convert_to_git(const struct index_state *istate,
>  				       const char *path)
>  {
> -	return convert_to_git(istate, path, NULL, 0, NULL, 0);
> +	return convert_to_git(istate, path, NULL, 0, NULL, 0, 0);
>  }
>  /* Precondition: would_convert_to_git_filter_fd(path) == true */
>  extern void convert_to_git_filter_fd(const struct index_state *istate,
>  				     const char *path, int fd,
>  				     struct strbuf *dst,
> -				     enum safe_crlf checksafe);
> +				     enum safe_crlf checksafe,
> +				     int write_obj);
>  extern int would_convert_to_git_filter_fd(const char *path);
>  
>  /*****************************************************************
> diff --git a/diff.c b/diff.c
> index 2ebe2227b4..16ca0bf0df 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3599,7 +3599,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>  		/*
>  		 * Convert from working tree format to canonical git format
>  		 */
> -		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
> +		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn, 0)) {
>  			size_t size = 0;
>  			munmap(s->data, s->size);
>  			s->should_munmap = 0;
> diff --git a/sha1_file.c b/sha1_file.c
> index afe4b90f6e..75800248d2 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1694,7 +1694,8 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
>  	if ((type == OBJ_BLOB) && path) {
>  		struct strbuf nbuf = STRBUF_INIT;
>  		if (convert_to_git(&the_index, path, buf, size, &nbuf,
> -				   get_safe_crlf(flags))) {
> +				   get_safe_crlf(flags),
> +				   write_object)) {
>  			buf = strbuf_detach(&nbuf, &size);
>  			re_allocated = 1;
>  		}
> @@ -1728,7 +1729,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
>  	assert(would_convert_to_git_filter_fd(path));
>  
>  	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
> -				 get_safe_crlf(flags));
> +				 get_safe_crlf(flags), write_object);
>  
>  	if (write_object)
>  		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
> new file mode 100755
> index 0000000000..1a329ab933
> --- /dev/null
> +++ b/t/t0028-checkout-encoding.sh
> @@ -0,0 +1,197 @@
> +#!/bin/sh
> +
> +test_description='checkout-encoding conversion via gitattributes'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test repo' '
> +
> +	text="hallo there!\ncan you read me?" &&

Is this portable ? (the "\n")


> +
> +	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
> +
> +	printf "$text" >test.utf8.raw &&
> +	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> +	cp test.utf16.raw test.utf16 &&
> +
> +	git add .gitattributes test.utf16 &&
> +	git commit -m initial
> +'
> +
> +test_expect_success 'ensure UTF-8 is stored in Git' '
> +	git cat-file -p :test.utf16 >test.utf16.git &&
> +	test_cmp_bin test.utf8.raw test.utf16.git &&
> +	rm test.utf8.raw test.utf16.git
> +'
> +
> +test_expect_success 're-encode to UTF-16 on checkout' '
> +	rm test.utf16 &&
> +	git checkout test.utf16 &&
> +	test_cmp_bin test.utf16.raw test.utf16 &&
> +
> +	# cleanup
> +	rm test.utf16.raw
> +'
> +
> +test_expect_success 'check prohibited UTF BOM' '
> +	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
> +	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
> +	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
> +	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
> +
> +	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
> +	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
> +	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
> +	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
> +
> +	echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
> +	echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
> +	echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
> +	echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
> +
> +	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
> +	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
> +	# the BOM is prohibited.
> +	cp bebom.utf16be.raw bebom.utf16be &&
> +	test_must_fail git add bebom.utf16be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +	cp lebom.utf16le.raw lebom.utf16be &&
> +	test_must_fail git add lebom.utf16be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +	cp bebom.utf16be.raw bebom.utf16le &&
> +	test_must_fail git add bebom.utf16le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +	cp lebom.utf16le.raw lebom.utf16le &&
> +	test_must_fail git add lebom.utf16le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +	# ... and the same for UTF-32
> +	cp bebom.utf32be.raw bebom.utf32be &&
> +	test_must_fail git add bebom.utf32be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +	cp lebom.utf32le.raw lebom.utf32be &&
> +	test_must_fail git add lebom.utf32be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +	cp bebom.utf32be.raw bebom.utf32le &&
> +	test_must_fail git add bebom.utf32le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +	cp lebom.utf32le.raw lebom.utf32le &&
> +	test_must_fail git add lebom.utf32le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +	# cleanup
> +	git reset --hard HEAD
> +'
> +
> +test_expect_success 'check required UTF BOM' '
> +	echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
> +
> +	cp nobom.utf16be.raw nobom.utf16 &&
> +	test_must_fail git add nobom.utf16 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +	cp nobom.utf16le.raw nobom.utf16 &&
> +	test_must_fail git add nobom.utf16 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +	cp nobom.utf32be.raw nobom.utf32 &&
> +	test_must_fail git add nobom.utf32 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +	cp nobom.utf32le.raw nobom.utf32 &&
> +	test_must_fail git add nobom.utf32 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +	# cleanup
> +	rm nobom.utf16 nobom.utf32 &&
> +	git reset --hard HEAD
> +'
> +
> +test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
> +	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
> +	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
> +
> +	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
> +	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
> +	cp crlf.utf16.raw eol.utf16 &&
> +
> +	git add eol.utf16 &&
> +	git commit -m eol &&
> +
> +	# UTF-16 with CRLF (Windows line endings)
> +	rm eol.utf16 &&
> +	git -c core.eol=crlf checkout eol.utf16 &&
> +	test_cmp_bin crlf.utf16.raw eol.utf16 &&
> +
> +	# UTF-16 with LF (Unix line endings)
> +	rm eol.utf16 &&
> +	git -c core.eol=lf checkout eol.utf16 &&
> +	test_cmp_bin lf.utf16.raw eol.utf16 &&
> +
> +	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
> +
> +	# cleanup
> +	git reset --hard HEAD^
> +'
> +
> +test_expect_success 'check unsupported encodings' '
> +
> +	echo "*.nothing text checkout-encoding=" >>.gitattributes &&
> +	printf "nothing" >t.nothing &&
> +	git add t.nothing &&
> +
> +	echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
> +	printf "garbage" >t.garbage &&
> +	test_must_fail git add t.garbage 2>err.out &&
> +	test_i18ngrep "fatal: failed to encode" err.out &&
> +
> +	# cleanup
> +	rm err.out &&
> +	git reset --hard HEAD
> +'
> +
> +test_expect_success 'error if encoding round trip is not the same during refresh' '
> +	BEFORE_STATE=$(git rev-parse HEAD) &&
> +
> +	# Skip the UTF-16 filter for the added file
> +	# This simulates a Git version that has no checkoutEncoding support
> +	echo "hallo" >nonsense.utf16 &&
> +	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
> +	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
> +	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
> +	git update-ref refs/heads/master $COMMIT &&
> +
> +	test_must_fail git checkout HEAD^ 2>err.out &&
> +	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
> +
> +	# cleanup
> +	rm err.out &&
> +	git reset --hard $BEFORE_STATE
> +'
> +
> +test_expect_success 'error if encoding garbage is already in Git' '
> +	BEFORE_STATE=$(git rev-parse HEAD) &&
> +
> +	# Skip the UTF-16 filter for the added file
> +	# This simulates a Git version that has no checkoutEncoding support
> +	cp nobom.utf16be.raw nonsense.utf16 &&
> +	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
> +	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
> +	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
> +	git update-ref refs/heads/master $COMMIT &&
> +
> +	git diff 2>err.out &&
> +	test_i18ngrep "error: BOM is required" err.out &&
> +
> +	# cleanup
> +	rm err.out &&
> +	git reset --hard $BEFORE_STATE
> +'
> +
> +test_done
> -- 
> 2.15.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute
  2017-12-29 17:28   ` Torsten Bögershausen
@ 2017-12-30 19:58     ` Lars Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Schneider @ 2017-12-30 19:58 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Git List, gitster, j6t, sunshine, peff


> On 29 Dec 2017, at 18:28, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> I probably need to look at convert.c more closer, some other comments inline.
> 
> 
> On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
> 
> UTF-8 is too strict, the text from below is more correct:
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.

Agreed. I'll replace it.


>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
> 
> Minor question about "canonical":
> Would this mean the same ?
> ...then Git converts the content into  UTF-8.

I feel the word canonical makes sense in this context to express
that UTF-8 is not some randomly chosen encoding. AFAIK UTF-8 and 
LF are the canonical form for text in Git, no?


>> +In these cases you can teach Git the encoding of a file in the working
>                          teach ? tell ? 

"tell", agreed!


>> +directory with the `checkout-encoding` attribute. If a file with this
>> +attributes is added to Git, then Git reencodes the content from the
>> +specified encoding to UTF-8 and stores the result in its internal data
>> +structure.
> 
> Minor Q:
>> its internal data structure.
> Should we simply write "stores the result in the index" ?

This text is targeted at the end user and I know a lot of end users
that are confused by the word "index". How about:

  ...stores the result in its internal data structure ("the index").

Would that work?

> 
>> On checkout the content is encoded back to the specified
>> +encoding.
> 
>> +
>> +Please note that using the `checkout-encoding` attribute has a number
>> +of drawbacks:
>> +
>> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>> +  errors as the conversion might not be round trip safe.
>> +
>> +- Reencoding content requires resources that might slow down certain
>> +  Git operations (e.g 'git checkout' or 'git add').
>> +
>> +- Git clients that do not support the `checkout-encoding` attribute or
>> +  the used encoding will checkout the respective files as UTF-8 encoded.
>> +  That means the content appears to be different which could cause
>> +  trouble. Affected clients are older Git versions and alternative Git
>> +  implementations such as JGit or libgit2 (as of January 2018).
>> +
>> +Use the `checkout-encoding` attribute only if you cannot store a file in
>> +UTF-8 encoding and if you want Git to be able to process the content as
>> +text.
>> +
> 
> I would maybe rephrase a little bit (first things first):
> 
> Please note that using the `checkout-encoding` attribute may have a number
> of pitfalls:
> 
> - Git clients that do not support the `checkout-encoding` attribute
>  will checkout the respective files as UTF-8 encoded,  which typically
>  causes trouble.
>  Escpecialy when older Git versions are used or alternative Git
>  implementations such as JGit or libgit2 (as of January 2018).
> 
> - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>  errors as the conversion might not be round trip safe.
> 
> - Reencoding content requires resources that might slow down certain
>  Git operations (e.g 'git checkout' or 'git add').
> 
> Use the `checkout-encoding` attribute only if you cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.

Agreed!


> -----
> Side question: What happens if "the used encoding" is not supported by
> the underlying iconv lib ?
> Will Git fail, delete the file from the working tree ?
> That may be worth to mention. (And I need to do the code-review)

It should checkout the file in UTF-8 and print an error.
If you would try to add a file with an unsupported encoding,
then Git should die() and refuse the operation.
(see t0028: check unsupported encoding)


> 
>> +Use the following attributes if your '*.txt' files are UTF-16 encoded
>> +with byte order mark (BOM) and you want Git to perform automatic line
>> +ending conversion based on your platform.
>> +
>> +------------------------
>> +*.txt		text checkout-encoding=UTF-16
>> +------------------------
>> +
>> +Use the following attributes if your '*.txt' files are UTF-16 little
>> +endian encoded without BOM and you want Git to use Windows line endings
>> +in the working directory.
>> +
>> +------------------------
>> +*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
>> +------------------------
>> +
>> +You can get a list of all available encodings on your platform with the
>> +following command:
>> +
>> +------------------------
>> +iconv --list
>> +------------------------
>> +
>> +
>> `ident`
>> ^^^^^^^
>> 
>> diff --git a/apply.c b/apply.c
>> index 321a9fa68d..c4bd5cf1f2 100644
>> --- a/apply.c
>> +++ b/apply.c
>> @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
>> 		 * should never look at the index when explicit crlf option
>> 		 * is given.
>> 		 */
>> -		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
>> +		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);
> 
> Hm. Do we really need another parameter here?
> The only caller that needs it (and sets it to 1) is sha1_file.c.
> When an invalid encoding/content is used, it should be safe to die() always?
> Just loud thinking..

Unfortunately it is not safe to die() always. Peff described a situation here
where you could not "checkout away" from an error state:
https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/


> If really needed, it may need less changes in the code base, if a new function
> convert_to_git_or_die() is defined, which is called by convert_to_git()
> 
> (and the other alternative would be to convert safe_crlf into a bitmap,
> see https://public-inbox.org/git/20171229132828.17594-1-tboegi@web.de/T/#u
> what do you think ?)

I still need to review your patch, but a bitmap sounds like a good idea.
But why do we need a special bitmap? Can't we just pass "flags" and
calculate get_safe_crlf() in convert.c?


>> 
>> +
>> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +		const char *error_msg = _(
>> +			"BOM is prohibited for '%s' if encoded as %s");
>> +		const char *advise_msg = _(
>> +			"You told Git to treat '%s' as %s. A byte order mark "
>> +			"(BOM) is prohibited with this encoding. Either use "
>> +			"%.6s as checkout encoding or remove the BOM from the "
>> +			"file.");
>> +
>> +		advise(advise_msg, path, enc->name, enc->name, enc->name);
>> +		if (write_obj)
>> +			die(error_msg, path, enc->name);
>> +		else
>> +			error(error_msg, path, enc->name);
> 
> As said before, I would just die(). Or do I miss something ?
> Being strict with BOMs seams like a good idea.

See above, Peffs test case. We run convert in a lot of places
(e.g. 'git diff' or if checkout away from a bad state).


>> 
>> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
>> new file mode 100755
>> index 0000000000..1a329ab933
>> --- /dev/null
>> +++ b/t/t0028-checkout-encoding.sh
>> @@ -0,0 +1,197 @@
>> +#!/bin/sh
>> +
>> +test_description='checkout-encoding conversion via gitattributes'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup test repo' '
>> +
>> +	text="hallo there!\ncan you read me?" &&
> 
> Is this portable ? (the "\n")

I assume it as it is used in t0001, t0008, t0024 and others.


Thanks for the review,
Lars


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 0/5] V2B: simplify convert.c/h
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2017-12-29 15:22 ` [PATCH v2 5/5] convert: add tracing for checkout-encoding lars.schneider
@ 2017-12-31  8:05 ` tboegi
  2017-12-31  8:05 ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer tboegi
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: tboegi @ 2017-12-31  8:05 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Simplify the convert.h/convert.c logic amd don't touch convert_to_git()
The rest is v2 from Lars

Lars Schneider (4):
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add support for 'checkout-encoding' attribute

Torsten Bögershausen (1):
  convert_to_git(): checksafe becomes an integer

 Documentation/gitattributes.txt |  59 +++++++++++
 apply.c                         |   4 +-
 convert.c                       | 210 +++++++++++++++++++++++++++++++++++++---
 convert.h                       |  19 ++--
 diff.c                          |   4 +-
 environment.c                   |   2 +-
 sha1_file.c                     |   8 +-
 strbuf.c                        |  13 +++
 strbuf.h                        |   1 +
 t/t0028-checkout-encoding.sh    | 197 +++++++++++++++++++++++++++++++++++++
 utf8.c                          |  37 +++++++
 utf8.h                          |  25 +++++
 12 files changed, 549 insertions(+), 30 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh

-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/5] convert_to_git(): checksafe becomes an integer
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2017-12-31  8:05 ` [PATCH 0/5] V2B: simplify convert.c/h tboegi
@ 2017-12-31  8:05 ` tboegi
  2017-12-31 12:23   ` Lars Schneider
  2017-12-31  8:05 ` [PATCH 2/5] strbuf: add xstrdup_toupper() tboegi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: tboegi @ 2017-12-31  8:05 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When calling convert_to_git(), the checksafe parameter has been used to
check if commit would give a non-roundtrip conversion of EOL.

When checksafe was introduced, 3 values had been in use:
SAFE_CRLF_FALSE: no warning
SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

Today a small flaw is found in the code base:
An integer with the value 0 is passed as the parameter checksafe
instead of the correct enum value SAFE_CRLF_FALSE.

In the next commit there is a need to turn checksafe into a bitmap, which
allows to tell convert_to_git() to obey the encoding attribute or not.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 apply.c       |  4 ++--
 convert.c     | 20 ++++++++++----------
 convert.h     | 18 ++++++++----------
 diff.c        |  4 ++--
 environment.c |  2 +-
 sha1_file.c   |  6 +++---
 6 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..a422516062 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,7 +2263,7 @@ static void show_stats(struct apply_state *state, struct patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 			 const char *path, struct strbuf *buf)
 {
-	enum safe_crlf safe_crlf = patch->crlf_in_old ?
+	int checksafe = patch->crlf_in_old ?
 		SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
 	switch (st->st_mode & S_IFMT) {
 	case S_IFLNK:
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, checksafe);
 		return 0;
 	default:
 		return -1;
diff --git a/convert.c b/convert.c
index 1a41a48e15..5efcc3b73b 100644
--- a/convert.c
+++ b/convert.c
@@ -195,13 +195,13 @@ static enum eol output_eol(enum crlf_action crlf_action)
 
 static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
 			    struct text_stat *old_stats, struct text_stat *new_stats,
-			    enum safe_crlf checksafe)
+			    int checksafe)
 {
 	if (old_stats->crlf && !new_stats->crlf ) {
 		/*
 		 * CRLFs would not be restored by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (checksafe & SAFE_CRLF_WARN)
 			warning(_("CRLF will be replaced by LF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
@@ -211,7 +211,7 @@ static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
 		/*
 		 * CRLFs would be added by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (checksafe & SAFE_CRLF_WARN)
 			warning(_("LF will be replaced by CRLF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
@@ -268,7 +268,7 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
-		       enum crlf_action crlf_action, enum safe_crlf checksafe)
+		       enum crlf_action crlf_action, int checksafe)
 {
 	struct text_stat stats;
 	char *dst;
@@ -298,12 +298,12 @@ static int crlf_to_git(const struct index_state *istate,
 		 * unless we want to renormalize in a merge or
 		 * cherry-pick.
 		 */
-		if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+		if ((!(checksafe & SAFE_CRLF_RENORMALIZE)) &&
 		    has_crlf_in_index(istate, path))
 			convert_crlf_into_lf = 0;
 	}
-	if ((checksafe == SAFE_CRLF_WARN ||
-	    (checksafe == SAFE_CRLF_FAIL)) && len) {
+	if (((checksafe & SAFE_CRLF_WARN) ||
+	     ((checksafe & SAFE_CRLF_FAIL) && len))) {
 		struct text_stat new_stats;
 		memcpy(&new_stats, &stats, sizeof(new_stats));
 		/* simulate "git add" */
@@ -1129,7 +1129,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-                   struct strbuf *dst, enum safe_crlf checksafe)
+		   struct strbuf *dst, int checksafe)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1144,7 +1144,7 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
-	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
+	if (!(checksafe & SAFE_CRLF_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
 		if (ret && dst) {
 			src = dst->buf;
@@ -1156,7 +1156,7 @@ int convert_to_git(const struct index_state *istate,
 
 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe)
+			      int checksafe)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
diff --git a/convert.h b/convert.h
index 4f2da225a8..532af00423 100644
--- a/convert.h
+++ b/convert.h
@@ -8,15 +8,13 @@
 
 struct index_state;
 
-enum safe_crlf {
-	SAFE_CRLF_FALSE = 0,
-	SAFE_CRLF_FAIL = 1,
-	SAFE_CRLF_WARN = 2,
-	SAFE_CRLF_RENORMALIZE = 3,
-	SAFE_CRLF_KEEP_CRLF = 4
-};
+#define SAFE_CRLF_FALSE       0
+#define SAFE_CRLF_FAIL        (1<<0)
+#define SAFE_CRLF_WARN        (1<<1)
+#define SAFE_CRLF_RENORMALIZE (1<<2)
+#define SAFE_CRLF_KEEP_CRLF   (1<<3)
 
-extern enum safe_crlf safe_crlf;
+extern int safe_crlf;
 
 enum auto_crlf {
 	AUTO_CRLF_FALSE = 0,
@@ -66,7 +64,7 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe);
+			  struct strbuf *dst, int checksafe);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -85,7 +83,7 @@ static inline int would_convert_to_git(const struct index_state *istate,
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe);
+				     int checksafe);
 extern int would_convert_to_git_filter_fd(const char *path);
 
 /*****************************************************************
diff --git a/diff.c b/diff.c
index fb22b19f09..5e3aaea6e0 100644
--- a/diff.c
+++ b/diff.c
@@ -3524,7 +3524,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 	 * demote FAIL to WARN to allow inspecting the situation
 	 * instead of refusing.
 	 */
-	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
+	int checksafe = (safe_crlf == SAFE_CRLF_FAIL
 				    ? SAFE_CRLF_WARN
 				    : safe_crlf);
 
@@ -3603,7 +3603,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, checksafe)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..10ee89a28d 100644
--- a/environment.c
+++ b/environment.c
@@ -49,7 +49,7 @@ enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
 int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
-enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
+int safe_crlf = SAFE_CRLF_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/sha1_file.c b/sha1_file.c
index 3da70ac650..78e002392e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -133,7 +133,7 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
 }
 
 
-static enum safe_crlf get_safe_crlf(unsigned flags)
+static int get_checksafe(unsigned flags)
 {
 	if (flags & HASH_RENORMALIZE)
 		return SAFE_CRLF_RENORMALIZE;
@@ -1752,7 +1752,7 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags))) {
+				   get_checksafe(flags))) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1786,7 +1786,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));
 
 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags));
+				 get_checksafe(flags));
 
 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/5] strbuf: add xstrdup_toupper()
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (6 preceding siblings ...)
  2017-12-31  8:05 ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer tboegi
@ 2017-12-31  8:05 ` tboegi
  2017-12-31  8:05 ` [PATCH 3/5] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: tboegi @ 2017-12-31  8:05 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 strbuf.c | 13 +++++++++++++
 strbuf.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 8007be8fba..ee05626dc1 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -785,6 +785,19 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	result[i] = '\0';
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/5] utf8: add function to detect prohibited UTF-16/32 BOM
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (7 preceding siblings ...)
  2017-12-31  8:05 ` [PATCH 2/5] strbuf: add xstrdup_toupper() tboegi
@ 2017-12-31  8:05 ` tboegi
  2017-12-31  8:05 ` [PATCH 4/5] utf8: add function to detect a missing " tboegi
  2017-12-31  8:06 ` [PATCH 5/5] convert: add support for 'checkout-encoding' attribute tboegi
  10 siblings, 0 replies; 26+ messages in thread
From: tboegi @ 2017-12-31  8:05 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..776660ee12 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+const char utf16_be_bom[] = {0xFE, 0xFF};
+const char utf16_le_bom[] = {0xFF, 0xFE};
+const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/5] utf8: add function to detect a missing UTF-16/32 BOM
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (8 preceding siblings ...)
  2017-12-31  8:05 ` [PATCH 3/5] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
@ 2017-12-31  8:05 ` tboegi
  2017-12-31  8:06 ` [PATCH 5/5] convert: add support for 'checkout-encoding' attribute tboegi
  10 siblings, 0 replies; 26+ messages in thread
From: tboegi @ 2017-12-31  8:05 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 776660ee12..1978d6c42a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..26b5e91852 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/5] convert: add support for 'checkout-encoding' attribute
  2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
                   ` (9 preceding siblings ...)
  2017-12-31  8:05 ` [PATCH 4/5] utf8: add function to detect a missing " tboegi
@ 2017-12-31  8:06 ` tboegi
  2018-01-05 14:45   ` Johannes Schindelin
  10 siblings, 1 reply; 26+ messages in thread
From: tboegi @ 2017-12-31  8:06 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Lars Schneider <larsxschneider@gmail.com>

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 Documentation/gitattributes.txt |  59 ++++++++++++
 convert.c                       | 190 +++++++++++++++++++++++++++++++++++++-
 convert.h                       |  11 ++-
 sha1_file.c                     |   2 +-
 t/t0028-checkout-encoding.sh    | 197 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 452 insertions(+), 7 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..0039bd38c3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,65 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`checkout-encoding`
+^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can teach Git the encoding of a file in the working
+directory with the `checkout-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure. On checkout the content is encoded back to the specified
+encoding.
+
+Please note that using the `checkout-encoding` attribute has a number
+of drawbacks:
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+- Git clients that do not support the `checkout-encoding` attribute or
+  the used encoding will checkout the respective files as UTF-8 encoded.
+  That means the content appears to be different which could cause
+  trouble. Affected clients are older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of January 2018).
+
+Use the `checkout-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt		text checkout-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index 5efcc3b73b..22c70d87e5 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int die_on_failure)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as checkout encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (die_on_failure)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as checkout encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (die_on_failure)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (die_on_failure)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	/*
+	 * UTF supports lossless round tripping [1]. UTF to other encoding are
+	 * mostly round trip safe as Unicode aims to be a superset of all other
+	 * character encodings. However, the SHIFT-JIS (Japanese character set)
+	 * is an exception as some codes are not round trip safe [2].
+	 *
+	 * Reverse the transformation of 'dst' and check the result with 'src'
+	 * if content is written to Git. This ensures no information is lost
+	 * during conversion to/from UTF-8.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without iconv error.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if (die_on_failure && !strcmp(enc->name, "SHIFT-JIS")) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			if (die_on_failure)
+				die(msg, path, enc->name, default_encoding);
+			else
+				error(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1120,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1200,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1209,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "checkout-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1064,6 +1234,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1132,6 +1303,7 @@ int convert_to_git(const struct index_state *istate,
 		   struct strbuf *dst, int checksafe)
 {
 	int ret = 0;
+	int die_on_failure = checksafe & SAFE_CRLF_DIE_ON_ERROR;
 	struct conv_attrs ca;
 
 	convert_attrs(&ca, path);
@@ -1140,6 +1312,11 @@ int convert_to_git(const struct index_state *istate,
 	if (!ret && ca.drv && ca.drv->required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, die_on_failure);
 	if (ret && dst) {
 		src = dst->buf;
 		len = dst->len;
@@ -1159,6 +1336,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 			      int checksafe)
 {
 	struct conv_attrs ca;
+	int die_on_failure = checksafe & SAFE_CRLF_DIE_ON_ERROR;
 	convert_attrs(&ca, path);
 
 	assert(ca.drv);
@@ -1167,6 +1345,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, die_on_failure);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1377,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1849,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.checkout_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 532af00423..3a85c2f9a5 100644
--- a/convert.h
+++ b/convert.h
@@ -8,11 +8,12 @@
 
 struct index_state;
 
-#define SAFE_CRLF_FALSE       0
-#define SAFE_CRLF_FAIL        (1<<0)
-#define SAFE_CRLF_WARN        (1<<1)
-#define SAFE_CRLF_RENORMALIZE (1<<2)
-#define SAFE_CRLF_KEEP_CRLF   (1<<3)
+#define SAFE_CRLF_FALSE        0
+#define SAFE_CRLF_FAIL         (1<<0)
+#define SAFE_CRLF_WARN         (1<<1)
+#define SAFE_CRLF_RENORMALIZE  (1<<2)
+#define SAFE_CRLF_KEEP_CRLF    (1<<3)
+#define SAFE_CRLF_DIE_ON_ERROR (1<<4)
 
 extern int safe_crlf;
 
diff --git a/sha1_file.c b/sha1_file.c
index 78e002392e..c39b298f83 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_checksafe(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return SAFE_CRLF_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-		return safe_crlf;
+		return safe_crlf | SAFE_CRLF_DIE_ON_ERROR;
 	else
 		return SAFE_CRLF_FALSE;
 }
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
new file mode 100755
index 0000000000..1a329ab933
--- /dev/null
+++ b/t/t0028-checkout-encoding.sh
@@ -0,0 +1,197 @@
+#!/bin/sh
+
+test_description='checkout-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+
+	text="hallo there!\ncan you read me?" &&
+
+	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
+
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text checkout-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] convert_to_git(): checksafe becomes an integer
  2017-12-31  8:05 ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer tboegi
@ 2017-12-31 12:23   ` Lars Schneider
  2018-01-01 21:59     ` [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags tboegi
  2018-01-05 23:22     ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer Junio C Hamano
  0 siblings, 2 replies; 26+ messages in thread
From: Lars Schneider @ 2017-12-31 12:23 UTC (permalink / raw)
  To: tboegi; +Cc: Jeff King, Johannes Sixt, lars.schneider, git, gitster, patrick


> On 31 Dec 2017, at 09:05, tboegi@web.de wrote:
> 
> From: Torsten Bögershausen <tboegi@web.de>
> 
> When calling convert_to_git(), the checksafe parameter has been used to
> check if commit would give a non-roundtrip conversion of EOL.
> 
> When checksafe was introduced, 3 values had been in use:
> SAFE_CRLF_FALSE: no warning
> SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
> SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

In general, I really like the direction as this simplifies
my patch later on in 5/5. However, I see a few problems:

(1) The prefix "SAFE_CRLF" confuses me because the main theme
    is EOL and CRLF just happens to be a EOL type.

    What do you think about something like this:
    CONVERT_ERROR_IGNORE     0
    CONVERT_EOL_ERROR_DIE    (1<<0)
    CONVERT_EOL_ERROR_WARN   (1<<1)
    CONVERT_EOL_TO_LF        (1<<2)
    CONVERT_EOL_KEEP_CRLF    (1<<3)
    CONVERT_ENCODE_ERROR_DIE (1<<4)

(2) We mix error reporting switches (FALSE/FAIL/WARN) with
    switches that change the content (RENORMALIZE/KEEP_CRLF).
    Plus, these the switches should be mutually exclusive
    (e.g. we don't want to enable the FAIL and WARN bit at
    the same time).

(3) We kind of replicate some flags defined in cache.h:
    #define HASH_WRITE_OBJECT 1
    #define HASH_RENORMALIZE  4


- Lars



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags
  2017-12-31 12:23   ` Lars Schneider
@ 2018-01-01 21:59     ` tboegi
  2018-01-02 19:11       ` Lars Schneider
  2018-01-05 19:00       ` Lars Schneider
  2018-01-05 23:22     ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer Junio C Hamano
  1 sibling, 2 replies; 26+ messages in thread
From: tboegi @ 2018-01-01 21:59 UTC (permalink / raw)
  To: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider
  Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When calling convert_to_git(), the checksafe parameter has been used to
check if commit would give a non-roundtrip conversion of EOL.

When checksafe was introduced, 3 values had been in use:
SAFE_CRLF_FALSE: no warning
SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip

Already today the integer value 0 is passed as the parameter checksafe
instead of the correct enum value SAFE_CRLF_FALSE.

Turn the whole call chain to use an integer with single bits, which
can be extended in the next commits:
- The global configuration variable safe_crlf is now conv_flags_eol.
- The parameter checksafe is renamed into conv_flags.

Helped-By: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
This is my suggestion.
(1) The flag bits had been renamed.
(2) The (theoretical ?) mix of WARN/FAIL is still there,
    I am not sure if this is a real problem.

(3) There are 2 reasons that CONV_EOL_RENORMALIZE is set.
    Either in a renormalizing merge, or by running
    git add --renormalize .
    Therefor HASH_RENORMALIZE is not the same as CONV_EOL_RENORMALIZE.

apply.c        |  6 +++---
 combine-diff.c |  2 +-
 config.c       |  7 +++++--
 convert.c      | 38 +++++++++++++++++++-------------------
 convert.h      | 17 +++++++----------
 diff.c         |  8 ++++----
 environment.c  |  2 +-
 sha1_file.c    | 12 ++++++------
 8 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..f8b67bfee2 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 			 const char *path, struct strbuf *buf)
 {
-	enum safe_crlf safe_crlf = patch->crlf_in_old ?
-		SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
+	int conv_flags = patch->crlf_in_old ?
+		CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE;
 	switch (st->st_mode & S_IFMT) {
 	case S_IFLNK:
 		if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags);
 		return 0;
 	default:
 		return -1;
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119a..dbc877d0fe 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 			if (is_file) {
 				struct strbuf buf = STRBUF_INIT;
 
-				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
+				if (convert_to_git(&the_index, elem->path, result, len, &buf, conv_flags_eol)) {
 					free(result);
 					result = strbuf_detach(&buf, &len);
 					result_size = len;
diff --git a/config.c b/config.c
index e617c2018d..bdc7ce2a7e 100644
--- a/config.c
+++ b/config.c
@@ -1149,11 +1149,14 @@ static int git_default_core_config(const char *var, const char *value)
 	}
 
 	if (!strcmp(var, "core.safecrlf")) {
+		int eol_rndtrp_die;
 		if (value && !strcasecmp(value, "warn")) {
-			safe_crlf = SAFE_CRLF_WARN;
+			conv_flags_eol = CONV_EOL_RNDTRP_WARN;
 			return 0;
 		}
-		safe_crlf = git_config_bool(var, value);
+		eol_rndtrp_die = git_config_bool(var, value);
+		conv_flags_eol = eol_rndtrp_die ?
+			CONV_EOL_RNDTRP_DIE : CONV_EOL_RNDTRP_WARN;
 		return 0;
 	}
 
diff --git a/convert.c b/convert.c
index 1a41a48e15..0207ddab24 100644
--- a/convert.c
+++ b/convert.c
@@ -193,30 +193,30 @@ static enum eol output_eol(enum crlf_action crlf_action)
 	return core_eol;
 }
 
-static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
+static void check_conv_flags_eol(const char *path, enum crlf_action crlf_action,
 			    struct text_stat *old_stats, struct text_stat *new_stats,
-			    enum safe_crlf checksafe)
+			    int conv_flags)
 {
 	if (old_stats->crlf && !new_stats->crlf ) {
 		/*
 		 * CRLFs would not be restored by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (conv_flags & CONV_EOL_RNDTRP_DIE)
+			die(_("CRLF would be replaced by LF in %s."), path);
+		else if (conv_flags & CONV_EOL_RNDTRP_WARN)
 			warning(_("CRLF will be replaced by LF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
-		else /* i.e. SAFE_CRLF_FAIL */
-			die(_("CRLF would be replaced by LF in %s."), path);
 	} else if (old_stats->lonelf && !new_stats->lonelf ) {
 		/*
 		 * CRLFs would be added by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (conv_flags & CONV_EOL_RNDTRP_DIE)
+			die(_("LF would be replaced by CRLF in %s"), path);
+		else if (conv_flags & CONV_EOL_RNDTRP_WARN)
 			warning(_("LF will be replaced by CRLF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
-		else /* i.e. SAFE_CRLF_FAIL */
-			die(_("LF would be replaced by CRLF in %s"), path);
 	}
 }
 
@@ -268,7 +268,7 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
-		       enum crlf_action crlf_action, enum safe_crlf checksafe)
+		       enum crlf_action crlf_action, int conv_flags)
 {
 	struct text_stat stats;
 	char *dst;
@@ -298,12 +298,12 @@ static int crlf_to_git(const struct index_state *istate,
 		 * unless we want to renormalize in a merge or
 		 * cherry-pick.
 		 */
-		if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+		if ((!(conv_flags & CONV_EOL_RENORMALIZE)) &&
 		    has_crlf_in_index(istate, path))
 			convert_crlf_into_lf = 0;
 	}
-	if ((checksafe == SAFE_CRLF_WARN ||
-	    (checksafe == SAFE_CRLF_FAIL)) && len) {
+	if (((conv_flags & CONV_EOL_RNDTRP_WARN) ||
+	     ((conv_flags & CONV_EOL_RNDTRP_DIE) && len))) {
 		struct text_stat new_stats;
 		memcpy(&new_stats, &stats, sizeof(new_stats));
 		/* simulate "git add" */
@@ -316,7 +316,7 @@ static int crlf_to_git(const struct index_state *istate,
 			new_stats.crlf += new_stats.lonelf;
 			new_stats.lonelf = 0;
 		}
-		check_safe_crlf(path, crlf_action, &stats, &new_stats, checksafe);
+		check_conv_flags_eol(path, crlf_action, &stats, &new_stats, conv_flags);
 	}
 	if (!convert_crlf_into_lf)
 		return 0;
@@ -1129,7 +1129,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-                   struct strbuf *dst, enum safe_crlf checksafe)
+		   struct strbuf *dst, int conv_flags)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1144,8 +1144,8 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
-	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
-		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
+		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
 			src = dst->buf;
 			len = dst->len;
@@ -1156,7 +1156,7 @@ int convert_to_git(const struct index_state *istate,
 
 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe)
+			      int conv_flags)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
@@ -1167,7 +1167,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
-	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
+	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
 
@@ -1226,7 +1226,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
 		src = dst->buf;
 		len = dst->len;
 	}
-	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE);
+	return ret | convert_to_git(istate, path, src, len, dst, CONV_EOL_RENORMALIZE);
 }
 
 /*****************************************************************
diff --git a/convert.h b/convert.h
index 4f2da225a8..21a176f16c 100644
--- a/convert.h
+++ b/convert.h
@@ -8,15 +8,12 @@
 
 struct index_state;
 
-enum safe_crlf {
-	SAFE_CRLF_FALSE = 0,
-	SAFE_CRLF_FAIL = 1,
-	SAFE_CRLF_WARN = 2,
-	SAFE_CRLF_RENORMALIZE = 3,
-	SAFE_CRLF_KEEP_CRLF = 4
-};
+#define CONV_EOL_RNDTRP_DIE   (1<<0)
+#define CONV_EOL_RNDTRP_WARN  (1<<1)
+#define CONV_EOL_RENORMALIZE  (1<<2)
+#define CONV_EOL_KEEP_CRLF    (1<<3)
 
-extern enum safe_crlf safe_crlf;
+extern int conv_flags_eol;
 
 enum auto_crlf {
 	AUTO_CRLF_FALSE = 0,
@@ -66,7 +63,7 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe);
+			  struct strbuf *dst, int conv_flags);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -85,7 +82,7 @@ static inline int would_convert_to_git(const struct index_state *istate,
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe);
+				     int conv_flags);
 extern int would_convert_to_git_filter_fd(const char *path);
 
 /*****************************************************************
diff --git a/diff.c b/diff.c
index fb22b19f09..2470af52b2 100644
--- a/diff.c
+++ b/diff.c
@@ -3524,9 +3524,9 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 	 * demote FAIL to WARN to allow inspecting the situation
 	 * instead of refusing.
 	 */
-	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
-				    ? SAFE_CRLF_WARN
-				    : safe_crlf);
+	int conv_flags = (conv_flags_eol == CONV_EOL_RNDTRP_DIE
+				    ? CONV_EOL_RNDTRP_WARN
+				    : conv_flags_eol);
 
 	if (!DIFF_FILE_VALID(s))
 		die("internal error: asking to populate invalid file.");
@@ -3603,7 +3603,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, conv_flags)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..0df21eb633 100644
--- a/environment.c
+++ b/environment.c
@@ -49,7 +49,7 @@ enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
 int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
-enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
+int conv_flags_eol = CONV_EOL_RNDTRP_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/sha1_file.c b/sha1_file.c
index 3da70ac650..c0714c1107 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -133,14 +133,14 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
 }
 
 
-static enum safe_crlf get_safe_crlf(unsigned flags)
+static int get_conv_flags(unsigned flags)
 {
 	if (flags & HASH_RENORMALIZE)
-		return SAFE_CRLF_RENORMALIZE;
+		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-		return safe_crlf;
+		return conv_flags_eol;
 	else
-		return SAFE_CRLF_FALSE;
+		return 0;
 }
 
 
@@ -1752,7 +1752,7 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags))) {
+				   get_conv_flags(flags))) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1786,7 +1786,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));
 
 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags));
+				 get_conv_flags(flags));
 
 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
-- 
2.16.0.rc0.4.ga4e00d4fa4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags
  2018-01-01 21:59     ` [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags tboegi
@ 2018-01-02 19:11       ` Lars Schneider
  2018-01-03  5:36         ` Torsten Bögershausen
  2018-01-05 19:00       ` Lars Schneider
  1 sibling, 1 reply; 26+ messages in thread
From: Lars Schneider @ 2018-01-02 19:11 UTC (permalink / raw)
  To: tboegi; +Cc: peff, j6t, lars.schneider, git, gitster, patrick


> On 01 Jan 2018, at 22:59, tboegi@web.de wrote:
> 
> From: Torsten Bögershausen <tboegi@web.de>
> 
> When calling convert_to_git(), the checksafe parameter has been used to
> check if commit would give a non-roundtrip conversion of EOL.
> 
> When checksafe was introduced, 3 values had been in use:
> SAFE_CRLF_FALSE: no warning
> SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
> SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip
> 
> Already today the integer value 0 is passed as the parameter checksafe
> instead of the correct enum value SAFE_CRLF_FALSE.
> 
> Turn the whole call chain to use an integer with single bits, which
> can be extended in the next commits:
> - The global configuration variable safe_crlf is now conv_flags_eol.
> - The parameter checksafe is renamed into conv_flags.
> 
> Helped-By: Lars Schneider <larsxschneider@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
> This is my suggestion.
> (1) The flag bits had been renamed.
> (2) The (theoretical ?) mix of WARN/FAIL is still there,
>    I am not sure if this is a real problem.
> 
> (3) There are 2 reasons that CONV_EOL_RENORMALIZE is set.
>    Either in a renormalizing merge, or by running
>    git add --renormalize .
>    Therefor HASH_RENORMALIZE is not the same as CONV_EOL_RENORMALIZE.
> 
> apply.c        |  6 +++---
> combine-diff.c |  2 +-
> config.c       |  7 +++++--
> convert.c      | 38 +++++++++++++++++++-------------------
> convert.h      | 17 +++++++----------
> diff.c         |  8 ++++----
> environment.c  |  2 +-
> sha1_file.c    | 12 ++++++------
> 8 files changed, 46 insertions(+), 46 deletions(-)
> ...
> -static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
> +static void check_conv_flags_eol(const char *path, enum crlf_action crlf_action,
> 			    struct text_stat *old_stats, struct text_stat *new_stats,
> -			    enum safe_crlf checksafe)
> +			    int conv_flags)
> {
> 	if (old_stats->crlf && !new_stats->crlf ) {
> 		/*
> 		 * CRLFs would not be restored by checkout
> 		 */
> -		if (checksafe == SAFE_CRLF_WARN)
> +		if (conv_flags & CONV_EOL_RNDTRP_DIE)
> +			die(_("CRLF would be replaced by LF in %s."), path);
> +		else if (conv_flags & CONV_EOL_RNDTRP_WARN)

Here we go with CONV_EOL_RNDTRP_DIE if there is garbage in conv_flags.
Garbage example: CONV_EOL_RNDTRP_DIE *and* CONV_EOL_RNDTRP_WARN are set.

Good!

> 	...
> /*****************************************************************
> diff --git a/diff.c b/diff.c
> index fb22b19f09..2470af52b2 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3524,9 +3524,9 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> 	 * demote FAIL to WARN to allow inspecting the situation
> 	 * instead of refusing.
> 	 */
> -	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
> -				    ? SAFE_CRLF_WARN
> -				    : safe_crlf);
> +	int conv_flags = (conv_flags_eol == CONV_EOL_RNDTRP_DIE
> +				    ? CONV_EOL_RNDTRP_WARN
> +				    : conv_flags_eol);

If there is garbage in conv_flags_eol then we would not demote the DIE
flag here.

How about something like that:
int conv_flags = conv_flags_eol & ~CONV_EOL_RNDTRP_DIE;

---

In general I like the patch as I think the variables are a bit easier to understand. 
One thing I stumbled over while reading the patch:

The global variable "conv_flags_eol". I think the Git coding guidelines
have no recommendation for naming global variables. I think a "global_conv_flags_eol"
or something would have helped me. I can also see how others might frown upon such 
a naming scheme.

If this patch gets accepted then I will rebase my encoding series on top of it:
https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/


Thanks,
Lars


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags
  2018-01-02 19:11       ` Lars Schneider
@ 2018-01-03  5:36         ` Torsten Bögershausen
  2018-01-03  9:37           ` Lars Schneider
  0 siblings, 1 reply; 26+ messages in thread
From: Torsten Bögershausen @ 2018-01-03  5:36 UTC (permalink / raw)
  To: Lars Schneider; +Cc: peff, j6t, lars.schneider, git, gitster, patrick

On Tue, Jan 02, 2018 at 08:11:51PM +0100, Lars Schneider wrote:
 
[snip]

> > /*****************************************************************
> > diff --git a/diff.c b/diff.c
> > index fb22b19f09..2470af52b2 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -3524,9 +3524,9 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> > 	 * demote FAIL to WARN to allow inspecting the situation
> > 	 * instead of refusing.
> > 	 */
> > -	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
> > -				    ? SAFE_CRLF_WARN
> > -				    : safe_crlf);
> > +	int conv_flags = (conv_flags_eol == CONV_EOL_RNDTRP_DIE
> > +				    ? CONV_EOL_RNDTRP_WARN
> > +				    : conv_flags_eol);
> 
> If there is garbage in conv_flags_eol then we would not demote the DIE
> flag here.
> 
> How about something like that:
> int conv_flags = conv_flags_eol & ~CONV_EOL_RNDTRP_DIE;

The next version will probably look like this:

int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
{
	int size_only = flags & CHECK_SIZE_ONLY;
	int err = 0;
	int conv_flags = conv_flags_eol;
	/*
	 * demote FAIL to WARN to allow inspecting the situation
	 * instead of refusing.
	 */
	if (conv_flags & CONV_EOL_RNDTRP_DIE)
		conv_flags =CONV_EOL_RNDTRP_WARN;

> 
> ---
> 
> In general I like the patch as I think the variables are a bit easier to understand. 
> One thing I stumbled over while reading the patch:
> 
> The global variable "conv_flags_eol". I think the Git coding guidelines
> have no recommendation for naming global variables. I think a "global_conv_flags_eol"
> or something would have helped me. I can also see how others might frown upon such 
> a naming scheme.

I don't have a problem with "global_conv_flags_eol".
Thank for the comments, let's wait for more comments before I send out V4.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags
  2018-01-03  5:36         ` Torsten Bögershausen
@ 2018-01-03  9:37           ` Lars Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Schneider @ 2018-01-03  9:37 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Jeff King, Johannes Sixt, Lars Schneider, Git List, gitster,
	patrick


> On 03 Jan 2018, at 06:36, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Tue, Jan 02, 2018 at 08:11:51PM +0100, Lars Schneider wrote:
> 
> [snip]
> 
>>> /*****************************************************************
>>> diff --git a/diff.c b/diff.c
>>> index fb22b19f09..2470af52b2 100644
>>> --- a/diff.c
>>> +++ b/diff.c
>>> @@ -3524,9 +3524,9 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>>> 	 * demote FAIL to WARN to allow inspecting the situation
>>> 	 * instead of refusing.
>>> 	 */
>>> -	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
>>> -				    ? SAFE_CRLF_WARN
>>> -				    : safe_crlf);
>>> +	int conv_flags = (conv_flags_eol == CONV_EOL_RNDTRP_DIE
>>> +				    ? CONV_EOL_RNDTRP_WARN
>>> +				    : conv_flags_eol);
>> 
>> If there is garbage in conv_flags_eol then we would not demote the DIE
>> flag here.
>> 
>> How about something like that:
>> int conv_flags = conv_flags_eol & ~CONV_EOL_RNDTRP_DIE;
> 
> The next version will probably look like this:
> 
> int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> {
> 	int size_only = flags & CHECK_SIZE_ONLY;
> 	int err = 0;
> 	int conv_flags = conv_flags_eol;
> 	/*
> 	 * demote FAIL to WARN to allow inspecting the situation
> 	 * instead of refusing.
> 	 */
> 	if (conv_flags & CONV_EOL_RNDTRP_DIE)
> 		conv_flags =CONV_EOL_RNDTRP_WARN;

This looks good. Sorry, I just realized that my suggestion 
was garbage as it only disabled the bit. It did not demote 
DIE to WARN.

One final nit:
Can we be sure that 'conv_flags_eol' is only ever used for CONV_EOL_RNDTRP_DIE
and CONV_EOL_RNDTRP_WARN? Maybe a solution like that would be more future
proof?

  if (conv_flags & CONV_EOL_RNDTRP_DIE) {
      conv_flags &= ~CONV_EOL_RNDTRP_DIE;
      conv_flags |= CONV_EOL_RNDTRP_WARN;
  }


>> 
>> ---
>> 
>> In general I like the patch as I think the variables are a bit easier to understand. 
>> One thing I stumbled over while reading the patch:
>> 
>> The global variable "conv_flags_eol". I think the Git coding guidelines
>> have no recommendation for naming global variables. I think a "global_conv_flags_eol"
>> or something would have helped me. I can also see how others might frown upon such 
>> a naming scheme.
> 
> I don't have a problem with "global_conv_flags_eol".
> Thank for the comments, let's wait for more comments before I send out V4.

Sounds good. A "global_" like prefix is not used anywhere else in Git...

- Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] convert: add support for 'checkout-encoding' attribute
  2017-12-31  8:06 ` [PATCH 5/5] convert: add support for 'checkout-encoding' attribute tboegi
@ 2018-01-05 14:45   ` Johannes Schindelin
  0 siblings, 0 replies; 26+ messages in thread
From: Johannes Schindelin @ 2018-01-05 14:45 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: peff, j6t, lars.schneider, git, gitster, patrick, larsxschneider

Hi,

On Sun, 31 Dec 2017, tboegi@web.de wrote:

> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
> new file mode 100755
> index 0000000000..1a329ab933
> --- /dev/null
> +++ b/t/t0028-checkout-encoding.sh
> @@ -0,0 +1,197 @@
> +#!/bin/sh
> +
> +test_description='checkout-encoding conversion via gitattributes'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test repo' '
> +
> +	text="hallo there!\ncan you read me?" &&
> +
> +	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
> +
> +	printf "$text" >test.utf8.raw &&
> +	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> +	cp test.utf16.raw test.utf16 &&
> +
> +	git add .gitattributes test.utf16 &&
> +	git commit -m initial
> +'
> +
> +test_expect_success 'ensure UTF-8 is stored in Git' '
> +	git cat-file -p :test.utf16 >test.utf16.git &&
> +	test_cmp_bin test.utf8.raw test.utf16.git &&
> +	rm test.utf8.raw test.utf16.git
> +'
> +
> +test_expect_success 're-encode to UTF-16 on checkout' '
> +	rm test.utf16 &&
> +	git checkout test.utf16 &&
> +	test_cmp_bin test.utf16.raw test.utf16 &&

This fails here, due to an extra CR in test.utf16. This diff fixes it
(please include it in the next iteration of this patch series):

-- snipsnap --
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
index 1a329ab933b..aaf61bc9d7c 100755
--- a/t/t0028-checkout-encoding.sh
+++ b/t/t0028-checkout-encoding.sh
@@ -6,6 +6,7 @@ test_description='checkout-encoding conversion via
gitattributes'
 
 test_expect_success 'setup test repo' '
 
+	git config core.eol lf &&
 	text="hallo there!\ncan you read me?" &&
 
 	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags
  2018-01-01 21:59     ` [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags tboegi
  2018-01-02 19:11       ` Lars Schneider
@ 2018-01-05 19:00       ` Lars Schneider
  2018-01-05 19:33         ` Torsten Bögershausen
  1 sibling, 1 reply; 26+ messages in thread
From: Lars Schneider @ 2018-01-05 19:00 UTC (permalink / raw)
  To: tboegi
  Cc: Jeff King, Johannes Sixt, Lars Schneider, Git List,
	Junio C Hamano, patrick


> On 01 Jan 2018, at 22:59, tboegi@web.de wrote:
> 
> From: Torsten Bögershausen <tboegi@web.de>
> 
> When calling convert_to_git(), the checksafe parameter has been used to
> check if commit would give a non-roundtrip conversion of EOL.
> 
> When checksafe was introduced, 3 values had been in use:
> SAFE_CRLF_FALSE: no warning
> SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
> SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip
> 
> Already today the integer value 0 is passed as the parameter checksafe
> instead of the correct enum value SAFE_CRLF_FALSE.
> 
> Turn the whole call chain to use an integer with single bits, which
> can be extended in the next commits:
> - The global configuration variable safe_crlf is now conv_flags_eol.
> - The parameter checksafe is renamed into conv_flags.
> 
> Helped-By: Lars Schneider <larsxschneider@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
> This is my suggestion.
> (1) The flag bits had been renamed.
> (2) The (theoretical ?) mix of WARN/FAIL is still there,
>    I am not sure if this is a real problem.
> 
> (3) There are 2 reasons that CONV_EOL_RENORMALIZE is set.
>    Either in a renormalizing merge, or by running
>    git add --renormalize .
>    Therefor HASH_RENORMALIZE is not the same as CONV_EOL_RENORMALIZE.

Can you elaborate a bit? I am diving into the code but I am still confused.

I also noticed that the "flags" integer is potentially double booked with
the following values (see read-cache.c:add_to_index()):

#define ADD_CACHE_VERBOSE 1
#define ADD_CACHE_PRETEND 2
#define ADD_CACHE_IGNORE_ERRORS	4

#define HASH_WRITE_OBJECT 1
#define HASH_FORMAT_CHECK 2
#define HASH_RENORMALIZE  4

Is this intentional? 

Thanks,
Lars


More context:
  https://public-inbox.org/git/96B6CD4C-0A0C-47F5-922D-B8BAFB832FD1@gmail.com/
  (3) We kind of replicate some flags defined in cache.h:
     #define HASH_WRITE_OBJECT 1
     #define HASH_RENORMALIZE  4



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags
  2018-01-05 19:00       ` Lars Schneider
@ 2018-01-05 19:33         ` Torsten Bögershausen
  0 siblings, 0 replies; 26+ messages in thread
From: Torsten Bögershausen @ 2018-01-05 19:33 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Jeff King, Johannes Sixt, Lars Schneider, Git List,
	Junio C Hamano, patrick

On 2018-01-05 20:00, Lars Schneider wrote:
> 
>> On 01 Jan 2018, at 22:59, tboegi@web.de wrote:
>>
>> From: Torsten Bögershausen <tboegi@web.de>
>>
>> When calling convert_to_git(), the checksafe parameter has been used to
>> check if commit would give a non-roundtrip conversion of EOL.
>>
>> When checksafe was introduced, 3 values had been in use:
>> SAFE_CRLF_FALSE: no warning
>> SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
>> SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip
>>
>> Already today the integer value 0 is passed as the parameter checksafe
>> instead of the correct enum value SAFE_CRLF_FALSE.
>>
>> Turn the whole call chain to use an integer with single bits, which
>> can be extended in the next commits:
>> - The global configuration variable safe_crlf is now conv_flags_eol.
>> - The parameter checksafe is renamed into conv_flags.
>>
>> Helped-By: Lars Schneider <larsxschneider@gmail.com>
>> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
>> ---
>> This is my suggestion.
>> (1) The flag bits had been renamed.
>> (2) The (theoretical ?) mix of WARN/FAIL is still there,
>>    I am not sure if this is a real problem.
>>
>> (3) There are 2 reasons that CONV_EOL_RENORMALIZE is set.
>>    Either in a renormalizing merge, or by running
>>    git add --renormalize .
>>    Therefor HASH_RENORMALIZE is not the same as CONV_EOL_RENORMALIZE.
> 
> Can you elaborate a bit? I am diving into the code but I am still confused.
> 
> I also noticed that the "flags" integer is potentially double booked with
> the following values (see read-cache.c:add_to_index()):
> 
> #define ADD_CACHE_VERBOSE 1
> #define ADD_CACHE_PRETEND 2
> #define ADD_CACHE_IGNORE_ERRORS	4
> 
> #define HASH_WRITE_OBJECT 1
> #define HASH_FORMAT_CHECK 2
> #define HASH_RENORMALIZE  4
> 
> Is this intentional? 

All these flags have a different context,
and the right #define must be used in the right context/function call.
So there is no intention.
You start with 1, then use 2 and so on.

The same way as family Schmidt and family Meier both call
their child "Hans".
There is no connection.


> 
> Thanks,
> Lars
> 
> 
> More context:
>   https://public-inbox.org/git/96B6CD4C-0A0C-47F5-922D-B8BAFB832FD1@gmail.com/
>   (3) We kind of replicate some flags defined in cache.h:
>      #define HASH_WRITE_OBJECT 1
>      #define HASH_RENORMALIZE  4
> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] convert_to_git(): checksafe becomes an integer
  2017-12-31 12:23   ` Lars Schneider
  2018-01-01 21:59     ` [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags tboegi
@ 2018-01-05 23:22     ` Junio C Hamano
  2018-01-06  0:55       ` Lars Schneider
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2018-01-05 23:22 UTC (permalink / raw)
  To: Lars Schneider
  Cc: tboegi, Jeff King, Johannes Sixt, lars.schneider, git, patrick

Lars Schneider <larsxschneider@gmail.com> writes:

>> On 31 Dec 2017, at 09:05, tboegi@web.de wrote:
>> 
>> From: Torsten Bögershausen <tboegi@web.de>
>> 
>> When calling convert_to_git(), the checksafe parameter has been used to
>> check if commit would give a non-roundtrip conversion of EOL.
>> 
>> When checksafe was introduced, 3 values had been in use:
>> SAFE_CRLF_FALSE: no warning
>> SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
>> SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip
>
> In general, I really like the direction as this simplifies
> my patch later on in 5/5. However, I see a few problems:
> ...

Yes, this looks like a sensible way to go.  I saw Torsten's v3 for
1/5 but will end up queuing v2b, as I suspect 5/5 would need to be
adjusted for the change between the two versions.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/5] convert_to_git(): checksafe becomes an integer
  2018-01-05 23:22     ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer Junio C Hamano
@ 2018-01-06  0:55       ` Lars Schneider
  0 siblings, 0 replies; 26+ messages in thread
From: Lars Schneider @ 2018-01-06  0:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: tboegi, Jeff King, Johannes Sixt, lars.schneider, git, patrick


> On 06 Jan 2018, at 00:22, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Lars Schneider <larsxschneider@gmail.com> writes:
> 
>>> On 31 Dec 2017, at 09:05, tboegi@web.de wrote:
>>> 
>>> From: Torsten Bögershausen <tboegi@web.de>
>>> 
>>> When calling convert_to_git(), the checksafe parameter has been used to
>>> check if commit would give a non-roundtrip conversion of EOL.
>>> 
>>> When checksafe was introduced, 3 values had been in use:
>>> SAFE_CRLF_FALSE: no warning
>>> SAFE_CRLF_FAIL:  reject the commit if EOL do not roundtrip
>>> SAFE_CRLF_WARN:  warn the user if EOL do not roundtrip
>> 
>> In general, I really like the direction as this simplifies
>> my patch later on in 5/5. However, I see a few problems:
>> ...
> 
> Yes, this looks like a sensible way to go.  I saw Torsten's v3 for
> 1/5 but will end up queuing v2b, as I suspect 5/5 would need to be
> adjusted for the change between the two versions.

I just send out my v3 which integrates the latest version of Torsten's
patch into my series:

https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/

Please note: I did not rebase my series. Therefore, there is a small
conflict with 86ff70a0f0 (convert: tighten the safe autocrlf handling, 2017-11-26)
because has_cr_in_index() was renamed to has_crlf_in_index().

@Junio: What do you prefer in these cases? A rebased series or the conflict?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2018-01-06  0:55 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-29 15:22 [PATCH v2 0/5] convert: add support for different encodings lars.schneider
2017-12-29 15:22 ` [PATCH v2 1/5] strbuf: add xstrdup_toupper() lars.schneider
2017-12-29 15:56   ` Torsten Bögershausen
2017-12-29 16:55     ` Lars Schneider
2017-12-29 15:22 ` [PATCH v2 2/5] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2017-12-29 15:22 ` [PATCH v2 3/5] utf8: add function to detect a missing " lars.schneider
2017-12-29 15:22 ` [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute lars.schneider
2017-12-29 17:28   ` Torsten Bögershausen
2017-12-30 19:58     ` Lars Schneider
2017-12-29 15:22 ` [PATCH v2 5/5] convert: add tracing for checkout-encoding lars.schneider
2017-12-31  8:05 ` [PATCH 0/5] V2B: simplify convert.c/h tboegi
2017-12-31  8:05 ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer tboegi
2017-12-31 12:23   ` Lars Schneider
2018-01-01 21:59     ` [PATCH v3 1/1] convert_to_git(): checksafe becomes int conv_flags tboegi
2018-01-02 19:11       ` Lars Schneider
2018-01-03  5:36         ` Torsten Bögershausen
2018-01-03  9:37           ` Lars Schneider
2018-01-05 19:00       ` Lars Schneider
2018-01-05 19:33         ` Torsten Bögershausen
2018-01-05 23:22     ` [PATCH 1/5] convert_to_git(): checksafe becomes an integer Junio C Hamano
2018-01-06  0:55       ` Lars Schneider
2017-12-31  8:05 ` [PATCH 2/5] strbuf: add xstrdup_toupper() tboegi
2017-12-31  8:05 ` [PATCH 3/5] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
2017-12-31  8:05 ` [PATCH 4/5] utf8: add function to detect a missing " tboegi
2017-12-31  8:06 ` [PATCH 5/5] convert: add support for 'checkout-encoding' attribute tboegi
2018-01-05 14:45   ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).