git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v3 0/7] convert: add support for different encodings
@ 2018-01-06  0:48 lars.schneider
  2018-01-06  0:48 ` [PATCH v3 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-5 and 6 are helper functions and preparation.
Patch 6 is the actual change.

I am still torn between "checkout-encoding" and "working-tree-encoding"
as attribute name. I am happy to hear arguments for/against one or the
other.

Changes since v2:

* Added Torsten's crlfsave refactoring patch (patch 5)
  @Torsten: I tried to make the commit message more clean, added
            some comments to and renamed conv_flags_eol to
            global_conv_flags_eol.

* Improved documentation and commit message (Torsten)

* Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten)

* Set "git config core.eol lf" to made the test run on Windows (Dscho)

* Made BOM arrays static (Ramsay)


Thanks,
Lars


    RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
     v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
     v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/



Base Ref: master
Web-Diff: https://github.com/larsxschneider/git/commit/f21a1841a4
Checkout: git fetch https://github.com/larsxschneider/git encoding-v3 && git checkout f21a1841a4


### Interdiff (v2..v3):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 0039bd38c3..1bc03e69cb 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -285,11 +285,18 @@ In these cases you can teach Git the encoding of a file in the working
 directory with the `checkout-encoding` attribute. If a file with this
 attributes is added to Git, then Git reencodes the content from the
 specified encoding to UTF-8 and stores the result in its internal data
-structure. On checkout the content is encoded back to the specified
-encoding.
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.

-Please note that using the `checkout-encoding` attribute has a number
-of drawbacks:
+Please note that using the `checkout-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `checkout-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).

 - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
   errors as the conversion might not be round trip safe.
@@ -297,12 +304,6 @@ of drawbacks:
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').

-- Git clients that do not support the `checkout-encoding` attribute or
-  the used encoding will checkout the respective files as UTF-8 encoded.
-  That means the content appears to be different which could cause
-  trouble. Affected clients are older Git versions and alternative Git
-  implementations such as JGit or libgit2 (as of January 2018).
-
 Use the `checkout-encoding` attribute only if you cannot store a file in
 UTF-8 encoding and if you want Git to be able to process the content as
 text.
diff --git a/apply.c b/apply.c
index c4bd5cf1f2..f8b67bfee2 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 			 const char *path, struct strbuf *buf)
 {
-	enum safe_crlf safe_crlf = patch->crlf_in_old ?
-		SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
+	int conv_flags = patch->crlf_in_old ?
+		CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE;
 	switch (st->st_mode & S_IFMT) {
 	case S_IFLNK:
 		if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags);
 		return 0;
 	default:
 		return -1;
diff --git a/blame.c b/blame.c
index 388b66897b..2893f3c103 100644
--- a/blame.c
+++ b/blame.c
@@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
 		if (strbuf_read(&buf, 0, 0) < 0)
 			die_errno("failed to read from stdin");
 	}
-	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0, 0);
+	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0);
 	origin->file.ptr = buf.buf;
 	origin->file.size = buf.len;
 	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
diff --git a/combine-diff.c b/combine-diff.c
index 4555e49b5f..19f30c3353 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 			if (is_file) {
 				struct strbuf buf = STRBUF_INIT;

-				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf, 0)) {
+				if (convert_to_git(&the_index, elem->path, result, len, &buf, global_conv_flags_eol)) {
 					free(result);
 					result = strbuf_detach(&buf, &len);
 					result_size = len;
diff --git a/config.c b/config.c
index e617c2018d..1f003fbb90 100644
--- a/config.c
+++ b/config.c
@@ -1149,11 +1149,14 @@ static int git_default_core_config(const char *var, const char *value)
 	}

 	if (!strcmp(var, "core.safecrlf")) {
+		int eol_rndtrp_die;
 		if (value && !strcasecmp(value, "warn")) {
-			safe_crlf = SAFE_CRLF_WARN;
+			global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
 			return 0;
 		}
-		safe_crlf = git_config_bool(var, value);
+		eol_rndtrp_die = git_config_bool(var, value);
+		global_conv_flags_eol = eol_rndtrp_die ?
+			CONV_EOL_RNDTRP_DIE : CONV_EOL_RNDTRP_WARN;
 		return 0;
 	}

diff --git a/convert.c b/convert.c
index ca7b2f3e5c..525958bb56 100644
--- a/convert.c
+++ b/convert.c
@@ -194,30 +194,30 @@ static enum eol output_eol(enum crlf_action crlf_action)
 	return core_eol;
 }

-static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
+static void check_global_conv_flags_eol(const char *path, enum crlf_action crlf_action,
 			    struct text_stat *old_stats, struct text_stat *new_stats,
-			    enum safe_crlf checksafe)
+			    int conv_flags)
 {
 	if (old_stats->crlf && !new_stats->crlf ) {
 		/*
 		 * CRLFs would not be restored by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (conv_flags & CONV_EOL_RNDTRP_DIE)
+			die(_("CRLF would be replaced by LF in %s."), path);
+		else if (conv_flags & CONV_EOL_RNDTRP_WARN)
 			warning(_("CRLF will be replaced by LF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
-		else /* i.e. SAFE_CRLF_FAIL */
-			die(_("CRLF would be replaced by LF in %s."), path);
 	} else if (old_stats->lonelf && !new_stats->lonelf ) {
 		/*
 		 * CRLFs would be added by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (conv_flags & CONV_EOL_RNDTRP_DIE)
+			die(_("LF would be replaced by CRLF in %s"), path);
+		else if (conv_flags & CONV_EOL_RNDTRP_WARN)
 			warning(_("LF will be replaced by CRLF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
-		else /* i.e. SAFE_CRLF_FAIL */
-			die(_("LF would be replaced by CRLF in %s"), path);
 	}
 }

@@ -287,7 +287,7 @@ static struct encoding {
 static const char *default_encoding = "UTF-8";

 static int encode_to_git(const char *path, const char *src, size_t src_len,
-			 struct strbuf *buf, struct encoding *enc, int write_obj)
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
 {
 	char *dst;
 	int dst_len;
@@ -318,7 +318,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			"file.");

 		advise(advise_msg, path, enc->name, enc->name, enc->name);
-		if (write_obj)
+		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
 			error(error_msg, path, enc->name);
@@ -333,7 +333,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			"%sBE/%sLE as checkout encoding or add a BOM to the "
 			"file.");
 		advise(advise_msg, path, enc->name, enc->name, enc->name);
-		if (write_obj)
+		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
 			error(error_msg, path, enc->name);
@@ -350,7 +350,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		 * working tree. Let's try to avoid this by screaming loud.
 		 */
 		const char* msg = _("failed to encode '%s' from %s to %s");
-		if (write_obj)
+		if (conv_flags & CONV_WRITE_OBJECT)
 			die(msg, path, enc->name, default_encoding);
 		else
 			error(msg, path, enc->name, default_encoding);
@@ -373,7 +373,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	 * [1] http://unicode.org/faq/utf_bom.html#gen2
 	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
 	 */
-	if (write_obj && !strcmp(enc->name, "SHIFT-JIS")) {
+	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
 		char *re_src;
 		int re_src_len;

@@ -388,7 +388,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		    memcmp(src, re_src, src_len)) {
 			const char* msg = _("encoding '%s' from %s to %s and "
 					    "back is not the same");
-			if (write_obj)
+			if (conv_flags & CONV_WRITE_OBJECT)
 				die(msg, path, enc->name, default_encoding);
 			else
 				error(msg, path, enc->name, default_encoding);
@@ -429,7 +429,7 @@ static int encode_to_worktree(const char *path, const char *src, size_t src_len,
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
-		       enum crlf_action crlf_action, enum safe_crlf checksafe)
+		       enum crlf_action crlf_action, int conv_flags)
 {
 	struct text_stat stats;
 	char *dst;
@@ -459,12 +459,12 @@ static int crlf_to_git(const struct index_state *istate,
 		 * unless we want to renormalize in a merge or
 		 * cherry-pick.
 		 */
-		if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+		if ((!(conv_flags & CONV_EOL_RENORMALIZE)) &&
 		    has_cr_in_index(istate, path))
 			convert_crlf_into_lf = 0;
 	}
-	if ((checksafe == SAFE_CRLF_WARN ||
-	    (checksafe == SAFE_CRLF_FAIL)) && len) {
+	if (((conv_flags & CONV_EOL_RNDTRP_WARN) ||
+	     ((conv_flags & CONV_EOL_RNDTRP_DIE) && len))) {
 		struct text_stat new_stats;
 		memcpy(&new_stats, &stats, sizeof(new_stats));
 		/* simulate "git add" */
@@ -477,7 +477,7 @@ static int crlf_to_git(const struct index_state *istate,
 			new_stats.crlf += new_stats.lonelf;
 			new_stats.lonelf = 0;
 		}
-		check_safe_crlf(path, crlf_action, &stats, &new_stats, checksafe);
+		check_global_conv_flags_eol(path, crlf_action, &stats, &new_stats, conv_flags);
 	}
 	if (!convert_crlf_into_lf)
 		return 0;
@@ -1319,7 +1319,7 @@ const char *get_convert_attr_ascii(const char *path)

 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-		   struct strbuf *dst, enum safe_crlf checksafe, int write_obj)
+		   struct strbuf *dst, int conv_flags)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1335,14 +1335,14 @@ int convert_to_git(const struct index_state *istate,
 		len = dst->len;
 	}

-	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, write_obj);
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
 	if (ret && dst) {
 		src = dst->buf;
 		len = dst->len;
 	}

-	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
-		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
+		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
 			src = dst->buf;
 			len = dst->len;
@@ -1353,7 +1353,7 @@ int convert_to_git(const struct index_state *istate,

 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe, int write_obj)
+			      int conv_flags)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
@@ -1364,8 +1364,8 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

-	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, write_obj);
-	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
+	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }

@@ -1430,7 +1430,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
 		src = dst->buf;
 		len = dst->len;
 	}
-	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE, 0);
+	return ret | convert_to_git(istate, path, src, len, dst, CONV_EOL_RENORMALIZE);
 }

 /*****************************************************************
diff --git a/convert.h b/convert.h
index 9e4e884ec1..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -8,15 +8,13 @@

 struct index_state;

-enum safe_crlf {
-	SAFE_CRLF_FALSE = 0,
-	SAFE_CRLF_FAIL = 1,
-	SAFE_CRLF_WARN = 2,
-	SAFE_CRLF_RENORMALIZE = 3,
-	SAFE_CRLF_KEEP_CRLF = 4
-};
+#define CONV_EOL_RNDTRP_DIE   (1<<0) /* Die if CRLF to LF to CRLF is different */
+#define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
+#define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
+#define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */

-extern enum safe_crlf safe_crlf;
+extern int global_conv_flags_eol;

 enum auto_crlf {
 	AUTO_CRLF_FALSE = 0,
@@ -66,8 +64,7 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe,
-			  int write_obj);
+			  struct strbuf *dst, int conv_flags);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -80,14 +77,13 @@ extern int renormalize_buffer(const struct index_state *istate,
 static inline int would_convert_to_git(const struct index_state *istate,
 				       const char *path)
 {
-	return convert_to_git(istate, path, NULL, 0, NULL, 0, 0);
+	return convert_to_git(istate, path, NULL, 0, NULL, 0);
 }
 /* Precondition: would_convert_to_git_filter_fd(path) == true */
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe,
-				     int write_obj);
+				     int conv_flags);
 extern int would_convert_to_git_filter_fd(const char *path);

 /*****************************************************************
diff --git a/diff.c b/diff.c
index 16ca0bf0df..fe3ff6e876 100644
--- a/diff.c
+++ b/diff.c
@@ -3516,13 +3516,13 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 {
 	int size_only = flags & CHECK_SIZE_ONLY;
 	int err = 0;
+	int conv_flags = global_conv_flags_eol;
 	/*
 	 * demote FAIL to WARN to allow inspecting the situation
 	 * instead of refusing.
 	 */
-	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
-				    ? SAFE_CRLF_WARN
-				    : safe_crlf);
+	if (conv_flags & CONV_EOL_RNDTRP_DIE)
+		conv_flags = CONV_EOL_RNDTRP_WARN;

 	if (!DIFF_FILE_VALID(s))
 		die("internal error: asking to populate invalid file.");
@@ -3599,7 +3599,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn, 0)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, conv_flags)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/environment.c b/environment.c
index 8fa032f307..888b873f90 100644
--- a/environment.c
+++ b/environment.c
@@ -49,7 +49,7 @@ enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
 int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
-enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
+int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/sha1_file.c b/sha1_file.c
index 75800248d2..dcb02e9ffd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -75,14 +75,14 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
 }


-static enum safe_crlf get_safe_crlf(unsigned flags)
+static int get_conv_flags(unsigned flags)
 {
 	if (flags & HASH_RENORMALIZE)
-		return SAFE_CRLF_RENORMALIZE;
+		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-		return safe_crlf;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
-		return SAFE_CRLF_FALSE;
+		return 0;
 }


@@ -1694,8 +1694,7 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags),
-				   write_object)) {
+				   get_conv_flags(flags))) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1729,7 +1728,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));

 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags), write_object);
+				 get_conv_flags(flags));

 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
diff --git a/strbuf.c b/strbuf.c
index 54276e96e7..703a1556cb 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -756,7 +756,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }

@@ -769,7 +768,6 @@ char *xstrdup_toupper(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = toupper(string[i]);
-	result[i] = '\0';
 	return result;
 }

diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
index df3cc91269..5f1c911c07 100755
--- a/t/t0028-checkout-encoding.sh
+++ b/t/t0028-checkout-encoding.sh
@@ -7,11 +7,10 @@ test_description='checkout-encoding conversion via gitattributes'
 GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING

 test_expect_success 'setup test repo' '
+	git config core.eol lf &&

 	text="hallo there!\ncan you read me?" &&
-
 	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
-
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
 	cp test.utf16.raw test.utf16 &&
diff --git a/utf8.c b/utf8.c
index 1978d6c42a..f033fec1c2 100644
--- a/utf8.c
+++ b/utf8.c
@@ -544,10 +544,10 @@ static int has_bom_prefix(const char *data, size_t len,
 	return (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

-const char utf16_be_bom[] = {0xFE, 0xFF};
-const char utf16_le_bom[] = {0xFF, 0xFE};
-const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
-const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};

 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {


### Patches

Lars Schneider (6):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add support for 'checkout-encoding' attribute
  convert: add tracing for checkout-encoding

Torsten Bögershausen (1):
  convert_to_git(): safe_crlf/checksafe becomes int conv_flags

 Documentation/gitattributes.txt |  60 ++++++++++
 apply.c                         |   6 +-
 combine-diff.c                  |   2 +-
 config.c                        |   7 +-
 convert.c                       | 256 ++++++++++++++++++++++++++++++++++++----
 convert.h                       |  18 ++-
 diff.c                          |   8 +-
 environment.c                   |   2 +-
 sha1_file.c                     |  12 +-
 strbuf.c                        |  13 +-
 strbuf.h                        |   1 +
 t/t0028-checkout-encoding.sh    | 198 +++++++++++++++++++++++++++++++
 utf8.c                          |  37 ++++++
 utf8.h                          |  25 ++++
 14 files changed, 597 insertions(+), 48 deletions(-)
 create mode 100755 t/t0028-checkout-encoding.sh


base-commit: 95ec6b1b3393eb6e26da40c565520a8db9796e9f
--
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-06  0:48 ` [PATCH v3 2/7] strbuf: add xstrdup_toupper() lars.schneider
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 323c49ceb3..b5d03a5029 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -756,7 +756,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/7] strbuf: add xstrdup_toupper()
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
  2018-01-06  0:48 ` [PATCH v3 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-06  0:48 ` [PATCH v3 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index b5d03a5029..703a1556cb 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -759,6 +759,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 0a74acb236..2bc148526f 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -616,6 +616,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
  2018-01-06  0:48 ` [PATCH v3 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-01-06  0:48 ` [PATCH v3 2/7] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-06  0:48 ` [PATCH v3 4/7] utf8: add function to detect a missing " lars.schneider
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-01-06  0:48 ` [PATCH v3 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-06  0:48 ` [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags lars.schneider
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..f033fec1c2 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..26b5e91852 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-01-06  0:48 ` [PATCH v3 4/7] utf8: add function to detect a missing " lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-08 21:28   ` Junio C Hamano
  2018-01-06  0:48 ` [PATCH v3 6/7] convert: add support for 'checkout-encoding' attribute lars.schneider
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Torsten Bögershausen <tboegi@web.de>

When calling convert_to_git(), the checksafe parameter defined what
should happen if the EOL conversion (CRLF --> LF --> CRLF) does not
roundtrip cleanly. In addition, it also defined if line endings should
be renormalized (CRLF --> LF) or kept as they are.

checksafe was an safe_crlf enum with these values:
SAFE_CRLF_FALSE:       do nothing in case of EOL roundtrip errors
SAFE_CRLF_FAIL:        die in case of EOL roundtrip errors
SAFE_CRLF_WARN:        print a warning in case of EOL roundtrip errors
SAFE_CRLF_RENORMALIZE: change CRLF to LF
SAFE_CRLF_KEEP_CRLF:   keep all line endings as they are

In some cases the integer value 0 was passed as checksafe parameter
instead of the correct enum value SAFE_CRLF_FALSE. That was no problem
because SAFE_CRLF_FALSE is defined as 0.

FALSE/FAIL/WARN are different from RENORMALIZE and KEEP_CRLF. Therefore,
an enum is not ideal. Let's use a integer bit pattern instead and rename
the parameter to conv_flags to make it more generically usable. This
allows us to extend the bit pattern in a subsequent commit.

Helped-By: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 apply.c        |  6 +++---
 combine-diff.c |  2 +-
 config.c       |  7 +++++--
 convert.c      | 38 +++++++++++++++++++-------------------
 convert.h      | 17 +++++++----------
 diff.c         |  8 ++++----
 environment.c  |  2 +-
 sha1_file.c    | 12 ++++++------
 8 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/apply.c b/apply.c
index 321a9fa68d..f8b67bfee2 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 			 const char *path, struct strbuf *buf)
 {
-	enum safe_crlf safe_crlf = patch->crlf_in_old ?
-		SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
+	int conv_flags = patch->crlf_in_old ?
+		CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE;
 	switch (st->st_mode & S_IFMT) {
 	case S_IFLNK:
 		if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
 		 * should never look at the index when explicit crlf option
 		 * is given.
 		 */
-		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
+		convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags);
 		return 0;
 	default:
 		return -1;
diff --git a/combine-diff.c b/combine-diff.c
index 2505de119a..19f30c3353 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 			if (is_file) {
 				struct strbuf buf = STRBUF_INIT;
 
-				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
+				if (convert_to_git(&the_index, elem->path, result, len, &buf, global_conv_flags_eol)) {
 					free(result);
 					result = strbuf_detach(&buf, &len);
 					result_size = len;
diff --git a/config.c b/config.c
index e617c2018d..1f003fbb90 100644
--- a/config.c
+++ b/config.c
@@ -1149,11 +1149,14 @@ static int git_default_core_config(const char *var, const char *value)
 	}
 
 	if (!strcmp(var, "core.safecrlf")) {
+		int eol_rndtrp_die;
 		if (value && !strcasecmp(value, "warn")) {
-			safe_crlf = SAFE_CRLF_WARN;
+			global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
 			return 0;
 		}
-		safe_crlf = git_config_bool(var, value);
+		eol_rndtrp_die = git_config_bool(var, value);
+		global_conv_flags_eol = eol_rndtrp_die ?
+			CONV_EOL_RNDTRP_DIE : CONV_EOL_RNDTRP_WARN;
 		return 0;
 	}
 
diff --git a/convert.c b/convert.c
index 20d7ab67bd..f39150cde9 100644
--- a/convert.c
+++ b/convert.c
@@ -193,30 +193,30 @@ static enum eol output_eol(enum crlf_action crlf_action)
 	return core_eol;
 }
 
-static void check_safe_crlf(const char *path, enum crlf_action crlf_action,
+static void check_global_conv_flags_eol(const char *path, enum crlf_action crlf_action,
 			    struct text_stat *old_stats, struct text_stat *new_stats,
-			    enum safe_crlf checksafe)
+			    int conv_flags)
 {
 	if (old_stats->crlf && !new_stats->crlf ) {
 		/*
 		 * CRLFs would not be restored by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (conv_flags & CONV_EOL_RNDTRP_DIE)
+			die(_("CRLF would be replaced by LF in %s."), path);
+		else if (conv_flags & CONV_EOL_RNDTRP_WARN)
 			warning(_("CRLF will be replaced by LF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
-		else /* i.e. SAFE_CRLF_FAIL */
-			die(_("CRLF would be replaced by LF in %s."), path);
 	} else if (old_stats->lonelf && !new_stats->lonelf ) {
 		/*
 		 * CRLFs would be added by checkout
 		 */
-		if (checksafe == SAFE_CRLF_WARN)
+		if (conv_flags & CONV_EOL_RNDTRP_DIE)
+			die(_("LF would be replaced by CRLF in %s"), path);
+		else if (conv_flags & CONV_EOL_RNDTRP_WARN)
 			warning(_("LF will be replaced by CRLF in %s.\n"
 				  "The file will have its original line"
 				  " endings in your working directory."), path);
-		else /* i.e. SAFE_CRLF_FAIL */
-			die(_("LF would be replaced by CRLF in %s"), path);
 	}
 }
 
@@ -259,7 +259,7 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
-		       enum crlf_action crlf_action, enum safe_crlf checksafe)
+		       enum crlf_action crlf_action, int conv_flags)
 {
 	struct text_stat stats;
 	char *dst;
@@ -289,12 +289,12 @@ static int crlf_to_git(const struct index_state *istate,
 		 * unless we want to renormalize in a merge or
 		 * cherry-pick.
 		 */
-		if ((checksafe != SAFE_CRLF_RENORMALIZE) &&
+		if ((!(conv_flags & CONV_EOL_RENORMALIZE)) &&
 		    has_cr_in_index(istate, path))
 			convert_crlf_into_lf = 0;
 	}
-	if ((checksafe == SAFE_CRLF_WARN ||
-	    (checksafe == SAFE_CRLF_FAIL)) && len) {
+	if (((conv_flags & CONV_EOL_RNDTRP_WARN) ||
+	     ((conv_flags & CONV_EOL_RNDTRP_DIE) && len))) {
 		struct text_stat new_stats;
 		memcpy(&new_stats, &stats, sizeof(new_stats));
 		/* simulate "git add" */
@@ -307,7 +307,7 @@ static int crlf_to_git(const struct index_state *istate,
 			new_stats.crlf += new_stats.lonelf;
 			new_stats.lonelf = 0;
 		}
-		check_safe_crlf(path, crlf_action, &stats, &new_stats, checksafe);
+		check_global_conv_flags_eol(path, crlf_action, &stats, &new_stats, conv_flags);
 	}
 	if (!convert_crlf_into_lf)
 		return 0;
@@ -1120,7 +1120,7 @@ const char *get_convert_attr_ascii(const char *path)
 
 int convert_to_git(const struct index_state *istate,
 		   const char *path, const char *src, size_t len,
-                   struct strbuf *dst, enum safe_crlf checksafe)
+		   struct strbuf *dst, int conv_flags)
 {
 	int ret = 0;
 	struct conv_attrs ca;
@@ -1135,8 +1135,8 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
-	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
-		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
+	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
+		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
 			src = dst->buf;
 			len = dst->len;
@@ -1147,7 +1147,7 @@ int convert_to_git(const struct index_state *istate,
 
 void convert_to_git_filter_fd(const struct index_state *istate,
 			      const char *path, int fd, struct strbuf *dst,
-			      enum safe_crlf checksafe)
+			      int conv_flags)
 {
 	struct conv_attrs ca;
 	convert_attrs(&ca, path);
@@ -1158,7 +1158,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
-	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
+	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
 
@@ -1217,7 +1217,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
 		src = dst->buf;
 		len = dst->len;
 	}
-	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE);
+	return ret | convert_to_git(istate, path, src, len, dst, CONV_EOL_RENORMALIZE);
 }
 
 /*****************************************************************
diff --git a/convert.h b/convert.h
index 4f2da225a8..65ab3e5167 100644
--- a/convert.h
+++ b/convert.h
@@ -8,15 +8,12 @@
 
 struct index_state;
 
-enum safe_crlf {
-	SAFE_CRLF_FALSE = 0,
-	SAFE_CRLF_FAIL = 1,
-	SAFE_CRLF_WARN = 2,
-	SAFE_CRLF_RENORMALIZE = 3,
-	SAFE_CRLF_KEEP_CRLF = 4
-};
+#define CONV_EOL_RNDTRP_DIE   (1<<0) /* Die if CRLF to LF to CRLF is different */
+#define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
+#define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
+#define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
 
-extern enum safe_crlf safe_crlf;
+extern int global_conv_flags_eol;
 
 enum auto_crlf {
 	AUTO_CRLF_FALSE = 0,
@@ -66,7 +63,7 @@ extern const char *get_convert_attr_ascii(const char *path);
 /* returns 1 if *dst was used */
 extern int convert_to_git(const struct index_state *istate,
 			  const char *path, const char *src, size_t len,
-			  struct strbuf *dst, enum safe_crlf checksafe);
+			  struct strbuf *dst, int conv_flags);
 extern int convert_to_working_tree(const char *path, const char *src,
 				   size_t len, struct strbuf *dst);
 extern int async_convert_to_working_tree(const char *path, const char *src,
@@ -85,7 +82,7 @@ static inline int would_convert_to_git(const struct index_state *istate,
 extern void convert_to_git_filter_fd(const struct index_state *istate,
 				     const char *path, int fd,
 				     struct strbuf *dst,
-				     enum safe_crlf checksafe);
+				     int conv_flags);
 extern int would_convert_to_git_filter_fd(const char *path);
 
 /*****************************************************************
diff --git a/diff.c b/diff.c
index 2ebe2227b4..fe3ff6e876 100644
--- a/diff.c
+++ b/diff.c
@@ -3516,13 +3516,13 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 {
 	int size_only = flags & CHECK_SIZE_ONLY;
 	int err = 0;
+	int conv_flags = global_conv_flags_eol;
 	/*
 	 * demote FAIL to WARN to allow inspecting the situation
 	 * instead of refusing.
 	 */
-	enum safe_crlf crlf_warn = (safe_crlf == SAFE_CRLF_FAIL
-				    ? SAFE_CRLF_WARN
-				    : safe_crlf);
+	if (conv_flags & CONV_EOL_RNDTRP_DIE)
+		conv_flags = CONV_EOL_RNDTRP_WARN;
 
 	if (!DIFF_FILE_VALID(s))
 		die("internal error: asking to populate invalid file.");
@@ -3599,7 +3599,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
 		/*
 		 * Convert from working tree format to canonical git format
 		 */
-		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
+		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, conv_flags)) {
 			size_t size = 0;
 			munmap(s->data, s->size);
 			s->should_munmap = 0;
diff --git a/environment.c b/environment.c
index 8fa032f307..888b873f90 100644
--- a/environment.c
+++ b/environment.c
@@ -49,7 +49,7 @@ enum auto_crlf auto_crlf = AUTO_CRLF_FALSE;
 int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
-enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
+int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/sha1_file.c b/sha1_file.c
index afe4b90f6e..dcb02e9ffd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -75,14 +75,14 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
 }
 
 
-static enum safe_crlf get_safe_crlf(unsigned flags)
+static int get_conv_flags(unsigned flags)
 {
 	if (flags & HASH_RENORMALIZE)
-		return SAFE_CRLF_RENORMALIZE;
+		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-		return safe_crlf;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
-		return SAFE_CRLF_FALSE;
+		return 0;
 }
 
 
@@ -1694,7 +1694,7 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
 	if ((type == OBJ_BLOB) && path) {
 		struct strbuf nbuf = STRBUF_INIT;
 		if (convert_to_git(&the_index, path, buf, size, &nbuf,
-				   get_safe_crlf(flags))) {
+				   get_conv_flags(flags))) {
 			buf = strbuf_detach(&nbuf, &size);
 			re_allocated = 1;
 		}
@@ -1728,7 +1728,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
 	assert(would_convert_to_git_filter_fd(path));
 
 	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
-				 get_safe_crlf(flags));
+				 get_conv_flags(flags));
 
 	if (write_object)
 		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 6/7] convert: add support for 'checkout-encoding' attribute
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-01-06  0:48 ` [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-06  0:48 ` [PATCH v3 7/7] convert: add tracing for checkout-encoding lars.schneider
  2018-01-07  9:38 ` [PATCH v3 0/7] convert: add support for different encodings Torsten Bögershausen
  7 siblings, 0 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to teach Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt |  60 ++++++++++++
 convert.c                       | 190 +++++++++++++++++++++++++++++++++++++-
 convert.h                       |   1 +
 t/t0028-checkout-encoding.sh    | 196 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 446 insertions(+), 1 deletion(-)
 create mode 100755 t/t0028-checkout-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..1bc03e69cb 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`checkout-encoding`
+^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can teach Git the encoding of a file in the working
+directory with the `checkout-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `checkout-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `checkout-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+  errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `checkout-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt		text checkout-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index f39150cde9..13f766d2a2 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -256,6 +257,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as checkout encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as checkout encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	/*
+	 * UTF supports lossless round tripping [1]. UTF to other encoding are
+	 * mostly round trip safe as Unicode aims to be a superset of all other
+	 * character encodings. However, the SHIFT-JIS (Japanese character set)
+	 * is an exception as some codes are not round trip safe [2].
+	 *
+	 * Reverse the transformation of 'dst' and check the result with 'src'
+	 * if content is written to Git. This ensures no information is lost
+	 * during conversion to/from UTF-8.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without iconv error.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			if (conv_flags & CONV_WRITE_OBJECT)
+				die(msg, path, enc->name, default_encoding);
+			else
+				error(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -969,6 +1111,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1024,6 +1191,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1032,8 +1200,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "checkout-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1055,6 +1225,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1135,6 +1306,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1158,6 +1336,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1189,6 +1368,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1655,6 +1840,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.checkout_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
new file mode 100755
index 0000000000..3a9951fdf3
--- /dev/null
+++ b/t/t0028-checkout-encoding.sh
@@ -0,0 +1,196 @@
+#!/bin/sh
+
+test_description='checkout-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text checkout-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 7/7] convert: add tracing for checkout-encoding
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-01-06  0:48 ` [PATCH v3 6/7] convert: add support for 'checkout-encoding' attribute lars.schneider
@ 2018-01-06  0:48 ` lars.schneider
  2018-01-07  9:38 ` [PATCH v3 0/7] convert: add support for different encodings Torsten Bögershausen
  7 siblings, 0 replies; 15+ messages in thread
From: lars.schneider @ 2018-01-06  0:48 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_CHECKOUT_ENCODING environment variable to enable
tracing for content that is reencoded with the checkout-encoding
attribute.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                    | 28 ++++++++++++++++++++++++++++
 t/t0028-checkout-encoding.sh |  2 ++
 2 files changed, 30 insertions(+)

diff --git a/convert.c b/convert.c
index 13f766d2a2..525958bb56 100644
--- a/convert.c
+++ b/convert.c
@@ -257,6 +257,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -316,6 +339,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -331,6 +355,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	/*
 	 * UTF supports lossless round tripping [1]. UTF to other encoding are
@@ -356,6 +381,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 					     enc->name, default_encoding,
 					     &re_src_len);
 
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
 		if (!re_src || src_len != re_src_len ||
 		    memcmp(src, re_src, src_len)) {
 			const char* msg = _("encoding '%s' from %s to %s and "
diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
index 3a9951fdf3..5f1c911c07 100755
--- a/t/t0028-checkout-encoding.sh
+++ b/t/t0028-checkout-encoding.sh
@@ -4,6 +4,8 @@ test_description='checkout-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
 test_expect_success 'setup test repo' '
 	git config core.eol lf &&
 
-- 
2.15.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/7] convert: add support for different encodings
  2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
                   ` (6 preceding siblings ...)
  2018-01-06  0:48 ` [PATCH v3 7/7] convert: add tracing for checkout-encoding lars.schneider
@ 2018-01-07  9:38 ` Torsten Bögershausen
  2018-01-08 14:38   ` Lars Schneider
  7 siblings, 1 reply; 15+ messages in thread
From: Torsten Bögershausen @ 2018-01-07  9:38 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, gitster, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

On Sat, Jan 06, 2018 at 01:48:01AM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Hi,
> 
> Patches 1-5 and 6 are helper functions and preparation.
> Patch 6 is the actual change.
> 
> I am still torn between "checkout-encoding" and "working-tree-encoding"
> as attribute name. I am happy to hear arguments for/against one or the
> other.

checkout-encoding is probably misleading, as it is even the checkin-encoding.

What is wrong with working-tree-encoding ?
I think the 2 "-".

What was wrong with workingtree-encoding ?
Or
workdir-encoding ?



> 
> Changes since v2:
> 
> * Added Torsten's crlfsave refactoring patch (patch 5)
>   @Torsten: I tried to make the commit message more clean, added
>             some comments to and renamed conv_flags_eol to
>             global_conv_flags_eol.
> 
> * Improved documentation and commit message (Torsten)

Good, thanks.
> 
> * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten)
> 
> * Set "git config core.eol lf" to made the test run on Windows (Dscho)
> 
> * Made BOM arrays static (Ramsay)


Some comments:

I would like to have the CRLF conversion a little bit more strict -
many users tend to set core.autocrlf=true or write "* text=auto"
in the .gitattributes.
Reading all the effort about BOM markers and UTF-16LE, I think there
should ne some effort to make the line endings round trip.
Therefore I changed convert.c to demand that the "text" attribute
is set to enable CRLF conversions.
(If I had submitted the patch, I would have demanded
"text eol=lf" or "text eol=crlf", but the test case t0028 indicates
that there is a demand to produce line endings as configured in core.eol)

Anyway, I rebased it onto git.git/master, changed the docu, and pushed it to
https://github.com/tboegi/git/tree/180107-0935-For-lars-schneider-encode-V3B

Here is a inter-diff against your version:

 diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
 index 1bc03e69c..b8d9f91c8 100644
 --- a/Documentation/gitattributes.txt
 +++ b/Documentation/gitattributes.txt
 @@ -281,7 +281,7 @@ interpreted as binary and consequently built-in Git text processing
  tools (e.g. 'git diff') as well as most Git web front ends do not
  visualize the content.
  
 -In these cases you can teach Git the encoding of a file in the working
 +In these cases you can tell Git the encoding of a file in the working
  directory with the `checkout-encoding` attribute. If a file with this
  attributes is added to Git, then Git reencodes the content from the
  specified encoding to UTF-8 and stores the result in its internal data
 @@ -308,17 +308,20 @@ Use the `checkout-encoding` attribute only if you cannot store a file in
  UTF-8 encoding and if you want Git to be able to process the content as
  text.
  
 +Note that when `checkout-encoding` is defined, by default the line
 +endings are not converted. `text=auto` and core.autocrlf are ignored.
 +Set the `text` attribute to enable CRLF conversions.
 +
  Use the following attributes if your '*.txt' files are UTF-16 encoded
 -with byte order mark (BOM) and you want Git to perform automatic line
 -ending conversion based on your platform.
 +with byte order mark (BOM).
  
  ------------------------
 -*.txt		text checkout-encoding=UTF-16
 +*.txt		checkout-encoding=UTF-16
  ------------------------
  
  Use the following attributes if your '*.txt' files are UTF-16 little
 -endian encoded without BOM and you want Git to use Windows line endings
 -in the working directory.
 +endian encoded without BOM and you want Git to use LF in the repo and
 +CRLF in the working directory.
  
  ------------------------
  *.txt 		checkout-encoding=UTF-16LE text eol=CRLF
 diff --git a/convert.c b/convert.c
 index 13f766d2a..1e29f515e 100644
 --- a/convert.c
 +++ b/convert.c
 @@ -221,18 +221,27 @@ static void check_global_conv_flags_eol(const char *path, enum crlf_action crlf_
  	}
  }
  
  
  static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 @@ -432,7 +441,7 @@ static int crlf_to_git(const struct index_state *istate,
  		 * cherry-pick.
  		 */
  		if ((!(conv_flags & CONV_EOL_RENORMALIZE)) &&
 -		    has_cr_in_index(istate, path))
 +		    has_crlf_in_index(istate, path))
  			convert_crlf_into_lf = 0;
  	}
  	if (((conv_flags & CONV_EOL_RNDTRP_WARN) ||
 @@ -1214,9 +1223,28 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
  			ca->crlf_action = git_path_check_crlf(ccheck + 0);
  		ca->ident = git_path_check_ident(ccheck + 1);
  		ca->drv = git_path_check_convert(ccheck + 2);
 +		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
  		if (ca->crlf_action != CRLF_BINARY) {
  			enum eol eol_attr = git_path_check_eol(ccheck + 3);
 -			if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_LF)
 +			if (ca->checkout_encoding) {
 +				enum crlf_action crlf_action = CRLF_BINARY;
 +				/*
 +				 * encoded files don't use auto.
 +				 * 'text' must be specified to
 +				 * do crlf conversions
 +				 */
 +				if (ca->crlf_action == CRLF_TEXT) {
 +					if (eol_attr == EOL_LF)
 +						crlf_action = CRLF_TEXT_INPUT;
 +					else if (eol_attr == EOL_CRLF)
 +						crlf_action = CRLF_TEXT_CRLF;
 +					else if (text_eol_is_crlf())
 +						crlf_action = CRLF_TEXT_CRLF;
 +					else
 +						crlf_action = CRLF_TEXT_INPUT;
 +				}
 +				ca->crlf_action = crlf_action;
 +			} else if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_LF)
  				ca->crlf_action = CRLF_AUTO_INPUT;
  			else if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_CRLF)
  				ca->crlf_action = CRLF_AUTO_CRLF;
 @@ -1225,11 +1253,11 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
  			else if (eol_attr == EOL_CRLF)
  				ca->crlf_action = CRLF_TEXT_CRLF;
  		}
 -		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
  	} else {
  		ca->drv = NULL;
  		ca->crlf_action = CRLF_UNDEFINED;
  		ca->ident = 0;
 +		ca->checkout_encoding = NULL;
  	}
  
  	/* Save attr and make a decision for action */



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/7] convert: add support for different encodings
  2018-01-07  9:38 ` [PATCH v3 0/7] convert: add support for different encodings Torsten Bögershausen
@ 2018-01-08 14:38   ` Lars Schneider
  2018-01-08 18:08     ` Torsten Bögershausen
  0 siblings, 1 reply; 15+ messages in thread
From: Lars Schneider @ 2018-01-08 14:38 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, Git List, Junio C Hamano, Johannes Sixt,
	Eric Sunshine, Jeff King, Ramsay Jones, Johannes.Schindelin,
	Patrick Lühne


> On 07 Jan 2018, at 10:38, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Sat, Jan 06, 2018 at 01:48:01AM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> Hi,
>> 
>> Patches 1-5 and 6 are helper functions and preparation.
>> Patch 6 is the actual change.
>> 
>> I am still torn between "checkout-encoding" and "working-tree-encoding"
>> as attribute name. I am happy to hear arguments for/against one or the
>> other.
> 
> checkout-encoding is probably misleading, as it is even the checkin-encoding.

Yeah, I start to think the same.


> What is wrong with working-tree-encoding ?
> I think the 2 "-".
> 
> What was wrong with workingtree-encoding ?

Yeah, the two dashes are a minor annoyance.

However, consider this:

$ git grep 'working tree' -- '*.txt' | wc -l
     570

$ git grep 'working-tree' -- '*.txt' | wc -l
       6

$ git grep 'workingtree' -- '*.txt' | wc -l
       0


$ git grep 'working tree' -- po | wc -l
     704

$ git grep 'working-tree' -- po | wc -l
       0

$ git grep 'workingtree' -- po | wc -l
       0

I think "working tree" is a pretty established term that
endusers might be able to understand. Therefore, I would
like to go with "working-tree-encoding" as it was written
that way at least 6 times in the Git tree before.

Would that work for you?


> Or
> workdir-encoding ?

Although I like the shortness, the term "workdir" might already 
be occupied [1]. Could that cause confusion?

[1] 4f01748d51 (contrib/workdir: add a simple script to create a working directory, 2007-03-27)


>> 
>> * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten)
>> 
>> * Set "git config core.eol lf" to made the test run on Windows (Dscho)
>> 
>> * Made BOM arrays static (Ramsay)
> 
> 
> Some comments:
> 
> I would like to have the CRLF conversion a little bit more strict -
> many users tend to set core.autocrlf=true or write "* text=auto"
> in the .gitattributes.
> Reading all the effort about BOM markers and UTF-16LE, I think there
> should ne some effort to make the line endings round trip.
> Therefore I changed convert.c to demand that the "text" attribute
> is set to enable CRLF conversions.
> (If I had submitted the patch, I would have demanded
> "text eol=lf" or "text eol=crlf", but the test case t0028 indicates
> that there is a demand to produce line endings as configured in core.eol)

But wouldn't that be inconvenient for the users? E.g. if I add a UTF-16
file on Windows with CRLF then it would be nice if Git would automatically
convert the line endings to LF on Linux, no?

IOW: Why should we handle text files that have a defined checkout-encoding
differently compared to UTF-8 encoded text files? Wouldn't that be unexpected
to the user?

Thanks,
Lars



> 
> Anyway, I rebased it onto git.git/master, changed the docu, and pushed it to
> https://github.com/tboegi/git/tree/180107-0935-For-lars-schneider-encode-V3B
> 
> Here is a inter-diff against your version:
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 1bc03e69c..b8d9f91c8 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -281,7 +281,7 @@ interpreted as binary and consequently built-in Git text processing
>  tools (e.g. 'git diff') as well as most Git web front ends do not
>  visualize the content.
> 
> -In these cases you can teach Git the encoding of a file in the working
> +In these cases you can tell Git the encoding of a file in the working

Oops. I meant to change that already. Thanks!

>  directory with the `checkout-encoding` attribute. If a file with this
>  attributes is added to Git, then Git reencodes the content from the
>  specified encoding to UTF-8 and stores the result in its internal data
> @@ -308,17 +308,20 @@ Use the `checkout-encoding` attribute only if you cannot store a file in
>  UTF-8 encoding and if you want Git to be able to process the content as
>  text.
> 
> +Note that when `checkout-encoding` is defined, by default the line
> +endings are not converted. `text=auto` and core.autocrlf are ignored.
> +Set the `text` attribute to enable CRLF conversions.
> +
>  Use the following attributes if your '*.txt' files are UTF-16 encoded
> -with byte order mark (BOM) and you want Git to perform automatic line
> -ending conversion based on your platform.
> +with byte order mark (BOM).
> 
>  ------------------------
> -*.txt		text checkout-encoding=UTF-16
> +*.txt		checkout-encoding=UTF-16
>  ------------------------
> 
>  Use the following attributes if your '*.txt' files are UTF-16 little
> -endian encoded without BOM and you want Git to use Windows line endings
> -in the working directory.
> +endian encoded without BOM and you want Git to use LF in the repo and
> +CRLF in the working directory.
> 
>  ------------------------
>  *.txt 		checkout-encoding=UTF-16LE text eol=CRLF
> diff --git a/convert.c b/convert.c
> index 13f766d2a..1e29f515e 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -221,18 +221,27 @@ static void check_global_conv_flags_eol(const char *path, enum crlf_action crlf_
>  	}
>  }
> 
> 
>  static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> @@ -432,7 +441,7 @@ static int crlf_to_git(const struct index_state *istate,
>  		 * cherry-pick.
>  		 */
>  		if ((!(conv_flags & CONV_EOL_RENORMALIZE)) &&
> -		    has_cr_in_index(istate, path))
> +		    has_crlf_in_index(istate, path))
>  			convert_crlf_into_lf = 0;
>  	}
>  	if (((conv_flags & CONV_EOL_RNDTRP_WARN) ||
> @@ -1214,9 +1223,28 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  			ca->crlf_action = git_path_check_crlf(ccheck + 0);
>  		ca->ident = git_path_check_ident(ccheck + 1);
>  		ca->drv = git_path_check_convert(ccheck + 2);
> +		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
>  		if (ca->crlf_action != CRLF_BINARY) {
>  			enum eol eol_attr = git_path_check_eol(ccheck + 3);
> -			if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_LF)
> +			if (ca->checkout_encoding) {
> +				enum crlf_action crlf_action = CRLF_BINARY;
> +				/*
> +				 * encoded files don't use auto.
> +				 * 'text' must be specified to
> +				 * do crlf conversions
> +				 */
> +				if (ca->crlf_action == CRLF_TEXT) {
> +					if (eol_attr == EOL_LF)
> +						crlf_action = CRLF_TEXT_INPUT;
> +					else if (eol_attr == EOL_CRLF)
> +						crlf_action = CRLF_TEXT_CRLF;
> +					else if (text_eol_is_crlf())
> +						crlf_action = CRLF_TEXT_CRLF;
> +					else
> +						crlf_action = CRLF_TEXT_INPUT;
> +				}
> +				ca->crlf_action = crlf_action;
> +			} else if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_LF)
>  				ca->crlf_action = CRLF_AUTO_INPUT;
>  			else if (ca->crlf_action == CRLF_AUTO && eol_attr == EOL_CRLF)
>  				ca->crlf_action = CRLF_AUTO_CRLF;
> @@ -1225,11 +1253,11 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  			else if (eol_attr == EOL_CRLF)
>  				ca->crlf_action = CRLF_TEXT_CRLF;
>  		}
> -		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
>  	} else {
>  		ca->drv = NULL;
>  		ca->crlf_action = CRLF_UNDEFINED;
>  		ca->ident = 0;
> +		ca->checkout_encoding = NULL;
>  	}
> 
>  	/* Save attr and make a decision for action */
> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/7] convert: add support for different encodings
  2018-01-08 14:38   ` Lars Schneider
@ 2018-01-08 18:08     ` Torsten Bögershausen
  0 siblings, 0 replies; 15+ messages in thread
From: Torsten Bögershausen @ 2018-01-08 18:08 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Lars Schneider, Git List, Junio C Hamano, Johannes Sixt,
	Eric Sunshine, Jeff King, Ramsay Jones, Johannes.Schindelin,
	Patrick Lühne

On Mon, Jan 08, 2018 at 03:38:48PM +0100, Lars Schneider wrote:
[]
> > Some comments:
> > 
> > I would like to have the CRLF conversion a little bit more strict -
> > many users tend to set core.autocrlf=true or write "* text=auto"
> > in the .gitattributes.
> > Reading all the effort about BOM markers and UTF-16LE, I think there
> > should ne some effort to make the line endings round trip.
> > Therefore I changed convert.c to demand that the "text" attribute
> > is set to enable CRLF conversions.
> > (If I had submitted the patch, I would have demanded
> > "text eol=lf" or "text eol=crlf", but the test case t0028 indicates
> > that there is a demand to produce line endings as configured in core.eol)
> 
> But wouldn't that be inconvenient for the users? E.g. if I add a UTF-16
> file on Windows with CRLF then it would be nice if Git would automatically
> convert the line endings to LF on Linux, no?
> 
> IOW: Why should we handle text files that have a defined checkout-encoding
> differently compared to UTF-8 encoded text files? Wouldn't that be unexpected
> to the user?
> 
> Thanks,
> Lars

The problem is, if user A has core.autocrlf=true and user B
core.autocrlf=false.
(The line endings don't show up as expected at user B)

Having said that in all shortness, you convinced me:
If text=auto, we care about line endings.
If text,  we care about line endings.

If the .gitattributes don't say anything about text,
we don't convert eol.
(In other words: we don't look at core.autocrlf, when checkout-encoding
is defined)

A new branch is push to github/tboegi



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags
  2018-01-06  0:48 ` [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags lars.schneider
@ 2018-01-08 21:28   ` Junio C Hamano
  2018-01-08 22:47     ` Lars Schneider
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2018-01-08 21:28 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> diff --git a/sha1_file.c b/sha1_file.c
> index afe4b90f6e..dcb02e9ffd 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -75,14 +75,14 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
>  }
>  
>  
> -static enum safe_crlf get_safe_crlf(unsigned flags)
> +static int get_conv_flags(unsigned flags)
>  {
>  	if (flags & HASH_RENORMALIZE)
> -		return SAFE_CRLF_RENORMALIZE;
> +		return CONV_EOL_RENORMALIZE;
>  	else if (flags & HASH_WRITE_OBJECT)
> -		return safe_crlf;
> +		return global_conv_flags_eol | CONV_WRITE_OBJECT;

This macro has not yet introduced at this point (it appears in 6/7
if I am not mistaken).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags
  2018-01-08 21:28   ` Junio C Hamano
@ 2018-01-08 22:47     ` Lars Schneider
  2018-01-08 23:17       ` Junio C Hamano
  2018-01-09  6:20       ` Torsten Bögershausen
  0 siblings, 2 replies; 15+ messages in thread
From: Lars Schneider @ 2018-01-08 22:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 08 Jan 2018, at 22:28, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> diff --git a/sha1_file.c b/sha1_file.c
>> index afe4b90f6e..dcb02e9ffd 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -75,14 +75,14 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
>> }
>> 
>> 
>> -static enum safe_crlf get_safe_crlf(unsigned flags)
>> +static int get_conv_flags(unsigned flags)
>> {
>> 	if (flags & HASH_RENORMALIZE)
>> -		return SAFE_CRLF_RENORMALIZE;
>> +		return CONV_EOL_RENORMALIZE;
>> 	else if (flags & HASH_WRITE_OBJECT)
>> -		return safe_crlf;
>> +		return global_conv_flags_eol | CONV_WRITE_OBJECT;
> 
> This macro has not yet introduced at this point (it appears in 6/7
> if I am not mistaken).

Nice catch. I'll fix that in the next iteration.

Is it OK if I send the next iteration soon or would you prefer
it if I wait until after 2.16 release?

Plus, is it ok to keep the base of the series or would you prefer
it if I rebase it to the latest master (because of a minor conflict)?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags
  2018-01-08 22:47     ` Lars Schneider
@ 2018-01-08 23:17       ` Junio C Hamano
  2018-01-09  6:20       ` Torsten Bögershausen
  1 sibling, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2018-01-08 23:17 UTC (permalink / raw)
  To: Lars Schneider
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

Lars Schneider <larsxschneider@gmail.com> writes:

> Nice catch. I'll fix that in the next iteration.
>
> Is it OK if I send the next iteration soon or would you prefer
> it if I wait until after 2.16 release?
>
> Plus, is it ok to keep the base of the series or would you prefer
> it if I rebase it to the latest master (because of a minor conflict)?

I do not see this topic as a fix for grave bug that needs to go to
older maintenance track---it is rather a new feature, isn't it?  So
a rebased series that cleanly applies on top of 2.16 final would be
a reasonable way to go forward.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags
  2018-01-08 22:47     ` Lars Schneider
  2018-01-08 23:17       ` Junio C Hamano
@ 2018-01-09  6:20       ` Torsten Bögershausen
  1 sibling, 0 replies; 15+ messages in thread
From: Torsten Bögershausen @ 2018-01-09  6:20 UTC (permalink / raw)
  To: Lars Schneider
  Cc: Junio C Hamano, lars.schneider, git, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin

On Mon, Jan 08, 2018 at 11:47:20PM +0100, Lars Schneider wrote:
> 
> > On 08 Jan 2018, at 22:28, Junio C Hamano <gitster@pobox.com> wrote:
> > 
> > lars.schneider@autodesk.com writes:
> > 
> >> diff --git a/sha1_file.c b/sha1_file.c
> >> index afe4b90f6e..dcb02e9ffd 100644
> >> --- a/sha1_file.c
> >> +++ b/sha1_file.c
> >> @@ -75,14 +75,14 @@ static struct cached_object *find_cached_object(const unsigned char *sha1)
> >> }
> >> 
> >> 
> >> -static enum safe_crlf get_safe_crlf(unsigned flags)
> >> +static int get_conv_flags(unsigned flags)
> >> {
> >> 	if (flags & HASH_RENORMALIZE)
> >> -		return SAFE_CRLF_RENORMALIZE;
> >> +		return CONV_EOL_RENORMALIZE;
> >> 	else if (flags & HASH_WRITE_OBJECT)
> >> -		return safe_crlf;
> >> +		return global_conv_flags_eol | CONV_WRITE_OBJECT;
> > 
> > This macro has not yet introduced at this point (it appears in 6/7
> > if I am not mistaken).
> 
> Nice catch. I'll fix that in the next iteration.
> 
> Is it OK if I send the next iteration soon or would you prefer
> it if I wait until after 2.16 release?
> 
> Plus, is it ok to keep the base of the series or would you prefer
> it if I rebase it to the latest master (because of a minor conflict)?
> 
> Thanks,
> Lars

I noticed the missing macro as well, while doing the rebase
to git.git/master, but forget to mention it here on the list

Lars, if you want, please have a look here:
https://github.com/tboegi/git/tree/180108-1858-For-lars-schneider-encode-V3C

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-01-09  6:20 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-06  0:48 [PATCH v3 0/7] convert: add support for different encodings lars.schneider
2018-01-06  0:48 ` [PATCH v3 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-01-06  0:48 ` [PATCH v3 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-01-06  0:48 ` [PATCH v3 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-01-06  0:48 ` [PATCH v3 4/7] utf8: add function to detect a missing " lars.schneider
2018-01-06  0:48 ` [PATCH v3 5/7] convert_to_git(): safe_crlf/checksafe becomes int conv_flags lars.schneider
2018-01-08 21:28   ` Junio C Hamano
2018-01-08 22:47     ` Lars Schneider
2018-01-08 23:17       ` Junio C Hamano
2018-01-09  6:20       ` Torsten Bögershausen
2018-01-06  0:48 ` [PATCH v3 6/7] convert: add support for 'checkout-encoding' attribute lars.schneider
2018-01-06  0:48 ` [PATCH v3 7/7] convert: add tracing for checkout-encoding lars.schneider
2018-01-07  9:38 ` [PATCH v3 0/7] convert: add support for different encodings Torsten Bögershausen
2018-01-08 14:38   ` Lars Schneider
2018-01-08 18:08     ` Torsten Bögershausen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).