git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: lars.schneider@autodesk.com
To: git@vger.kernel.org
Cc: gitster@pobox.com, tboegi@web.de, peff@peff.net,
	patrick@luehne.de, Lars Schneider <larsxschneider@gmail.com>
Subject: [PATCH v1] convert: add support for 'encoding' attribute
Date: Mon, 11 Dec 2017 16:50:23 +0100	[thread overview]
Message-ID: <20171211155023.1405-1-lars.schneider@autodesk.com> (raw)

From: Lars Schneider <larsxschneider@gmail.com>

Git and its tools (e.g. git diff) expect all text files in UTF-8
encoding. Git will happily accept content in all other encodings, too,
but it might not be able to process the text (e.g. viewing diffs or
changing line endings).

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Reviewed-by: Patrick Lühne <patrick@luehne.de>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---

Hi,

here is a WIP patch to add text encoding support for files encoded with
something other than UTF-8 [RFC].

The 'encoding' attribute is already used to view blobs in gitk. That
could be a problem as the content is stored in Git with the defined
encoding. This patch would interpret the content as UTF-8 encoded and
it would try to reencode it to the defined encoding on checkout.
Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251").
These folks would see errors like these with my patch:
    error: failed to encode 'foo.bar' from utf-8 to cp1251

A quick search on GitHub reveals 2,722 repositories that use the
'encoding' attribute [1]. Using the GitHub API [2], I identified the
following encodings in all publicly accessible repositories:

    ANSI        <-- garbage (ignore by my implementation)
    cp1251
    cp866
    cp950
    iso8859-1
    koi8-r
    shiftjis    <-- garbage (ignore by my implementation)
    UTF-8       <-- unnecessary (ignore by my implementation)
    utf8        <-- garbage (ignore by my implementation)

TODOs:
    - The iconv docs mention that "errno == EINVAL" is harmless but
      we don't handle that case in utf8.c:reencode_string_iconv()
    - Git does not compile with NO_ICONV=1 right now because of
      compat/precompose_utf8.c. I will send a patch to fix that.

Questions:
    - Should I use warning() or error() in the patch?
      (currently I use the warning)
    - Do you agree with the approach in general?

Thanks,
Lars


[RFC] http://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com
  [1] https://github.com/search?p=1&q=encoding+filename%3Agitattributes&type=Code&utf8=%E2%9C%93
  [2] curl --user larsxschneider:secret -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | uniq
  [3] https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples




Notes:
    Base Ref: master
    Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d
    Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && git checkout afc9e88a4d

 Documentation/gitattributes.txt |  27 ++++++
 convert.c                       | 192 +++++++++++++++++++++++++++++++++++++++-
 t/t0028-encoding.sh             |  60 +++++++++++++
 3 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100755 t/t0028-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..84de2fa49c 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -146,6 +146,33 @@ Unspecified::
 Any other value causes Git to act as if `text` has been left
 unspecified.

+`encoding`
+^^^^^^^^^^
+
+By default Git assumes UTF-8 encoding for text files.  The `encoding`
+attribute sets the encoding to be used in the working directory.
+If the path is added to the index, then Git encodes the content to
+UTF-8.  On checkout the content is encoded back to the original
+encoding.  Consequently, you can use all built-in Git text processing
+tools (e.g. git diff, line ending conversions, etc.) with your
+non-UTF-8 encoded file.
+
+Please note that re-encoding content can cause errors and requires
+resources. Use the `encoding` attribute only if you cannot store
+a file in UTF-8 encoding and if you want Git to be able to process
+the text.
+
+------------------------
+*.txt		text encoding=UTF-16
+------------------------
+
+All `iconv` encodings with a stable round-trip conversion to and from
+UTF-8 are supported.  You can see a full list with the following command:
+
+------------------------
+iconv --list
+------------------------
+
 `eol`
 ^^^^^

diff --git a/convert.c b/convert.c
index 20d7ab67bd..ee19c17104 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"

 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,

 }

+#ifdef NO_ICONV
+#ifndef _ICONV_T
+/* The type is just a placeholder and not actually used. */
+typedef void* iconv_t;
+#endif
+#endif
+
+static struct encoding {
+	const char *name;
+	iconv_t to_git;       /* conversion to Git's canonical encoding (UTF-8) */
+	iconv_t to_worktree;  /* conversion to user-defined encoding */
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "utf-8";
+static iconv_t invalid_conversion = (iconv_t)-1;
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc)
+{
+#ifndef NO_ICONV
+	char *dst, *re_src;
+	int dst_len, re_src_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (enc->to_git == invalid_conversion) {
+		enc->to_git = iconv_open(default_encoding, encoding->name);
+		if (enc->to_git == invalid_conversion)
+			warning(_("unsupported encoding %s"), encoding->name);
+	}
+
+	if (enc->to_worktree == invalid_conversion)
+		enc->to_worktree = iconv_open(encoding->name, default_encoding);
+
+	/*
+	 * Do nothing if the encoding is not supported. This could happen in
+	 * two cases:
+	 *   (1) The encoding is garbage. That is no problem as we would not
+	 *       encode the content to UTF-8 on "add" and we would not encode
+	 *       it back on "checkout".
+	 *   (2) Git users use different iconv versions that support different
+	 *       encodings. This could lead to problems, as the content might
+	 *       not be encoded on "add" but encoded back on "checkout" (or
+	 *       the other way around).
+	 * We print a one-time warning to the user in both cases above.
+	 */
+	if (enc->to_git == invalid_conversion || enc->to_worktree == invalid_conversion)
+		return 0;
+
+	dst = reencode_string_iconv(src, src_len, enc->to_git, &dst_len);
+	if (!dst)
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		die(_("failed to encode '%s' from %s to %s"),
+			path, enc->name, default_encoding);
+
+	/*
+	 * Encode dst back to ensure no information is lost. This wastes
+	 * a few cycles as most conversions are round trip conversion
+	 * safe. However, content that has an invalid encoding might not
+	 * match its original byte sequence after the UTF-8 conversion
+	 * round trip. Let's play safe here and check the round trip
+	 * conversion.
+	 */
+	re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, &re_src_len);
+	if (!re_src || strcmp(src, re_src)) {
+		die(_("encoding '%s' from %s to %s and back is not the same"),
+			path, enc->name, default_encoding);
+	}
+	free(re_src);
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+#else
+	warning(_("cannot encode '%s' from %s to %s because "
+		"your Git was not compiled with encoding support"),
+		path, enc->name, default_encoding);
+	return 0;
+#endif
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+#ifndef NO_ICONV
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	if (enc->to_worktree == invalid_conversion) {
+		enc->to_worktree = iconv_open(encoding->name, default_encoding);
+		if (enc->to_worktree == invalid_conversion)
+			warning("unsupported encoding %s", encoding->name);
+	}
+
+	/*
+	 * Do nothing if the encoding is not supported.
+	 * See detailed explanation in encode_to_git().
+	 */
+	if (enc->to_worktree == invalid_conversion)
+		return 0;
+
+	dst = reencode_string_iconv(src, src_len, enc->to_worktree, &dst_len);
+	if (!dst) {
+		warning("failed to encode '%s' from %s to %s",
+			path, default_encoding, encoding->name);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+#else
+	warning(_("cannot encode '%s' from %s to %s because "
+		"your Git was not compiled with encoding support"),
+		path, default_encoding, encoding->name);
+	return 0;
+#endif
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -969,6 +1113,32 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }

+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_UNSET(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(struct convert_driver));
+	enc->name = xstrdup(value);
+	enc->to_git = invalid_conversion;
+	enc->to_worktree = invalid_conversion;
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1024,6 +1194,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *encoding; /* Supported encoding or default encoding if NULL */
 };

 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1032,8 +1203,9 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)

 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "encoding", NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}

@@ -1055,6 +1227,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1135,6 +1308,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.encoding);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
 		if (ret && dst) {
@@ -1158,6 +1338,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);

+	encode_to_git(path, dst->buf, dst->len, dst, ca.encoding);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1189,6 +1370,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}

+	ret |= encode_to_worktree(path, src, len, dst, ca.encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1655,6 +1842,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;

+	if (ca.encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;

diff --git a/t/t0028-encoding.sh b/t/t0028-encoding.sh
new file mode 100755
index 0000000000..24f24f9e0d
--- /dev/null
+++ b/t/t0028-encoding.sh
@@ -0,0 +1,60 @@
+#!/bin/sh
+
+test_description='encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+
+	text="hallo there!\ncan you read me?" &&
+
+	echo "*.utf16 text encoding=utf-16" >.gitattributes &&
+	printf "$text" >t.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >t.utf16.raw &&
+	cp t.utf16.raw t.utf16 &&
+
+	git add .gitattributes t.utf16.raw t.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :t.utf16 >t.utf16.git &&
+	test_cmp_bin t.utf8.raw t.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm t.utf16 &&
+	git checkout t.utf16 &&
+	test_cmp_bin t.utf16.raw t.utf16
+'
+
+test_expect_success 'warn if an unsupported encoding is used' '
+	echo "*.garbage text encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	git add t.garbage 2>error.out &&
+	test_i18ngrep "warning: unsupported encoding" error.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'fail if files with invalid encoding are added' '
+	printf "\0\0h\0a" >error.utf16 &&
+	# The test string encoding would fail
+	# test_must_fail iconv -f utf-16 -t utf-8 error.utf16 &&
+	test_must_fail git add error.utf16
+'
+
+# Some sequences might trigger errno == E2BIG in reencode_string_iconv, utf.8.
+# This would cause no error on "git add" and, consequently, the Git internal
+# UTF-8 encoded blob would contain garbage. Hence, the worktree file after a
+# checkout would contain garbage, too. This garbage would not match the file
+# that was initially added.
+test_expect_success 'fail if encoding from X to UTF-8 and back to X is not the same' '
+	printf "\xc3\x28" >error.utf16 &&
+	# The test string re-encoding would fail
+	# iconv -f utf-16 -t utf-8 error.utf16 | iconv -f utf-8 -t utf-16 &&
+	test_must_fail git add error.utf16
+'
+
+test_done

base-commit: 95ec6b1b3393eb6e26da40c565520a8db9796e9f
--
2.15.1


             reply	other threads:[~2017-12-11 15:50 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-11 15:50 lars.schneider [this message]
2017-12-11 18:39 ` [PATCH v1] convert: add support for 'encoding' attribute Eric Sunshine
2017-12-11 23:47   ` Lars Schneider
2017-12-11 23:58     ` Eric Sunshine
2017-12-12 10:58       ` Lars Schneider
2017-12-11 20:47 ` Johannes Sixt
2017-12-11 23:42   ` Lars Schneider
2017-12-12  0:59     ` Junio C Hamano
2017-12-12  7:15       ` Johannes Sixt
2017-12-12 10:55         ` Lars Schneider
2017-12-12 19:31           ` Junio C Hamano
2017-12-13 17:57             ` Lars Schneider
2017-12-13 18:11               ` Junio C Hamano
2017-12-13 23:02                 ` Lars Schneider
2017-12-14 23:01                   ` Junio C Hamano
2017-12-12  7:09     ` Johannes Sixt
2017-12-18 10:13   ` Torsten Bögershausen
2017-12-18 13:12     ` Jeff King
2017-12-23  8:08       ` Torsten Bögershausen
2017-12-29 13:28       ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
2017-12-29 13:28       ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
2017-12-29 13:28       ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
2018-02-26 18:43         ` Peter Krefting
2018-02-27 22:39         ` Jeff King
2017-12-18 18:02     ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
2017-12-18 21:55     ` Johannes Sixt
2017-12-15  9:58 ` Jeff King
2017-12-18 10:54   ` Lars Schneider
2017-12-18 12:59     ` Jeff King
2017-12-17 17:14 ` Torsten Bögershausen
2017-12-28 16:14   ` Lars Schneider
2017-12-29 12:59     ` Torsten Bögershausen
2017-12-29 13:56       ` Lars Schneider
2018-01-03 19:15       ` Junio C Hamano
2018-01-03 20:45         ` Lars Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171211155023.1405-1-lars.schneider@autodesk.com \
    --to=lars.schneider@autodesk.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=larsxschneider@gmail.com \
    --cc=patrick@luehne.de \
    --cc=peff@peff.net \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).