* [PATCH v5 0/7] convert: add support for different encodings
2018-01-23 20:53 ` Junio C Hamano
@ 2018-01-29 20:18 ` tboegi
2018-01-29 20:18 ` [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() tboegi
` (6 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:18 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Torsten Bögershausen
From: Torsten Bögershausen <tboegi@web.de>
Take V4 from Lars, manually integrated the V2 squash patch,
so a review would be good.
Add my "comments" as a patch, see 7/7 (and this is more like an RFC)
This needs to go on top of tb/crlf-conv-flags
Lars Schneider (6):
strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
strbuf: add xstrdup_toupper()
utf8: add function to detect prohibited UTF-16/32 BOM
utf8: add function to detect a missing UTF-16/32 BOM
convert: add 'working-tree-encoding' attribute
convert: add tracing for 'working-tree-encoding' attribute
Torsten Bögershausen (1):
Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
Documentation/gitattributes.txt | 63 +++++++++++
convert.c | 233 ++++++++++++++++++++++++++++++++++++++-
convert.h | 1 +
sha1_file.c | 2 +-
strbuf.c | 13 ++-
strbuf.h | 1 +
t/t0028-working-tree-encoding.sh | 198 +++++++++++++++++++++++++++++++++
utf8.c | 37 +++++++
utf8.h | 25 +++++
9 files changed, 567 insertions(+), 6 deletions(-)
create mode 100755 t/t0028-working-tree-encoding.sh
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
2018-01-23 20:53 ` Junio C Hamano
2018-01-29 20:18 ` [PATCH v5 0/7] " tboegi
@ 2018-01-29 20:18 ` tboegi
2018-01-29 20:19 ` [PATCH v5 2/7] strbuf: add xstrdup_toupper() tboegi
` (5 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:18 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Lars Schneider, Torsten Bögershausen
From: Lars Schneider <larsxschneider@gmail.com>
Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.
Remove the unnecessary assignment.
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
strbuf.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/strbuf.c b/strbuf.c
index 8007be8fb..490f7850e 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
- result[i] = '\0';
return result;
}
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v5 2/7] strbuf: add xstrdup_toupper()
2018-01-23 20:53 ` Junio C Hamano
2018-01-29 20:18 ` [PATCH v5 0/7] " tboegi
2018-01-29 20:18 ` [PATCH v5 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() tboegi
@ 2018-01-29 20:19 ` tboegi
2018-01-29 20:19 ` [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
` (4 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Lars Schneider, Torsten Bögershausen
From: Lars Schneider <larsxschneider@gmail.com>
Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().
This function is used in a subsequent commit.
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
strbuf.c | 12 ++++++++++++
strbuf.h | 1 +
2 files changed, 13 insertions(+)
diff --git a/strbuf.c b/strbuf.c
index 490f7850e..a20af696b 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
}
+char *xstrdup_toupper(const char *string)
+{
+ char *result;
+ size_t len, i;
+
+ len = strlen(string);
+ result = xmallocz(len);
+ for (i = 0; i < len; i++)
+ result[i] = toupper(string[i]);
+ return result;
+}
+
char *xstrvfmt(const char *fmt, va_list ap)
{
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d6..df7ced53e 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
extern int fprintf_ln(FILE *fp, const char *fmt, ...);
char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
/**
* Create a newly allocated string using printf format. You can do this easily
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
2018-01-23 20:53 ` Junio C Hamano
` (2 preceding siblings ...)
2018-01-29 20:19 ` [PATCH v5 2/7] strbuf: add xstrdup_toupper() tboegi
@ 2018-01-29 20:19 ` tboegi
2018-01-29 20:19 ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
` (3 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Lars Schneider, Torsten Bögershausen
From: Lars Schneider <larsxschneider@gmail.com>
Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.
This function is used in a subsequent commit.
[1] http://unicode.org/faq/utf_bom.html#bom10
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
utf8.c | 24 ++++++++++++++++++++++++
utf8.h | 9 +++++++++
2 files changed, 33 insertions(+)
diff --git a/utf8.c b/utf8.c
index 2c27ce013..914881cd1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
}
#endif
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+ return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+ return (
+ (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+ has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+ ) || (
+ (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+ has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+ );
+}
+
/*
* Returns first character length in bytes for multi-byte `text` according to
* `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a8..4711429af 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
const char *s);
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
#endif
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
2018-01-23 20:53 ` Junio C Hamano
` (3 preceding siblings ...)
2018-01-29 20:19 ` [PATCH v5 3/7] utf8: add function to detect prohibited UTF-16/32 BOM tboegi
@ 2018-01-29 20:19 ` tboegi
2018-01-30 19:15 ` Junio C Hamano
2018-01-29 20:19 ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
` (2 subsequent siblings)
7 siblings, 1 reply; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Lars Schneider, Torsten Bögershausen
From: Lars Schneider <larsxschneider@gmail.com>
If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
has_missing_utf_bom() function returns true if a required BOM is
missing.
The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.
This function is used in a subsequent commit.
[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
utf8.c | 13 +++++++++++++
utf8.h | 16 ++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/utf8.c b/utf8.c
index 914881cd1..f033fec1c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
);
}
+int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+{
+ return (
+ !strcmp(enc, "UTF-16") &&
+ !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+ has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+ ) || (
+ !strcmp(enc, "UTF-32") &&
+ !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+ has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+ );
+}
+
/*
* Returns first character length in bytes for multi-byte `text` according to
* `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af..26b5e9185 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
*/
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+
#endif
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
2018-01-29 20:19 ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
@ 2018-01-30 19:15 ` Junio C Hamano
2018-01-30 20:58 ` Lars Schneider
0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 19:15 UTC (permalink / raw)
To: tboegi
Cc: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider, Lars Schneider
tboegi@web.de writes:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> If the endianness is not defined in the encoding name, then let's
> be strict and require a BOM to avoid any encoding confusion. The
> has_missing_utf_bom() function returns true if a required BOM is
> missing.
>
> The Unicode standard instructs to assume big-endian if there in no BOM
> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
> in HTML5 recommends to assume little-endian to "deal with deployed
> content" [3]. Strictly requiring a BOM seems to be the safest option
> for content in Git.
I do not have strong opinion on encoding such policy-ish behaviour
as our default, but am I alone to find that "has missing X" is a
confusing name for a helper function? "is missing X" (or "lacks
X") is a bit more understandable, I guess.
> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
> +{
> + return (
> + !strcmp(enc, "UTF-16") &&
> + !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
> + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
> + ) || (
> + !strcmp(enc, "UTF-32") &&
> + !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
> + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
> + );
> +}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
2018-01-30 19:15 ` Junio C Hamano
@ 2018-01-30 20:58 ` Lars Schneider
2018-01-30 21:54 ` Junio C Hamano
0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 20:58 UTC (permalink / raw)
To: Junio C Hamano
Cc: Torsten Bögershausen, Lars Schneider, Git List,
Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
> On 30 Jan 2018, at 20:15, Junio C Hamano <gitster@pobox.com> wrote:
>
> tboegi@web.de writes:
>
>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> If the endianness is not defined in the encoding name, then let's
>> be strict and require a BOM to avoid any encoding confusion. The
>> has_missing_utf_bom() function returns true if a required BOM is
>> missing.
>>
>> The Unicode standard instructs to assume big-endian if there in no BOM
>> for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
>> in HTML5 recommends to assume little-endian to "deal with deployed
>> content" [3]. Strictly requiring a BOM seems to be the safest option
>> for content in Git.
>
> I do not have strong opinion on encoding such policy-ish behaviour
> as our default, but am I alone to find that "has missing X" is a
> confusing name for a helper function? "is missing X" (or "lacks
> X") is a bit more understandable, I guess.
That might be a german/english translation thingy but I think I get
your point. "has" implies there is something and "missing" implies
there is nothing :)
"is_missing_utf_bom()" might be even a bit unspecific as UTF-8
is usually missing a UTF BOM but the function would still return
"false". Therefore, "is_missing_required_utf_bom()" might be
lengthy but should fit.
OK for you?
- Lars
>
>> +int has_missing_utf_bom(const char *enc, const char *data, size_t len)
>> +{
>> + return (
>> + !strcmp(enc, "UTF-16") &&
>> + !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> + ) || (
>> + !strcmp(enc, "UTF-32") &&
>> + !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> + );
>> +}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v5 4/7] utf8: add function to detect a missing UTF-16/32 BOM
2018-01-30 20:58 ` Lars Schneider
@ 2018-01-30 21:54 ` Junio C Hamano
0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 21:54 UTC (permalink / raw)
To: Lars Schneider
Cc: Torsten Bögershausen, Lars Schneider, Git List,
Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
Lars Schneider <larsxschneider@gmail.com> writes:
> "false". Therefore, "is_missing_required_utf_bom()" might be
> lengthy but should fit.
Thanks, sounds understandable a lot better than the original ;-)
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
2018-01-23 20:53 ` Junio C Hamano
` (4 preceding siblings ...)
2018-01-29 20:19 ` [PATCH v5 4/7] utf8: add function to detect a missing " tboegi
@ 2018-01-29 20:19 ` tboegi
2018-01-30 20:05 ` Junio C Hamano
2018-01-29 20:19 ` [PATCH v5 6/7] convert: add tracing for " tboegi
2018-01-29 20:19 ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
7 siblings, 1 reply; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Lars Schneider, Torsten Bögershausen
From: Lars Schneider <larsxschneider@gmail.com>
Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.
Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
Documentation/gitattributes.txt | 60 ++++++++++++
convert.c | 190 ++++++++++++++++++++++++++++++++++++-
convert.h | 1 +
sha1_file.c | 2 +-
t/t0028-working-tree-encoding.sh | 196 +++++++++++++++++++++++++++++++++++++++
5 files changed, 447 insertions(+), 2 deletions(-)
create mode 100755 t/t0028-working-tree-encoding.sh
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81..a8dbf4be3 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,66 @@ few exceptions. Even though...
catch potential problems early, safety triggers.
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files. All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attributes is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8 and stores the result in its internal data
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+ will checkout the respective files UTF-8 encoded and not in the
+ expected encoding. Consequently, these files will appear different
+ which typically causes trouble. This is in particular the case for
+ older Git versions and alternative Git implementations such as JGit
+ or libgit2 (as of January 2018).
+
+- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
+ errors as the conversion might not be round trip safe.
+
+- Reencoding content requires resources that might slow down certain
+ Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file in
+UTF-8 encoding and if you want Git to be able to process the content as
+text.
+
+Use the following attributes if your '*.txt' files are UTF-16 encoded
+with byte order mark (BOM) and you want Git to perform automatic line
+ending conversion based on your platform.
+
+------------------------
+*.txt text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.txt' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory.
+
+------------------------
+*.txt working-tree-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+
`ident`
^^^^^^^
diff --git a/convert.c b/convert.c
index b976eb968..0c372069b 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
#include "sigchain.h"
#include "pkt-line.h"
#include "sub-process.h"
+#include "utf8.h"
/*
* convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
}
+static struct encoding {
+ const char *name;
+ struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+ struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+ char *dst;
+ int dst_len;
+
+ /*
+ * No encoding is specified or there is nothing to encode.
+ * Tell the caller that the content was not modified.
+ */
+ if (!enc || (src && !src_len))
+ return 0;
+
+ /*
+ * Looks like we got called from "would_convert_to_git()".
+ * This means Git wants to know if it would encode (= modify!)
+ * the content. Let's answer with "yes", since an encoding was
+ * specified.
+ */
+ if (!buf && !src)
+ return 1;
+
+ if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+ const char *error_msg = _(
+ "BOM is prohibited for '%s' if encoded as %s");
+ const char *advise_msg = _(
+ "You told Git to treat '%s' as %s. A byte order mark "
+ "(BOM) is prohibited with this encoding. Either use "
+ "%.6s as working tree encoding or remove the BOM from the "
+ "file.");
+
+ advise(advise_msg, path, enc->name, enc->name, enc->name);
+ if (conv_flags & CONV_WRITE_OBJECT)
+ die(error_msg, path, enc->name);
+ else
+ error(error_msg, path, enc->name);
+
+
+ } else if (has_missing_utf_bom(enc->name, src, src_len)) {
+ const char *error_msg = _(
+ "BOM is required for '%s' if encoded as %s");
+ const char *advise_msg = _(
+ "You told Git to treat '%s' as %s. A byte order mark "
+ "(BOM) is required with this encoding. Either use "
+ "%sBE/%sLE as working tree encoding or add a BOM to the "
+ "file.");
+ advise(advise_msg, path, enc->name, enc->name, enc->name);
+ if (conv_flags & CONV_WRITE_OBJECT)
+ die(error_msg, path, enc->name);
+ else
+ error(error_msg, path, enc->name);
+ }
+
+ dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+ &dst_len);
+ if (!dst) {
+ /*
+ * We could add the blob "as-is" to Git. However, on checkout
+ * we would try to reencode to the original encoding. This
+ * would fail and we would leave the user with a messed-up
+ * working tree. Let's try to avoid this by screaming loud.
+ */
+ const char* msg = _("failed to encode '%s' from %s to %s");
+ if (conv_flags & CONV_WRITE_OBJECT)
+ die(msg, path, enc->name, default_encoding);
+ else
+ error(msg, path, enc->name, default_encoding);
+ }
+
+ /*
+ * UTF supports lossless round tripping [1]. UTF to other encoding are
+ * mostly round trip safe as Unicode aims to be a superset of all other
+ * character encodings. However, the SHIFT-JIS (Japanese character set)
+ * is an exception as some codes are not round trip safe [2].
+ *
+ * Reverse the transformation of 'dst' and check the result with 'src'
+ * if content is written to Git. This ensures no information is lost
+ * during conversion to/from UTF-8.
+ *
+ * Please note, the code below is not tested because I was not able to
+ * generate a faulty round trip without iconv error.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen2
+ * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+ */
+ if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
+ char *re_src;
+ int re_src_len;
+
+ re_src = reencode_string_len(dst, dst_len,
+ enc->name, default_encoding,
+ &re_src_len);
+
+ if (!re_src || src_len != re_src_len ||
+ memcmp(src, re_src, src_len)) {
+ const char* msg = _("encoding '%s' from %s to %s and "
+ "back is not the same");
+ if (conv_flags & CONV_WRITE_OBJECT)
+ die(msg, path, enc->name, default_encoding);
+ else
+ error(msg, path, enc->name, default_encoding);
+ }
+
+ free(re_src);
+ }
+
+ strbuf_attach(buf, dst, dst_len, dst_len + 1);
+ return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+ struct strbuf *buf, struct encoding *enc)
+{
+ char *dst;
+ int dst_len;
+
+ /*
+ * No encoding is specified or there is nothing to encode.
+ * Tell the caller that the content was not modified.
+ */
+ if (!enc || (src && !src_len))
+ return 0;
+
+ dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+ &dst_len);
+ if (!dst) {
+ error("failed to encode '%s' from %s to %s",
+ path, enc->name, default_encoding);
+ return 0;
+ }
+
+ strbuf_attach(buf, dst, dst_len, dst_len + 1);
+ return 1;
+}
+
static int crlf_to_git(const struct index_state *istate,
const char *path, const char *src, size_t len,
struct strbuf *buf,
@@ -978,6 +1120,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
return 1;
}
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+ const char *value = check->value;
+ struct encoding *enc;
+
+ if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+ !strlen(value))
+ return NULL;
+
+ for (enc = encoding; enc; enc = enc->next)
+ if (!strcasecmp(value, enc->name))
+ return enc;
+
+ /* Don't encode to the default encoding */
+ if (!strcasecmp(value, default_encoding))
+ return NULL;
+
+ enc = xcalloc(1, sizeof(struct convert_driver));
+ enc->name = xstrdup_toupper(value); /* aways use upper case names! */
+ *encoding_tail = enc;
+ encoding_tail = &(enc->next);
+
+ return enc;
+}
+
static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
{
const char *value = check->value;
@@ -1033,6 +1200,7 @@ struct conv_attrs {
enum crlf_action attr_action; /* What attr says */
enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
int ident;
+ struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
};
static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1209,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
if (!check) {
check = attr_check_initl("crlf", "ident", "filter",
- "eol", "text", NULL);
+ "eol", "text", "working-tree-encoding",
+ NULL);
user_convert_tail = &user_convert;
+ encoding_tail = &encoding;
git_config(read_convert_config, NULL);
}
@@ -1064,6 +1234,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
else if (eol_attr == EOL_CRLF)
ca->crlf_action = CRLF_TEXT_CRLF;
}
+ ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
} else {
ca->drv = NULL;
ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1315,13 @@ int convert_to_git(const struct index_state *istate,
src = dst->buf;
len = dst->len;
}
+
+ ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
+ if (ret && dst) {
+ src = dst->buf;
+ len = dst->len;
+ }
+
if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
if (ret && dst) {
@@ -1167,6 +1345,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
die("%s: clean filter '%s' failed", path, ca.drv->name);
+ encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
}
@@ -1198,6 +1377,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
}
}
+ ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+ if (ret) {
+ src = dst->buf;
+ len = dst->len;
+ }
+
ret_filter = apply_filter(
path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1849,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
return NULL;
+ if (ca.checkout_encoding)
+ return NULL;
+
if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
return NULL;
diff --git a/convert.h b/convert.h
index 65ab3e516..1d9539ed0 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
#define CONV_EOL_RNDTRP_WARN (1<<1) /* Warn if CRLF to LF to CRLF is different */
#define CONV_EOL_RENORMALIZE (1<<2) /* Convert CRLF to LF */
#define CONV_EOL_KEEP_CRLF (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT (1<<4) /* Content is written to the index */
extern int global_conv_flags_eol;
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada..e2f319d67 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
if (flags & HASH_RENORMALIZE)
return CONV_EOL_RENORMALIZE;
else if (flags & HASH_WRITE_OBJECT)
- return global_conv_flags_eol;
+ return global_conv_flags_eol | CONV_WRITE_OBJECT;
else
return 0;
}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 000000000..4d85b4277
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,196 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+ git config core.eol lf &&
+
+ text="hallo there!\ncan you read me?" &&
+ echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+ printf "$text" >test.utf8.raw &&
+ printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+ cp test.utf16.raw test.utf16 &&
+
+ git add .gitattributes test.utf16 &&
+ git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+ git cat-file -p :test.utf16 >test.utf16.git &&
+ test_cmp_bin test.utf8.raw test.utf16.git &&
+ rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+ rm test.utf16 &&
+ git checkout test.utf16 &&
+ test_cmp_bin test.utf16.raw test.utf16 &&
+
+ # cleanup
+ rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+ printf "\0a\0b\0c" >nobom.utf16be.raw &&
+ printf "a\0b\0c\0" >nobom.utf16le.raw &&
+ printf "\376\777\0a\0b\0c" >bebom.utf16be.raw &&
+ printf "\777\376a\0b\0c\0" >lebom.utf16le.raw &&
+
+ printf "\0\0\0a\0\0\0b\0\0\0c" >nobom.utf32be.raw &&
+ printf "a\0\0\0b\0\0\0c\0\0\0" >nobom.utf32le.raw &&
+ printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+ printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+ echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+ echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+ echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+ echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
+
+ # Here we add a UTF-16 files with BOM (big-endian and little-endian)
+ # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+ # the BOM is prohibited.
+ cp bebom.utf16be.raw bebom.utf16be &&
+ test_must_fail git add bebom.utf16be 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+ cp lebom.utf16le.raw lebom.utf16be &&
+ test_must_fail git add lebom.utf16be 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+ cp bebom.utf16be.raw bebom.utf16le &&
+ test_must_fail git add bebom.utf16le 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+ cp lebom.utf16le.raw lebom.utf16le &&
+ test_must_fail git add lebom.utf16le 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+ # ... and the same for UTF-32
+ cp bebom.utf32be.raw bebom.utf32be &&
+ test_must_fail git add bebom.utf32be 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+ cp lebom.utf32le.raw lebom.utf32be &&
+ test_must_fail git add lebom.utf32be 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+ cp bebom.utf32be.raw bebom.utf32le &&
+ test_must_fail git add bebom.utf32le 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+ cp lebom.utf32le.raw lebom.utf32le &&
+ test_must_fail git add lebom.utf32le 2>err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+ # cleanup
+ git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+ echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+
+ cp nobom.utf16be.raw nobom.utf16 &&
+ test_must_fail git add nobom.utf16 2>err.out &&
+ test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+ cp nobom.utf16le.raw nobom.utf16 &&
+ test_must_fail git add nobom.utf16 2>err.out &&
+ test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+ cp nobom.utf32be.raw nobom.utf32 &&
+ test_must_fail git add nobom.utf32 2>err.out &&
+ test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+ cp nobom.utf32le.raw nobom.utf32 &&
+ test_must_fail git add nobom.utf32 2>err.out &&
+ test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+ # cleanup
+ rm nobom.utf16 nobom.utf32 &&
+ git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+ printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+ printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+ cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+ cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+ cp crlf.utf16.raw eol.utf16 &&
+
+ git add eol.utf16 &&
+ git commit -m eol &&
+
+ # UTF-16 with CRLF (Windows line endings)
+ rm eol.utf16 &&
+ git -c core.eol=crlf checkout eol.utf16 &&
+ test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+ # UTF-16 with LF (Unix line endings)
+ rm eol.utf16 &&
+ git -c core.eol=lf checkout eol.utf16 &&
+ test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+ rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+ # cleanup
+ git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+ echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+ printf "nothing" >t.nothing &&
+ git add t.nothing &&
+
+ echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+ printf "garbage" >t.garbage &&
+ test_must_fail git add t.garbage 2>err.out &&
+ test_i18ngrep "fatal: failed to encode" err.out &&
+
+ # cleanup
+ rm err.out &&
+ git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+ BEFORE_STATE=$(git rev-parse HEAD) &&
+
+ # Skip the UTF-16 filter for the added file
+ # This simulates a Git version that has no working tree encoding support
+ echo "hallo" >nonsense.utf16 &&
+ TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+ git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+ COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+ git update-ref refs/heads/master $COMMIT &&
+
+ test_must_fail git checkout HEAD^ 2>err.out &&
+ test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+ # cleanup
+ rm err.out &&
+ git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+ BEFORE_STATE=$(git rev-parse HEAD) &&
+
+ # Skip the UTF-16 filter for the added file
+ # This simulates a Git version that has no checkoutEncoding support
+ cp nobom.utf16be.raw nonsense.utf16 &&
+ TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+ git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+ COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+ git update-ref refs/heads/master $COMMIT &&
+
+ git diff 2>err.out &&
+ test_i18ngrep "error: BOM is required" err.out &&
+
+ # cleanup
+ rm err.out &&
+ git reset --hard $BEFORE_STATE
+'
+
+test_done
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
2018-01-29 20:19 ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
@ 2018-01-30 20:05 ` Junio C Hamano
2018-01-30 20:31 ` Lars Schneider
0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 20:05 UTC (permalink / raw)
To: tboegi
Cc: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider, Lars Schneider
tboegi@web.de writes:
> + if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
> + char *re_src;
> + int re_src_len;
I think it is a bad idea to
(1) not check without CONV_WRITE_OBJECT here.
(2) hardcode SJIS and do this always and to SJIS alone.
For (1), a fix would be obvious (and that will resurrect the dead
code below).
For (2), perhaps introduce a multi-value configuration variable
core.checkRoundtripEncoding, whose default value consists of just
SJIS, but allow users to add or clear it?
> + re_src = reencode_string_len(dst, dst_len,
> + enc->name, default_encoding,
> + &re_src_len);
> +
> + if (!re_src || src_len != re_src_len ||
> + memcmp(src, re_src, src_len)) {
> + const char* msg = _("encoding '%s' from %s to %s and "
> + "back is not the same");
> + if (conv_flags & CONV_WRITE_OBJECT)
> + die(msg, path, enc->name, default_encoding);
> + else
> + error(msg, path, enc->name, default_encoding);
The "error" side of this inner if() is dead code, I think.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
2018-01-30 20:05 ` Junio C Hamano
@ 2018-01-30 20:31 ` Lars Schneider
2018-01-30 21:56 ` Junio C Hamano
0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 20:31 UTC (permalink / raw)
To: Junio C Hamano
Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
Jeff King, Ramsay Jones, Johannes.Schindelin
> On 30 Jan 2018, at 21:05, Junio C Hamano <gitster@pobox.com> wrote:
>
> tboegi@web.de writes:
>
>> + if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
>> + char *re_src;
>> + int re_src_len;
>
> I think it is a bad idea to
>
> (1) not check without CONV_WRITE_OBJECT here.
The idea is to perform the roundtrip check *only* if we
actually write to Git. In all other cases we don't care
if the encoding roundtrips.
"git checkout" is such a case where we don't care as
noted by Peff here:
https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/
Do you agree?
> (2) hardcode SJIS and do this always and to SJIS alone.
>
> ...
>
> For (2), perhaps introduce a multi-value configuration variable
> core.checkRoundtripEncoding, whose default value consists of just
> SJIS, but allow users to add or clear it?
Well, in that case I would make it simpler and make
core.checkRoundtripEncoding a boolean that applies to all encodings
if enabled. We could make even simpler than that by removing the entire
roundtrip check. The thing is, I was not able to come up with a
sequence that would not generate a iconv error *and* not round trip.
Would that be ok for you to remove all that roundtrip checking code?
>> + re_src = reencode_string_len(dst, dst_len,
>> + enc->name, default_encoding,
>> + &re_src_len);
>> +
>> + if (!re_src || src_len != re_src_len ||
>> + memcmp(src, re_src, src_len)) {
>> + const char* msg = _("encoding '%s' from %s to %s and "
>> + "back is not the same");
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(msg, path, enc->name, default_encoding);
>> + else
>> + error(msg, path, enc->name, default_encoding);
>
> The "error" side of this inner if() is dead code, I think.
Good catch. I think this code should go away if we keep the roundtrip
code and you agree with my statement above.
Thanks a lot for the review,
Lars
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
2018-01-30 20:31 ` Lars Schneider
@ 2018-01-30 21:56 ` Junio C Hamano
2018-01-31 19:12 ` Lars Schneider
0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-01-30 21:56 UTC (permalink / raw)
To: Lars Schneider
Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
Jeff King, Ramsay Jones, Johannes.Schindelin
Lars Schneider <larsxschneider@gmail.com> writes:
>> On 30 Jan 2018, at 21:05, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> tboegi@web.de writes:
>>
>>> + if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
>>> + char *re_src;
>>> + int re_src_len;
>>
>> I think it is a bad idea to
>>
>> (1) not check without CONV_WRITE_OBJECT here.
>
> The idea is to perform the roundtrip check *only* if we
> actually write to Git. In all other cases we don't care
> if the encoding roundtrips.
>
> "git checkout" is such a case where we don't care as
> noted by Peff here:
> https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/
>
> Do you agree?
I am not sure why this is special cased and other codepaths have "if
WRITE_OBJECT then die, otherwise error" checks, so no, I do not
agree with your reasoning, at least not yet.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
2018-01-30 21:56 ` Junio C Hamano
@ 2018-01-31 19:12 ` Lars Schneider
2018-01-31 22:45 ` Junio C Hamano
0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-31 19:12 UTC (permalink / raw)
To: Junio C Hamano
Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
Jeff King, Ramsay Jones, Johannes.Schindelin
> On 30 Jan 2018, at 22:56, Junio C Hamano <gitster@pobox.com> wrote:
>
> Lars Schneider <larsxschneider@gmail.com> writes:
>
>>> On 30 Jan 2018, at 21:05, Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> tboegi@web.de writes:
>>>
>>>> + if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
>>>> + char *re_src;
>>>> + int re_src_len;
>>>
>>> I think it is a bad idea to
>>>
>>> (1) not check without CONV_WRITE_OBJECT here.
>>
>> The idea is to perform the roundtrip check *only* if we
>> actually write to Git. In all other cases we don't care
>> if the encoding roundtrips.
>>
>> "git checkout" is such a case where we don't care as
>> noted by Peff here:
>> https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/
>>
>> Do you agree?
>
> I am not sure why this is special cased and other codepaths have "if
> WRITE_OBJECT then die, otherwise error" checks, so no, I do not
> agree with your reasoning, at least not yet.
The convert_to_git()/encode_to_git() machinery is used in two different
kinds of code paths:
Some code paths actually write to the Git database (indicated by the
CONV_WRITE_OBJECT flag). I consider these the "critical/important" code
paths and I don't want to tolerate any encoding errors in these cases as
the errors would be "forever" in the Git database. That's why I call
die() on errors for these cases to abort whatever we are doing.
Other code paths do not write to the Git database (e.g. during "git
checkout" we use the code to ensure that we are moving away from the
exact state that we think we are moving away). In these code paths I am
less concerned about encoding errors. I also don't want to abort the
operation (e.g. "git checkout") in these cases. That's why I only inform
the user about the problem with an error message.
The encoding round-trip check can be expensive. That's why I decided
initially to only execute the check in the "critical/important"
write-to-Git-database situations (CONV_WRITE_OBJECT flag!). I also
decided to run it only if the "SHIFT-JIS" encoding is used as this was
the only encoding that I could find which reportedly does not round-trip
with UTF-8 (although I was not able to replicate the round-trip
problems).
I want to change the current implementation as follows:
I want to check the round-trip encoding only if the the environment
variable "GIT_WORKING_TREE_ENCODING_ROUNDTRIP_CHECK" is set. This way
a user can check the round-trip if necessary for *any* encoding. I
don't want to make it a git config because that setting should only
rarely be used for debugging purposes.
Performing the round-trip check every time is not necessary from my
point of view because it can be expensive and I was not able to generate
a test case which *does not* round-trip without triggering any other
iconv error.
- Lars
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute
2018-01-31 19:12 ` Lars Schneider
@ 2018-01-31 22:45 ` Junio C Hamano
0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-01-31 22:45 UTC (permalink / raw)
To: Lars Schneider
Cc: Torsten Bögershausen, Git List, Johannes Sixt, Eric Sunshine,
Jeff King, Ramsay Jones, Johannes.Schindelin
Lars Schneider <larsxschneider@gmail.com> writes:
>> I am not sure why this is special cased and other codepaths have "if
>> WRITE_OBJECT then die, otherwise error" checks, so no, I do not
>> agree with your reasoning, at least not yet.
>
> The convert_to_git()/encode_to_git() machinery is used in two different
> kinds of code paths:
>
> Some code paths actually write to the Git database (indicated by the
> CONV_WRITE_OBJECT flag). I consider these the "critical/important" code
> paths and I don't want to tolerate any encoding errors in these cases as
> the errors would be "forever" in the Git database. That's why I call
> die() on errors for these cases to abort whatever we are doing.
>
> Other code paths do not write to the Git database (e.g. during "git
> checkout" we use the code to ensure that we are moving away from the
> exact state that we think we are moving away). In these code paths I am
> less concerned about encoding errors. I also don't want to abort the
> operation (e.g. "git checkout") in these cases. That's why I only inform
> the user about the problem with an error message.
Warning the users early while they are doing non-writing
operation to give them chance to adjust the contents, before they
actually need to register the contents as objects by writing, at
which point we need to die. That's a reasonable distinction and all
of that I already agree with.
What was questionable and left unexplained was why this roundtrip
thing needs to be different.
> The encoding round-trip check can be expensive. That's why I decided
> initially to only execute the check in the "critical/important"
> write-to-Git-database situations (CONV_WRITE_OBJECT flag!). I also
> decided to run it only if the "SHIFT-JIS" encoding is used as this was
> the only encoding that I could find which reportedly does not round-trip
> with UTF-8 (although I was not able to replicate the round-trip
> problems).
I still do not see why you have problems with the approach of
maintaining a configurable set of "iffy" encodings (and throw SJIS
into the default list) to achieve all of the above and more. For
SJIS users, instead of having to set environment variables to obtain
safe behaviour, they automatically get safe behaviour. When using
encodings that are not problematic, they do not need to spend cycles
checking round-trip. And when SJIS users know they do not care about
roundtrip checks, they can just configure SJIS away from the list.
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v5 6/7] convert: add tracing for 'working-tree-encoding' attribute
2018-01-23 20:53 ` Junio C Hamano
` (5 preceding siblings ...)
2018-01-29 20:19 ` [PATCH v5 5/7] convert: add 'working-tree-encoding' attribute tboegi
@ 2018-01-29 20:19 ` tboegi
2018-01-29 20:19 ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
7 siblings, 0 replies; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Lars Schneider, Torsten Bögershausen
From: Lars Schneider <larsxschneider@gmail.com>
Add the GIT_TRACE_CHECKOUT_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
convert.c | 28 ++++++++++++++++++++++++++++
t/t0028-working-tree-encoding.sh | 2 ++
2 files changed, 30 insertions(+)
diff --git a/convert.c b/convert.c
index 0c372069b..13fad490c 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
}
+static void trace_encoding(const char *context, const char *path,
+ const char *encoding, const char *buf, size_t len)
+{
+ static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+ struct strbuf trace = STRBUF_INIT;
+ int i;
+
+ strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+ for (i = 0; i < len && buf; ++i) {
+ strbuf_addf(
+ &trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+ i,
+ (unsigned char) buf[i],
+ (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+ ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+ );
+ }
+ strbuf_addchars(&trace, '\n', 1);
+
+ trace_strbuf(&coe, &trace);
+ strbuf_release(&trace);
+}
+
static struct encoding {
const char *name;
struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
error(error_msg, path, enc->name);
}
+ trace_encoding("source", path, enc->name, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc->name,
&dst_len);
if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
else
error(msg, path, enc->name, default_encoding);
}
+ trace_encoding("destination", path, default_encoding, dst, dst_len);
/*
* UTF supports lossless round tripping [1]. UTF to other encoding are
@@ -365,6 +390,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
enc->name, default_encoding,
&re_src_len);
+ trace_encoding("reencoded source", path, enc->name,
+ re_src, re_src_len);
+
if (!re_src || src_len != re_src_len ||
memcmp(src, re_src, src_len)) {
const char* msg = _("encoding '%s' from %s to %s and "
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 4d85b4277..0f36d4990 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
. ./test-lib.sh
+GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+
test_expect_success 'setup test repo' '
git config core.eol lf &&
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-23 20:53 ` Junio C Hamano
` (6 preceding siblings ...)
2018-01-29 20:19 ` [PATCH v5 6/7] convert: add tracing for " tboegi
@ 2018-01-29 20:19 ` tboegi
2018-01-30 11:23 ` Lars Schneider
7 siblings, 1 reply; 43+ messages in thread
From: tboegi @ 2018-01-29 20:19 UTC (permalink / raw)
To: lars.schneider, git, j6t, sunshine, peff, ramsay,
Johannes.Schindelin, --to=larsxschneider
Cc: Torsten Bögershausen
From: Torsten Bögershausen <tboegi@web.de>
UTF-16 encoded files are treated as "binary" by Git, and no CRLF
conversion is done.
When the UTF-16 encoded files are converted into UF-8 using the new
"working-tree-encoding", the CRLF are converted if core.autocrlf is true.
This may lead to confusion:
A tool writes an UTF-16 encoded file with CRLF.
The file is commited with core.autocrlf=true, the CLRF are converted into LF.
The repo is pushed somewhere and cloned by a different user, who has
decided to use core.autocrlf=false.
He uses the same tool, and now the CRLF are not there as expected, but LF,
make the file useless for the tool.
Avoid this (possible) confusion by ignoring core.autocrlf for all files
which have "working-tree-encoding" defined.
The user can still use a .gitattributes file and specify the line endings
like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
file travel together with push and clone.
Change convert.c to e more careful, simplify the initialization when
attributes are retrived (and none are specified) and update the documentation.
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
Documentation/gitattributes.txt | 9 ++++++---
convert.c | 15 ++++++++++++---
2 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a8dbf4be3..3665c4677 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
UTF-8 encoding and if you want Git to be able to process the content as
text.
+Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
+Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
+
Use the following attributes if your '*.txt' files are UTF-16 encoded
-with byte order mark (BOM) and you want Git to perform automatic line
-ending conversion based on your platform.
+with byte order mark (BOM) and you want Git to perform line
+ending conversion based on core.eol.
------------------------
-*.txt text working-tree-encoding=UTF-16
+*.txt working-tree-encoding=UTF-16 text
------------------------
Use the following attributes if your '*.txt' files are UTF-16 little
diff --git a/convert.c b/convert.c
index 13fad490c..e7f11d1db 100644
--- a/convert.c
+++ b/convert.c
@@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
}
ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
} else {
- ca->drv = NULL;
- ca->crlf_action = CRLF_UNDEFINED;
- ca->ident = 0;
+ memset(ca, 0, sizeof(*ca));
}
/* Save attr and make a decision for action */
ca->attr_action = ca->crlf_action;
if (ca->crlf_action == CRLF_TEXT)
ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
+ /*
+ * Often UTF-16 encoded files are read and written by programs which
+ * really need CRLF, and it is important to keep the CRLF "as is" when
+ * files are committed with core.autocrlf=true and the repo is pushed.
+ * The CRLF would be converted into LF when the repo is cloned to
+ * a machine with core.autocrlf=false.
+ * Obey the "text" and "eol" attributes and be independent on the
+ * local core.autocrlf for all "encoded" files.
+ */
+ if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
+ ca->crlf_action = CRLF_BINARY;
if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
ca->crlf_action = CRLF_BINARY;
if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
--
2.16.0.rc0.2.g64d3e4d0cc.dirty
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-29 20:19 ` [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding tboegi
@ 2018-01-30 11:23 ` Lars Schneider
2018-01-30 14:40 ` Torsten Bögershausen
0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 11:23 UTC (permalink / raw)
To: tboegi
Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
> On 29 Jan 2018, at 21:19, tboegi@web.de wrote:
>
> From: Torsten Bögershausen <tboegi@web.de>
>
> UTF-16 encoded files are treated as "binary" by Git, and no CRLF
> conversion is done.
> When the UTF-16 encoded files are converted into UF-8 using the new
s/UF-8/UTF-8/
> "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
>
> This may lead to confusion:
> A tool writes an UTF-16 encoded file with CRLF.
> The file is commited with core.autocrlf=true, the CLRF are converted into LF.
> The repo is pushed somewhere and cloned by a different user, who has
> decided to use core.autocrlf=false.
> He uses the same tool, and now the CRLF are not there as expected, but LF,
> make the file useless for the tool.
>
> Avoid this (possible) confusion by ignoring core.autocrlf for all files
> which have "working-tree-encoding" defined.
Maybe I don't understand your use case but I think this will generate even
more confusion because that's not what I would expect as a user. I think Git
should behave consistently independent of the used encoding. Here are my arguments:
(1) Legacy users are *not* affected. If you don't use the "working-tree-encoding"
attribute then nothing changes for you.
(2) If you use the "working-tree-encoding" attribute *and* you want to ensure
your file keeps CRLF then you can define that in the attributes too. E.g.:
*.proj textworking-tree-encoding=UTF-16 eol=crlf
- Lars
> The user can still use a .gitattributes file and specify the line endings
> like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
> file travel together with push and clone.
>
> Change convert.c to e more careful, simplify the initialization when
> attributes are retrived (and none are specified) and update the documentation.
>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
> Documentation/gitattributes.txt | 9 ++++++---
> convert.c | 15 ++++++++++++---
> 2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index a8dbf4be3..3665c4677 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.
>
> +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
> +Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
> +
> Use the following attributes if your '*.txt' files are UTF-16 encoded
> -with byte order mark (BOM) and you want Git to perform automatic line
> -ending conversion based on your platform.
> +with byte order mark (BOM) and you want Git to perform line
> +ending conversion based on core.eol.
>
> ------------------------
> -*.txt text working-tree-encoding=UTF-16
> +*.txt working-tree-encoding=UTF-16 text
> ------------------------
>
> Use the following attributes if your '*.txt' files are UTF-16 little
> diff --git a/convert.c b/convert.c
> index 13fad490c..e7f11d1db 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
> }
> ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
> } else {
> - ca->drv = NULL;
> - ca->crlf_action = CRLF_UNDEFINED;
> - ca->ident = 0;
> + memset(ca, 0, sizeof(*ca));
> }
>
> /* Save attr and make a decision for action */
> ca->attr_action = ca->crlf_action;
> if (ca->crlf_action == CRLF_TEXT)
> ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
> + /*
> + * Often UTF-16 encoded files are read and written by programs which
> + * really need CRLF, and it is important to keep the CRLF "as is" when
> + * files are committed with core.autocrlf=true and the repo is pushed.
> + * The CRLF would be converted into LF when the repo is cloned to
> + * a machine with core.autocrlf=false.
> + * Obey the "text" and "eol" attributes and be independent on the
> + * local core.autocrlf for all "encoded" files.
> + */
> + if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
> + ca->crlf_action = CRLF_BINARY;
> if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
> ca->crlf_action = CRLF_BINARY;
> if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
> --
> 2.16.0.rc0.2.g64d3e4d0cc.dirty
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-30 11:23 ` Lars Schneider
@ 2018-01-30 14:40 ` Torsten Bögershausen
2018-01-30 15:14 ` Lars Schneider
0 siblings, 1 reply; 43+ messages in thread
From: Torsten Bögershausen @ 2018-01-30 14:40 UTC (permalink / raw)
To: Lars Schneider
Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote:
>
> > On 29 Jan 2018, at 21:19, tboegi@web.de wrote:
> >
> > From: Torsten Bögershausen <tboegi@web.de>
> >
> > UTF-16 encoded files are treated as "binary" by Git, and no CRLF
> > conversion is done.
> > When the UTF-16 encoded files are converted into UF-8 using the new
> s/UF-8/UTF-8/
>
>
> > "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
> >
> > This may lead to confusion:
> > A tool writes an UTF-16 encoded file with CRLF.
> > The file is commited with core.autocrlf=true, the CLRF are converted into LF.
> > The repo is pushed somewhere and cloned by a different user, who has
> > decided to use core.autocrlf=false.
> > He uses the same tool, and now the CRLF are not there as expected, but LF,
> > make the file useless for the tool.
> >
> > Avoid this (possible) confusion by ignoring core.autocrlf for all files
> > which have "working-tree-encoding" defined.
>
> Maybe I don't understand your use case but I think this will generate even
> more confusion because that's not what I would expect as a user. I think Git
> should behave consistently independent of the used encoding. Here are my arguments:
To start with: I have probably seen too many repos with CRLF messed up.
>
> (1) Legacy users are *not* affected. If you don't use the "working-tree-encoding"
> attribute then nothing changes for you.
People who don't use "working-tree-encoding" are not affected,
I never ment to state that.
I am thinking about people who use "working-tree-encoding" without thinking
about line endings.
Or the ones that have in mind that core.autocrlf=true will leave the
line endings for UTF-16 encoded files as is, but that changes as soon as they
are converted into UTF-8 and the "auto" check is now done
-after- the conversion. I would find that confusing.
>
> (2) If you use the "working-tree-encoding" attribute *and* you want to ensure
> your file keeps CRLF then you can define that in the attributes too. E.g.:
>
> *.proj textworking-tree-encoding=UTF-16 eol=crlf
That is a good one.
If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
make much more sense in Documentation/gitattributes that *.tx
There no text files encoded in UTF-16 wich are called xxx.txt, but those
are non-ideal examples. *.proj makes good sense as an example.
>
> - Lars
>
>
>
> > The user can still use a .gitattributes file and specify the line endings
> > like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
> > file travel together with push and clone.
> >
> > Change convert.c to e more careful, simplify the initialization when
> > attributes are retrived (and none are specified) and update the documentation.
> >
> > Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> > ---
> > Documentation/gitattributes.txt | 9 ++++++---
> > convert.c | 15 ++++++++++++---
> > 2 files changed, 18 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> > index a8dbf4be3..3665c4677 100644
> > --- a/Documentation/gitattributes.txt
> > +++ b/Documentation/gitattributes.txt
> > @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in
> > UTF-8 encoding and if you want Git to be able to process the content as
> > text.
> >
> > +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
> > +Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
> > +
> > Use the following attributes if your '*.txt' files are UTF-16 encoded
> > -with byte order mark (BOM) and you want Git to perform automatic line
> > -ending conversion based on your platform.
> > +with byte order mark (BOM) and you want Git to perform line
> > +ending conversion based on core.eol.
> >
> > ------------------------
> > -*.txt text working-tree-encoding=UTF-16
> > +*.txt working-tree-encoding=UTF-16 text
> > ------------------------
> >
> > Use the following attributes if your '*.txt' files are UTF-16 little
> > diff --git a/convert.c b/convert.c
> > index 13fad490c..e7f11d1db 100644
> > --- a/convert.c
> > +++ b/convert.c
> > @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
> > }
> > ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
> > } else {
> > - ca->drv = NULL;
> > - ca->crlf_action = CRLF_UNDEFINED;
> > - ca->ident = 0;
> > + memset(ca, 0, sizeof(*ca));
> > }
> >
> > /* Save attr and make a decision for action */
> > ca->attr_action = ca->crlf_action;
> > if (ca->crlf_action == CRLF_TEXT)
> > ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT;
> > + /*
> > + * Often UTF-16 encoded files are read and written by programs which
> > + * really need CRLF, and it is important to keep the CRLF "as is" when
> > + * files are committed with core.autocrlf=true and the repo is pushed.
> > + * The CRLF would be converted into LF when the repo is cloned to
> > + * a machine with core.autocrlf=false.
> > + * Obey the "text" and "eol" attributes and be independent on the
> > + * local core.autocrlf for all "encoded" files.
> > + */
> > + if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
> > + ca->crlf_action = CRLF_BINARY;
> > if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
> > ca->crlf_action = CRLF_BINARY;
> > if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
> > --
> > 2.16.0.rc0.2.g64d3e4d0cc.dirty
> >
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-30 14:40 ` Torsten Bögershausen
@ 2018-01-30 15:14 ` Lars Schneider
2018-01-31 17:28 ` Torsten Bögershausen
0 siblings, 1 reply; 43+ messages in thread
From: Lars Schneider @ 2018-01-30 15:14 UTC (permalink / raw)
To: Torsten Bögershausen
Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
> On 30 Jan 2018, at 15:40, Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote:
>>
>>> On 29 Jan 2018, at 21:19, tboegi@web.de wrote:
>>>
>>> From: Torsten Bögershausen <tboegi@web.de>
>>>
>>> UTF-16 encoded files are treated as "binary" by Git, and no CRLF
>>> conversion is done.
>>> When the UTF-16 encoded files are converted into UF-8 using the new
>> s/UF-8/UTF-8/
>>
>>
>>> "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
>>>
>>> This may lead to confusion:
>>> A tool writes an UTF-16 encoded file with CRLF.
>>> The file is commited with core.autocrlf=true, the CLRF are converted into LF.
>>> The repo is pushed somewhere and cloned by a different user, who has
>>> decided to use core.autocrlf=false.
>>> He uses the same tool, and now the CRLF are not there as expected, but LF,
>>> make the file useless for the tool.
>>>
>>> Avoid this (possible) confusion by ignoring core.autocrlf for all files
>>> which have "working-tree-encoding" defined.
>>
>> Maybe I don't understand your use case but I think this will generate even
>> more confusion because that's not what I would expect as a user. I think Git
>> should behave consistently independent of the used encoding. Here are my arguments:
>
> To start with: I have probably seen too many repos with CRLF messed up.
>
>>
>> (1) Legacy users are *not* affected. If you don't use the "working-tree-encoding"
>> attribute then nothing changes for you.
>
> People who don't use "working-tree-encoding" are not affected,
> I never ment to state that.
>
> I am thinking about people who use "working-tree-encoding" without thinking
> about line endings.
> Or the ones that have in mind that core.autocrlf=true will leave the
> line endings for UTF-16 encoded files as is, but that changes as soon as they
> are converted into UTF-8 and the "auto" check is now done
> -after- the conversion. I would find that confusing.
>
>>
>> (2) If you use the "working-tree-encoding" attribute *and* you want to ensure
>> your file keeps CRLF then you can define that in the attributes too. E.g.:
>>
>> *.proj textworking-tree-encoding=UTF-16 eol=crlf
>
> That is a good one.
> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
> make much more sense in Documentation/gitattributes that *.tx
> There no text files encoded in UTF-16 wich are called xxx.txt, but those
> are non-ideal examples. *.proj makes good sense as an example.
OK, I'll do that. Would that fix the problem which this patch tries to address for you?
(I would also explicitly add a paragraph to discuss line endings)
- Lars
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-30 15:14 ` Lars Schneider
@ 2018-01-31 17:28 ` Torsten Bögershausen
2018-01-31 19:37 ` Lars Schneider
2018-02-02 19:17 ` Junio C Hamano
0 siblings, 2 replies; 43+ messages in thread
From: Torsten Bögershausen @ 2018-01-31 17:28 UTC (permalink / raw)
To: Lars Schneider
Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
[]
> > That is a good one.
> > If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
> > make much more sense in Documentation/gitattributes that *.tx
> > There no text files encoded in UTF-16 wich are called xxx.txt, but those
> > are non-ideal examples. *.proj makes good sense as an example.
>
> OK, I'll do that. Would that fix the problem which this patch tries to address for you?
> (I would also explicitly add a paragraph to discuss line endings)
Please let me see the patch first, before I can have a comment.
But back to the more general question:
How should Git handle the line endings of UTF-16 files in the woring-tree,
which are UTF-8 in the index?
There are 2 opposite opionions/user expectations here:
a) They are binary in the working tree, so git should leave the line endings
as is. (Unless specified otherwise in the .attributes file)
b) They are text files in the index. Git will convert line endings
if core.autocrlf is true (or the .gitattributes file specifies "-text")
My feeling is that both arguments are valid, so let's ask for opinions
and thoughts of others.
Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody:
What do yo think ?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-31 17:28 ` Torsten Bögershausen
@ 2018-01-31 19:37 ` Lars Schneider
2018-02-02 19:17 ` Junio C Hamano
1 sibling, 0 replies; 43+ messages in thread
From: Lars Schneider @ 2018-01-31 19:37 UTC (permalink / raw)
To: Torsten Bögershausen
Cc: Git List, Johannes Sixt, Eric Sunshine, Jeff King, Ramsay Jones,
Johannes.Schindelin
> On 31 Jan 2018, at 18:28, Torsten Bögershausen <tboegi@web.de> wrote:
>
> []
>>> That is a good one.
>>> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
>>> make much more sense in Documentation/gitattributes that *.tx
>>> There no text files encoded in UTF-16 wich are called xxx.txt, but those
>>> are non-ideal examples. *.proj makes good sense as an example.
>>
>> OK, I'll do that. Would that fix the problem which this patch tries to address for you?
>> (I would also explicitly add a paragraph to discuss line endings)
>
> Please let me see the patch first, before I can have a comment.
Sure! I'll have it ready tomorrow.
> But back to the more general question:
>
> How should Git handle the line endings of UTF-16 files in the woring-tree,
> which are UTF-8 in the index?
>
>
> There are 2 opposite opionions/user expectations here:
>
> a) They are binary in the working tree, so git should leave the line endings
> as is. (Unless specified otherwise in the .attributes file)
Well, if you consider your UTF-16 files binary then you would not change
*anything*. You would not enable the new "working-tree-encoding" attribute.
As a consequence, Git's behavior would not change. Git would leave all line
endings as they are for the UTF-16 files.
> b) They are text files in the index. Git will convert line endings
> if core.autocrlf is true (or the .gitattributes file specifies "-text")
This would *only* happen if you enable the new "working-tree-encoding"
attribute. In this case a user has already made the conscious decision to
treat these files as text files. Therefore, the user expects Git to handle
them in the same way other text files are handled.
> My feeling is that both arguments are valid, so let's ask for opinions
> and thoughts of others.
> Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody:
> What do yo think ?
- Lars
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-01-31 17:28 ` Torsten Bögershausen
2018-01-31 19:37 ` Lars Schneider
@ 2018-02-02 19:17 ` Junio C Hamano
2018-02-07 6:31 ` Torsten Bögershausen
1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2018-02-02 19:17 UTC (permalink / raw)
To: Torsten Bögershausen
Cc: Lars Schneider, Git List, Johannes Sixt, Eric Sunshine, Jeff King,
Ramsay Jones, Johannes.Schindelin
Torsten Bögershausen <tboegi@web.de> writes:
> There are 2 opposite opionions/user expectations here:
>
> a) They are binary in the working tree, so git should leave the line endings
> as is. (Unless specified otherwise in the .attributes file)
> ...
> b) They are text files in the index. Git will convert line endings
> if core.autocrlf is true (or the .gitattributes file specifies "-text")
I sense that you seem to be focusing on the distinction between "in
the working tree" vs "in the index" while contrasting. The "binary
vs text" in your "binary in wt, text in index" is based on the
default heuristics without any input from end-users or the project
that uses Git that happens to contain such files. If the users and
the project that uses Git want to treat contents in a path as text,
it is text even when it is (re-)encoded to UTF-16, no?
Such files may be (mis)classified as binary with the default
heuristics when there is no help from what is written in the
.gitattributes file, but here we are talking about the case where
the user explicitly tells us it is in UTF-16, right? Is there such a
thing as UTF-16 binary?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-02-02 19:17 ` Junio C Hamano
@ 2018-02-07 6:31 ` Torsten Bögershausen
2018-02-07 18:12 ` Junio C Hamano
0 siblings, 1 reply; 43+ messages in thread
From: Torsten Bögershausen @ 2018-02-07 6:31 UTC (permalink / raw)
To: Junio C Hamano
Cc: Lars Schneider, Git List, Johannes Sixt, Eric Sunshine, Jeff King,
Ramsay Jones, Johannes.Schindelin
On Fri, Feb 02, 2018 at 11:17:04AM -0800, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > There are 2 opposite opionions/user expectations here:
> >
> > a) They are binary in the working tree, so git should leave the line endings
> > as is. (Unless specified otherwise in the .attributes file)
> > ...
> > b) They are text files in the index. Git will convert line endings
> > if core.autocrlf is true (or the .gitattributes file specifies "-text")
>
> I sense that you seem to be focusing on the distinction between "in
> the working tree" vs "in the index" while contrasting. The "binary
> vs text" in your "binary in wt, text in index" is based on the
> default heuristics without any input from end-users or the project
> that uses Git that happens to contain such files. If the users and
> the project that uses Git want to treat contents in a path as text,
> it is text even when it is (re-)encoded to UTF-16, no?
>
> Such files may be (mis)classified as binary with the default
> heuristics when there is no help from what is written in the
> .gitattributes file, but here we are talking about the case where
> the user explicitly tells us it is in UTF-16, right? Is there such a
> thing as UTF-16 binary?
I don't think so, by definiton UTF-16 is ment to be text.
(this means that git ls-files --eol needs some update, I can have a look)
Do we agree that UTF-16 is text ?
If yes, could Git assume that the "text" attribute is set when
working-tree-encoding is set ?
I would even go a step further and demand that the user -must- make a decision
about the line endings for working-tree-encoded files:
working-tree-encoding=UTF-16 # illegal, die()
working-tree-encoding=UTF-16 text=auto # illegal, die()
working-tree-encoding=UTF-16 -text # no eol conversion
working-tree-encoding=UTF-16 text # eol according to core.eol
working-tree-encoding=UTF-16 text eol=lf # LF
working-tree-encoding=UTF-16 text eol=crlf # CRLF
What do you think ?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
2018-02-07 6:31 ` Torsten Bögershausen
@ 2018-02-07 18:12 ` Junio C Hamano
0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2018-02-07 18:12 UTC (permalink / raw)
To: Torsten Bögershausen
Cc: Lars Schneider, Git List, Johannes Sixt, Eric Sunshine, Jeff King,
Ramsay Jones, Johannes.Schindelin
Torsten Bögershausen <tboegi@web.de> writes:
>> the user explicitly tells us it is in UTF-16, right? Is there such a
>> thing as UTF-16 binary?
>
> I don't think so, by definiton UTF-16 is ment to be text.
> (this means that git ls-files --eol needs some update, I can have a look)
>
> Do we agree that UTF-16 is text ?
> If yes, could Git assume that the "text" attribute is set when
> working-tree-encoding is set ?
These are two different questions. It seems that between the two of
us, we established that "UTF-16 binary" is a nonsense, and a path
that is given working-tree-encoding=UTF-16 must be treated as
holding a text file. But that does not have direct relevance to the
other question you are asking: "is a question 'does this path have
text attribute set?' be answered with 'yes' if the path has wte
attribute set to UTF-16?" I think the answer to that latter
question ought to be "no".
By the way, a related tangent is if it makes sense to give
working-tree-encoding to anything that is binary, regardless of the
value---I am inclined to say it is not, so the issue here is not "by
definition UTF-16 is text", but "any path that has wte set to some
enconding should be treated the same way as if the path also has
text attribute set as far as convert machinery is concerned.".
^ permalink raw reply [flat|nested] 43+ messages in thread