* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
2019-02-09 20:08 ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
@ 2019-02-10 1:45 ` Eric Sunshine
2019-02-10 18:14 ` brian m. carlson
2019-02-10 8:04 ` Torsten Bögershausen
` (3 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: Eric Sunshine @ 2019-02-10 1:45 UTC (permalink / raw)
To: brian m. carlson
Cc: Git List, Lars Schneider, Rich Felker, Junio C Hamano,
Kevin Daudt
On Sat, Feb 9, 2019 at 3:08 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> [...]
> Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.
The name ICONV_NEEDS_BOM makes it sound as if we must feed a BOM
_into_ 'iconv', which is quite confusing since the actual intention is
that 'iconv' doesn't emit a BOM and we need to make up for the
deficiency. Using a name such as ICONV_OMITS_BOM or ICONV_NEGLECTS_BOM
makes it somewhat clearer that there is some deficiency with which we
need to deal.
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/Makefile b/Makefile
> @@ -259,6 +259,9 @@ all::
> +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32.
Not a big deal, but I wonder if it would be helpful to tack on "...,
in which case it outputs big-endian unconditionally." or something.
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
> +test_lazy_prereq NO_UTF16_BOM '
> + test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> + test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> + test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
> + iconv -f UTF-8 -t UTF-16
> +
> +}
Stray blank line before the closing brace.
> +
> +write_utf32 () {
> + test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
> + iconv -f UTF-8 -t UTF-32
> +}
It probably doesn't matter much with these two tiny functions, but I
was wondering if it would make sense to maintain the &&-chain, perhaps
like this:
if test_have_prereq NO_UTF32_BOM
then
printf '\x00\x00\xfe\xff'
fi &&
iconv -f UTF-8 -t UTF-32
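
To make the exit-status difference concrete, here is a minimal sketch
that runs outside Git's test harness. It assumes a hypothetical
have_prereq helper standing in for test_have_prereq, and it writes the
BOM with octal escapes because POSIX printf does not specify \x
escapes. Only the BOM-writing step is shown; the iconv call would
follow after the "&&".

```shell
# Hypothetical stand-in for test_have_prereq: succeeds only if the
# named prerequisite appears in the space-separated $PREREQS list.
have_prereq () {
	case " $PREREQS " in
	*" $1 "*) return 0 ;;
	*) return 1 ;;
	esac
}

# Short-circuit form: returns non-zero when the prerequisite is absent,
# so a trailing "&&" would skip whatever follows it.
bom_short_circuit () {
	have_prereq NO_UTF32_BOM && printf '\000\000\376\377'
}

# if/fi form: an "if" whose condition is false and which has no "else"
# exits with status zero, so it can sit in the middle of an &&-chain.
bom_if_form () {
	if have_prereq NO_UTF32_BOM
	then
		printf '\000\000\376\377'
	fi
}

# With no prerequisites set, only the if/fi form reports success.
PREREQS=""
bom_short_circuit || echo "short-circuit form: non-zero exit"
bom_if_form && echo "if/fi form: zero exit"
```

With the prerequisite set, both forms emit the same four BOM bytes; they
differ only in what they return when it is not set.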
* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
2019-02-10 1:45 ` Eric Sunshine
@ 2019-02-10 18:14 ` brian m. carlson
0 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-10 18:14 UTC (permalink / raw)
To: Eric Sunshine
Cc: Git List, Lars Schneider, Rich Felker, Junio C Hamano,
Kevin Daudt
On Sat, Feb 09, 2019 at 08:45:16PM -0500, Eric Sunshine wrote:
> On Sat, Feb 9, 2019 at 3:08 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > [...]
> > Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> > iconv implementation has this behavior. When set, Git will write a BOM
> > manually for UTF-16 and UTF-32 and then force the data to be written in
> > UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> > tests use the raw "UTF-16" encoding, which will be big-endian when the
> > implementation requires this knob to be set.
>
> The name ICONV_NEEDS_BOM makes it sound as if we must feed a BOM
> _into_ 'iconv', which is quite confusing since the actual intention is
> that 'iconv' doesn't emit a BOM and we need to make up for the
> deficiency. Using a name such as ICONV_OMITS_BOM or ICONV_NEGLECTS_BOM
> makes it somewhat clearer that there is some deficiency with which we
> need to deal.
That does sound like a better name.
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> > diff --git a/Makefile b/Makefile
> > @@ -259,6 +259,9 @@ all::
> > +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> > +# byte-order mark (BOM) when writing UTF-16 or UTF-32.
>
> Not a big deal, but I wonder if it would be helpful to tack on "...,
> in which case it outputs big-endian unconditionally." or something.
Sure, I can add that.
> Stray blank line before the closing brace.
Will fix.
> It probably doesn't matter much with these two tiny functions, but I
> was wondering if it would make sense to maintain the &&-chain, perhaps
> like this:
>
> if test_have_prereq NO_UTF32_BOM
> then
> printf '\x00\x00\xfe\xff'
> fi &&
> iconv -f UTF-8 -t UTF-32
Yeah, that sounds like a good idea.
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
2019-02-09 20:08 ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
2019-02-10 1:45 ` Eric Sunshine
@ 2019-02-10 8:04 ` Torsten Bögershausen
2019-02-10 18:55 ` brian m. carlson
2019-02-11 0:23 ` [PATCH v2] " brian m. carlson
` (2 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: Torsten Bögershausen @ 2019-02-10 8:04 UTC (permalink / raw)
To: brian m. carlson
Cc: git, larsxschneider, Rich Felker, Junio C Hamano, Kevin Daudt
On Sat, Feb 09, 2019 at 08:08:01PM +0000, brian m. carlson wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
>
> Most systems' iconv implementations choose to write it with a BOM in
> some endianness, since this is the most foolproof, and it is resistant
> to misinterpretation on Windows, where UTF-16 and the little-endian
> serialization are very common. For compatibility with Windows and to
> avoid accidental misuse there, Git always wants to write UTF-16 with a
> BOM, and will refuse to read UTF-16 without it.
>
> However, musl's iconv implementation writes UTF-16 without a BOM,
> relying on the user to interpret it as big-endian. This causes t0028 and
> the related functionality to fail, since Git won't read the file without
> a BOM.
>
> Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.
>
> Update the tests to detect this case and write test data with an added
> BOM if necessary. Always write the BOM in the tests in big-endian
> format, since all iconv implementations that omit a BOM must use
> big-endian serialization according to the Unicode standard.
>
> Preserve the existing behavior for systems which do not have this knob
> enabled, since they may use optimized implementations, including
> defaulting to the native endianness, to gain improved performance, which
> can be significant with large checkouts.
Is this based on measurements on a real system?
I think we agree that Git will write UTF-16 always as big endian with BOM,
following the tradition of iconv/libiconv.
If yes, we can reduce the lines of code/#ifdefs somewhat, have the knob always on,
and reduce the maintenance burden a little bit, giving a simpler patch.
What do you think ?
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..ef19c98e67 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -13,7 +13,8 @@ test_expect_success 'setup test files' '
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
printf "$text" >test.utf8.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+ printf "\376\377" >test.utf16.raw &&
+ printf "$text" | iconv -f UTF-8 -t UTF-16BE >>test.utf16.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
printf "\377\376" >test.utf16lebom.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..d3731273be 100644
--- a/utf8.c
+++ b/utf8.c
@@ -564,7 +564,8 @@ char *reencode_string_len(const char *in, size_t insz,
bom_str = utf16_le_bom;
bom_len = sizeof(utf16_le_bom);
out_encoding = "UTF-16LE";
- } else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
+ } else if (same_utf_encoding("UTF-16BE-BOM", out_encoding) ||
+ same_utf_encoding("UTF-16", out_encoding)) {
bom_str = utf16_be_bom;
bom_len = sizeof(utf16_be_bom);
out_encoding = "UTF-16BE";
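
The idea behind the diff above can be tried on the command line: ask
iconv for an explicit byte order, which is generally written without a
BOM, and prepend the big-endian BOM by hand. This is only a sketch;
the helper name write_utf16_be_bom is hypothetical and not part of the
proposed change.

```shell
# Hypothetical helper: convert UTF-8 on stdin to UTF-16 with a
# big-endian BOM, without relying on what plain "UTF-16" means locally.
write_utf16_be_bom () {
	printf '\376\377' &&
	iconv -f UTF-8 -t UTF-16BE
}

# Dump the result as hex bytes to inspect the BOM and the payload.
printf 'hi' | write_utf16_be_bom | od -An -tx1
```

Because the byte order is named explicitly, the output should not
depend on whether the local iconv defaults to native or big-endian
order for plain "UTF-16".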
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> Makefile | 6 ++++++
> t/t0028-working-tree-encoding.sh | 25 ++++++++++++++++++++++---
> utf8.c | 10 ++++++++++
> 3 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 571160a2c4..b2a4765e5f 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -259,6 +259,9 @@ all::
> # Define OLD_ICONV if your library has an old iconv(), where the second
> # (input buffer pointer) parameter is declared with type (const char **).
> #
> +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32.
> +#
> # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
> #
> # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
> @@ -1415,6 +1418,9 @@ ifndef NO_ICONV
> EXTLIBS += $(ICONV_LINK) -liconv
> endif
> endif
> +ifdef ICONV_NEEDS_BOM
> + BASIC_CFLAGS += -DICONV_NEEDS_BOM
> +endif
> ifdef NEEDS_LIBGEN
> EXTLIBS += -lgen
> endif
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index e58ecbfc44..bfc4a9d4dd 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
>
> GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
>
> +test_lazy_prereq NO_UTF16_BOM '
> + test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> + test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> + test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
> + iconv -f UTF-8 -t UTF-16
> +
> +}
> +
> +write_utf32 () {
> + test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
> + iconv -f UTF-8 -t UTF-32
> +}
> +
> test_expect_success 'setup test files' '
> git config core.eol lf &&
>
> @@ -13,8 +32,8 @@ test_expect_success 'setup test files' '
> echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
> echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
> printf "$text" >test.utf8.raw &&
> - printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> - printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
> + printf "$text" | write_utf16 >test.utf16.raw &&
> + printf "$text" | write_utf32 >test.utf32.raw &&
> printf "\377\376" >test.utf16lebom.raw &&
> printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
>
> @@ -223,7 +242,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
>
> text="hallo there!\nroundtrip test here!" &&
> printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> - printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> + printf "$text" | write_utf16 >roundtrip.utf16 &&
> echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
>
> # SHIFT-JIS encoded files are round-trip checked by default...
> diff --git a/utf8.c b/utf8.c
> index 83824dc2f4..133199de0e 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
> bom_str = utf16_be_bom;
> bom_len = sizeof(utf16_be_bom);
> out_encoding = "UTF-16BE";
> +#ifdef ICONV_NEEDS_BOM
> + } else if (same_utf_encoding("UTF-16", out_encoding)) {
> + bom_str = utf16_be_bom;
> + bom_len = sizeof(utf16_be_bom);
> + out_encoding = "UTF-16BE";
> + } else if (same_utf_encoding("UTF-32", out_encoding)) {
> + bom_str = utf32_be_bom;
> + bom_len = sizeof(utf32_be_bom);
> + out_encoding = "UTF-32BE";
> +#endif
> }
>
> conv = iconv_open(out_encoding, in_encoding);
* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
2019-02-10 8:04 ` Torsten Bögershausen
@ 2019-02-10 18:55 ` brian m. carlson
2019-02-11 17:14 ` Junio C Hamano
0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-10 18:55 UTC (permalink / raw)
To: Torsten Bögershausen
Cc: git, larsxschneider, Rich Felker, Junio C Hamano, Kevin Daudt
On Sun, Feb 10, 2019 at 08:04:13AM +0000, Torsten Bögershausen wrote:
> On Sat, Feb 09, 2019 at 08:08:01PM +0000, brian m. carlson wrote:
> > Preserve the existing behavior for systems which do not have this knob
> > enabled, since they may use optimized implementations, including
> > defaulting to the native endianness, to gain improved performance, which
> > can be significant with large checkouts.
>
> Is this based on measurements on a real system?
No, I haven't done any performance measurements. However, swapping bytes
is a (IIRC 1-cycle) instruction on x86, which would be executed for each
iteration of the loop. My intuition tells me that will be a significant
expense when there are a lot of files, but I can omit that phrase since
I haven't measured.
> I think we agree that Git will write UTF-16 always as big endian with BOM,
> following the tradition of iconv/libiconv.
> If yes, we can reduce the lines of code/#ifdefs somewhat, have the knob always on,
> and reduce the maintenance burden a little bit, giving a simpler patch.
No, I don't think it will. libiconv will always write big-endian, but
glibc has a separate iconv implementation which writes the native
endianness. (I believe FreeBSD's does the same thing as glibc's.) I
think it's useful for us to know that we can handle UTF-16 using the
system behavior where possible, since that's what the system is going to
produce.
> What do you think ?
While I like the simplicity of the approach, as I mentioned above, and I
did consider this originally, I'd rather test the behavior of the system
we're operating on, provided it's suitable for our needs.
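
One way to see which of these behaviors a given system has is to look
at the first two bytes iconv emits for plain "UTF-16". A rough sketch
follows; classify_utf16_bom is a hypothetical helper, not something in
Git or its test suite.

```shell
# Hypothetical helper: read a UTF-16 stream on stdin and report which
# BOM, if any, it starts with.  Only the first two bytes are examined.
classify_utf16_bom () {
	case "$(od -An -tx1 -N2 | tr -d ' \n')" in
	feff) echo "big-endian BOM" ;;
	fffe) echo "little-endian BOM" ;;
	*) echo "no BOM" ;;
	esac
}

# What does the local iconv do?  The answer depends on the platform.
printf abc | iconv -f UTF-8 -t UTF-16 | classify_utf16_bom
```

The last line is the platform-dependent part; the helper itself only
inspects bytes and behaves the same everywhere.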
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
2019-02-10 18:55 ` brian m. carlson
@ 2019-02-11 17:14 ` Junio C Hamano
0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2019-02-11 17:14 UTC (permalink / raw)
To: brian m. carlson
Cc: Torsten Bögershausen, git, larsxschneider, Rich Felker,
Kevin Daudt
"brian m. carlson" <sandals@crustytoothpaste.net> writes:
> On Sun, Feb 10, 2019 at 08:04:13AM +0000, Torsten Bögershausen wrote:
>
>> I think we agree that Git will write UTF-16 always as big endian with BOM,
>> following the tradition of iconv/libiconv.
>> If yes, we can reduce the lines of code/#ifdefs somewhat, have the knob always on,
>> and reduce the maintenance burden a little bit, giving a simpler patch.
>
> No, I don't think it will. libiconv will always write big-endian, but
> glibc has a separate iconv implementation which writes the native
> endianness. (I believe FreeBSD's does the same thing as glibc's.) I
> think it's useful for us to know that we can handle UTF-16 using the
> system behavior where possible, since that's what the system is going to
> produce.
>
>> What do you think ?
>
> While I like the simplicity of the approach, as I mentioned above, and I
> did consider this originally, I'd rather test the behavior of the system
> we're operating on, provided it's suitable for our needs.
I see both sides of the argument, and each has its merit.
Let's go with the "follow the platform" and make sure the decision
is documented somewhere in the resulting code.
Thanks, all.
* [PATCH v2] utf8: handle systems that don't write BOM for UTF-16
2019-02-09 20:08 ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
2019-02-10 1:45 ` Eric Sunshine
2019-02-10 8:04 ` Torsten Bögershausen
@ 2019-02-11 0:23 ` brian m. carlson
2019-02-11 1:16 ` Eric Sunshine
2019-02-11 1:26 ` [PATCH v3] " brian m. carlson
2019-02-12 0:52 ` [PATCH v4] " brian m. carlson
4 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-11 0:23 UTC (permalink / raw)
To: git
Cc: Lars Schneider, Junio C Hamano, Eric Sunshine,
Torsten Bögershausen, Rich Felker, Kevin Daudt
When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.
Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.
However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.
Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.
Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.
Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
Makefile | 6 ++++++
t/t0028-working-tree-encoding.sh | 25 ++++++++++++++++++++++---
utf8.c | 10 ++++++++++
3 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/Makefile b/Makefile
index 571160a2c4..b2a4765e5f 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,9 @@ all::
# Define OLD_ICONV if your library has an old iconv(), where the second
# (input buffer pointer) parameter is declared with type (const char **).
#
+# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32.
+#
# Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
#
# Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1418,9 @@ ifndef NO_ICONV
EXTLIBS += $(ICONV_LINK) -liconv
endif
endif
+ifdef ICONV_NEEDS_BOM
+ BASIC_CFLAGS += -DICONV_NEEDS_BOM
+endif
ifdef NEEDS_LIBGEN
EXTLIBS += -lgen
endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..bfc4a9d4dd 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+test_lazy_prereq NO_UTF16_BOM '
+ test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+ test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+ test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
+ iconv -f UTF-8 -t UTF-16
+
+}
+
+write_utf32 () {
+ test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
+ iconv -f UTF-8 -t UTF-32
+}
+
test_expect_success 'setup test files' '
git config core.eol lf &&
@@ -13,8 +32,8 @@ test_expect_success 'setup test files' '
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
printf "$text" >test.utf8.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+ printf "$text" | write_utf16 >test.utf16.raw &&
+ printf "$text" | write_utf32 >test.utf32.raw &&
printf "\377\376" >test.utf16lebom.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
@@ -223,7 +242,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
text="hallo there!\nroundtrip test here!" &&
printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+ printf "$text" | write_utf16 >roundtrip.utf16 &&
echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..133199de0e 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
bom_str = utf16_be_bom;
bom_len = sizeof(utf16_be_bom);
out_encoding = "UTF-16BE";
+#ifdef ICONV_NEEDS_BOM
+ } else if (same_utf_encoding("UTF-16", out_encoding)) {
+ bom_str = utf16_be_bom;
+ bom_len = sizeof(utf16_be_bom);
+ out_encoding = "UTF-16BE";
+ } else if (same_utf_encoding("UTF-32", out_encoding)) {
+ bom_str = utf32_be_bom;
+ bom_len = sizeof(utf32_be_bom);
+ out_encoding = "UTF-32BE";
+#endif
}
conv = iconv_open(out_encoding, in_encoding);
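
The two lazy prerequisites in the test script rest on simple
arithmetic: "abc" is three characters, so UTF-16 without a BOM is
3 * 2 = 6 bytes (8 with the 2-byte BOM) and UTF-32 without a BOM is
3 * 4 = 12 bytes (16 with the 4-byte BOM). The sketch below checks
those numbers with the explicit big-endian encodings, which are
assumed here to be written without a BOM; encoded_size is a
hypothetical helper.

```shell
# Count the bytes iconv produces for "abc" in a given target encoding.
encoded_size () {
	printf abc | iconv -f UTF-8 -t "$1" | wc -c
}

# Without a BOM: three characters times the code-unit width.
test $(encoded_size UTF-16BE) -eq 6 && echo "UTF-16BE: 6 bytes, no BOM"
test $(encoded_size UTF-32BE) -eq 12 && echo "UTF-32BE: 12 bytes, no BOM"

# Plain UTF-16 and UTF-32 are platform-dependent: 6 and 12 mean the BOM
# was omitted, 8 and 16 mean it was written.
echo "UTF-16: $(encoded_size UTF-16) bytes"
echo "UTF-32: $(encoded_size UTF-32) bytes"
```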
* Re: [PATCH v2] utf8: handle systems that don't write BOM for UTF-16
2019-02-11 0:23 ` [PATCH v2] " brian m. carlson
@ 2019-02-11 1:16 ` Eric Sunshine
2019-02-11 1:20 ` brian m. carlson
0 siblings, 1 reply; 30+ messages in thread
From: Eric Sunshine @ 2019-02-11 1:16 UTC (permalink / raw)
To: brian m. carlson
Cc: Git List, Lars Schneider, Junio C Hamano,
Torsten Bögershausen, Rich Felker, Kevin Daudt
On Sun, Feb 10, 2019 at 7:23 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Premature git-send-email invocation? The commit message of v2 seems to
be a bit different from v1, but the patch itself is identical.
* Re: [PATCH v2] utf8: handle systems that don't write BOM for UTF-16
2019-02-11 1:16 ` Eric Sunshine
@ 2019-02-11 1:20 ` brian m. carlson
0 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-11 1:20 UTC (permalink / raw)
To: Eric Sunshine
Cc: Git List, Lars Schneider, Junio C Hamano,
Torsten Bögershausen, Rich Felker, Kevin Daudt
On Sun, Feb 10, 2019 at 08:16:26PM -0500, Eric Sunshine wrote:
> On Sun, Feb 10, 2019 at 7:23 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > When serializing UTF-16 (and UTF-32), there are three possible ways to
> > write the stream. One can write the data with a BOM in either big-endian
> > or little-endian format, or one can write the data without a BOM in
> > big-endian format.
> > [...]
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
>
> Premature git-send-email invocation? The commit message of v2 seems to
> be a bit different from v1, but the patch itself is identical.
Oof, I forgot to run "git add -u" before running "git commit --amend".
Thanks for catching this; I'll send out a v3 in a second.
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
* [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
2019-02-09 20:08 ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
` (2 preceding siblings ...)
2019-02-11 0:23 ` [PATCH v2] " brian m. carlson
@ 2019-02-11 1:26 ` brian m. carlson
2019-02-11 21:43 ` Kevin Daudt
2019-02-12 0:52 ` [PATCH v4] " brian m. carlson
4 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-11 1:26 UTC (permalink / raw)
To: git
Cc: Lars Schneider, Junio C Hamano, Eric Sunshine,
Torsten Bögershausen, Rich Felker, Kevin Daudt
When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.
Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.
However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.
Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.
Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.
Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
Makefile | 7 +++++++
t/t0028-working-tree-encoding.sh | 30 +++++++++++++++++++++++++++---
utf8.c | 10 ++++++++++
3 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/Makefile b/Makefile
index 571160a2c4..c6172450af 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,10 @@ all::
# Define OLD_ICONV if your library has an old iconv(), where the second
# (input buffer pointer) parameter is declared with type (const char **).
#
+# Define ICONV_OMITS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
+# big-endian format.
+#
# Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
#
# Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1419,9 @@ ifndef NO_ICONV
EXTLIBS += $(ICONV_LINK) -liconv
endif
endif
+ifdef ICONV_OMITS_BOM
+ BASIC_CFLAGS += -DICONV_OMITS_BOM
+endif
ifdef NEEDS_LIBGEN
EXTLIBS += -lgen
endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..8936ba6757 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+test_lazy_prereq NO_UTF16_BOM '
+ test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+ test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+ if test_have_prereq NO_UTF16_BOM
+ then
+ printf '\xfe\xff'
+ fi &&
+ iconv -f UTF-8 -t UTF-16
+}
+
+write_utf32 () {
+ if test_have_prereq NO_UTF32_BOM
+ then
+ printf '\x00\x00\xfe\xff'
+ fi &&
+ iconv -f UTF-8 -t UTF-32
+}
+
test_expect_success 'setup test files' '
git config core.eol lf &&
@@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
printf "$text" >test.utf8.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+ printf "$text" | write_utf16 >test.utf16.raw &&
+ printf "$text" | write_utf32 >test.utf32.raw &&
printf "\377\376" >test.utf16lebom.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
@@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
text="hallo there!\nroundtrip test here!" &&
printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+ printf "$text" | write_utf16 >roundtrip.utf16 &&
echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..5d9a917bc8 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
bom_str = utf16_be_bom;
bom_len = sizeof(utf16_be_bom);
out_encoding = "UTF-16BE";
+#ifdef ICONV_OMITS_BOM
+ } else if (same_utf_encoding("UTF-16", out_encoding)) {
+ bom_str = utf16_be_bom;
+ bom_len = sizeof(utf16_be_bom);
+ out_encoding = "UTF-16BE";
+ } else if (same_utf_encoding("UTF-32", out_encoding)) {
+ bom_str = utf32_be_bom;
+ bom_len = sizeof(utf32_be_bom);
+ out_encoding = "UTF-32BE";
+#endif
}
conv = iconv_open(out_encoding, in_encoding);
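
The utf8.c hunk above boils down to a small mapping from the requested
encoding to the encoding handed to iconv plus the BOM to write first.
As a rough model only, here it is in shell rather than C: pick_encoding
is a hypothetical function, its first argument stands in for whether
ICONV_OMITS_BOM is defined, and the exact-match comparison is a
simplification of what same_utf_encoding() does.

```shell
# Shell model of the encoding rewrite.  Prints the encoding to pass to
# iconv, followed by the BOM bytes (in hex) to write first, if any.
# $1 = 1 models a build with ICONV_OMITS_BOM defined; $2 = requested encoding.
pick_encoding () {
	omits_bom=$1 enc=$2
	case "$enc" in
	UTF-16LE-BOM) echo "UTF-16LE fffe" ;;
	UTF-16BE-BOM) echo "UTF-16BE feff" ;;
	UTF-16)
		if test "$omits_bom" = 1
		then
			echo "UTF-16BE feff"
		else
			echo "UTF-16"
		fi ;;
	UTF-32)
		if test "$omits_bom" = 1
		then
			echo "UTF-32BE 0000feff"
		else
			echo "UTF-32"
		fi ;;
	*) echo "$enc" ;;
	esac
}

pick_encoding 1 UTF-16
pick_encoding 0 UTF-16
```

When the knob is off, plain "UTF-16" and "UTF-32" pass through
untouched, which matches the stated goal of preserving the platform's
own behavior.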
* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
2019-02-11 1:26 ` [PATCH v3] " brian m. carlson
@ 2019-02-11 21:43 ` Kevin Daudt
2019-02-11 23:58 ` brian m. carlson
0 siblings, 1 reply; 30+ messages in thread
From: Kevin Daudt @ 2019-02-11 21:43 UTC (permalink / raw)
To: brian m. carlson
Cc: git, Lars Schneider, Junio C Hamano, Eric Sunshine,
Torsten Bögershausen, Rich Felker
On Mon, Feb 11, 2019 at 01:26:39AM +0000, brian m. carlson wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
>
> Most systems' iconv implementations choose to write it with a BOM in
> some endianness, since this is the most foolproof, and it is resistant
> to misinterpretation on Windows, where UTF-16 and the little-endian
> serialization are very common. For compatibility with Windows and to
> avoid accidental misuse there, Git always wants to write UTF-16 with a
> BOM, and will refuse to read UTF-16 without it.
>
> However, musl's iconv implementation writes UTF-16 without a BOM,
> relying on the user to interpret it as big-endian. This causes t0028 and
> the related functionality to fail, since Git won't read the file without
> a BOM.
>
> Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.
>
> Update the tests to detect this case and write test data with an added
> BOM if necessary. Always write the BOM in the tests in big-endian
> format, since all iconv implementations that omit a BOM must use
> big-endian serialization according to the Unicode standard.
>
> Preserve the existing behavior for systems which do not have this knob
> enabled, since they may use optimized implementations, including
> defaulting to the native endianness, which may improve performance.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> Makefile | 7 +++++++
> t/t0028-working-tree-encoding.sh | 30 +++++++++++++++++++++++++++---
> utf8.c | 10 ++++++++++
> 3 files changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 571160a2c4..c6172450af 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -259,6 +259,10 @@ all::
> # Define OLD_ICONV if your library has an old iconv(), where the second
> # (input buffer pointer) parameter is declared with type (const char **).
> #
> +# Define ICONV_OMITS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
> +# big-endian format.
> +#
> # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
> #
> # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
> @@ -1415,6 +1419,9 @@ ifndef NO_ICONV
> EXTLIBS += $(ICONV_LINK) -liconv
> endif
> endif
> +ifdef ICONV_OMITS_BOM
> + BASIC_CFLAGS += -DICONV_OMITS_BOM
> +endif
> ifdef NEEDS_LIBGEN
> EXTLIBS += -lgen
> endif
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index e58ecbfc44..8936ba6757 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
>
> GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
>
> +test_lazy_prereq NO_UTF16_BOM '
> + test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> + test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> + if test_have_prereq NO_UTF16_BOM
> + then
> + printf '\xfe\xff'
> + fi &&
> + iconv -f UTF-8 -t UTF-16
> +}
> +
> +write_utf32 () {
> + if test_have_prereq NO_UTF32_BOM
> + then
> + printf '\x00\x00\xfe\xff'
> + fi &&
> + iconv -f UTF-8 -t UTF-32
> +}
> +
> test_expect_success 'setup test files' '
> git config core.eol lf &&
>
> @@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
> echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
> echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
> printf "$text" >test.utf8.raw &&
> - printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> - printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
> + printf "$text" | write_utf16 >test.utf16.raw &&
> + printf "$text" | write_utf32 >test.utf32.raw &&
> printf "\377\376" >test.utf16lebom.raw &&
> printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
>
> @@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
>
> text="hallo there!\nroundtrip test here!" &&
> printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> - printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> + printf "$text" | write_utf16 >roundtrip.utf16 &&
> echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
>
> # SHIFT-JIS encoded files are round-trip checked by default...
> diff --git a/utf8.c b/utf8.c
> index 83824dc2f4..5d9a917bc8 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
> bom_str = utf16_be_bom;
> bom_len = sizeof(utf16_be_bom);
> out_encoding = "UTF-16BE";
> +#ifdef ICONV_OMITS_BOM
> + } else if (same_utf_encoding("UTF-16", out_encoding)) {
> + bom_str = utf16_be_bom;
> + bom_len = sizeof(utf16_be_bom);
> + out_encoding = "UTF-16BE";
> + } else if (same_utf_encoding("UTF-32", out_encoding)) {
> + bom_str = utf32_be_bom;
> + bom_len = sizeof(utf32_be_bom);
> + out_encoding = "UTF-32BE";
> +#endif
> }
>
> conv = iconv_open(out_encoding, in_encoding);
With some additional fixes, this indeed does solve the issue for Alpine
Linux, thanks.
I had to fix the following as well:
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 8936ba6757..8491f216aa 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -148,8 +148,8 @@ do
test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
test_when_finished "git reset --hard HEAD^" &&
- cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
- cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+ cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
+ cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
cp crlf.utf${i}.raw eol.utf${i} &&
cat >expectIndexLF <<-EOF &&
* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
2019-02-11 21:43 ` Kevin Daudt
@ 2019-02-11 23:58 ` brian m. carlson
2019-02-12 0:31 ` Junio C Hamano
0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-11 23:58 UTC (permalink / raw)
To: Kevin Daudt, git, Lars Schneider, Junio C Hamano, Eric Sunshine,
Torsten Bögershausen, Rich Felker
On Mon, Feb 11, 2019 at 10:43:06PM +0100, Kevin Daudt wrote:
> With some additional fixes, this indeed does solve the issue for Alpine
> Linux, thanks.
>
> I had to fix the following as well:
>
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index 8936ba6757..8491f216aa 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -148,8 +148,8 @@ do
> test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
> test_when_finished "git reset --hard HEAD^" &&
>
> - cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
> - cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
> + cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
> + cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
> cp crlf.utf${i}.raw eol.utf${i} &&
>
> cat >expectIndexLF <<-EOF &&
I'll squash in this fix, thanks.
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
2019-02-11 23:58 ` brian m. carlson
@ 2019-02-12 0:31 ` Junio C Hamano
2019-02-12 0:53 ` brian m. carlson
0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2019-02-12 0:31 UTC (permalink / raw)
To: brian m. carlson
Cc: Kevin Daudt, git, Lars Schneider, Eric Sunshine,
Torsten Bögershausen, Rich Felker
"brian m. carlson" <sandals@crustytoothpaste.net> writes:
>> - cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
>> - cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
>> + cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
>> + cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
>> cp crlf.utf${i}.raw eol.utf${i} &&
>>
>> cat >expectIndexLF <<-EOF &&
>
> I'll squash in this fix, thanks.
Thanks, all. In the meantime, what I've pushed out has this
applied immediately on top. Unless there is anything else, I could
squash it in in my next pushout I plan to do tonight, before getting
ready to tag -rc1 tomorrow.
* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
2019-02-12 0:31 ` Junio C Hamano
@ 2019-02-12 0:53 ` brian m. carlson
2019-02-12 2:43 ` Junio C Hamano
0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-12 0:53 UTC (permalink / raw)
To: Junio C Hamano
Cc: Kevin Daudt, git, Lars Schneider, Eric Sunshine,
Torsten Bögershausen, Rich Felker
On Mon, Feb 11, 2019 at 04:31:00PM -0800, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> >> - cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
> >> - cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
> >> + cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
> >> + cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
> >> cp crlf.utf${i}.raw eol.utf${i} &&
> >>
> >> cat >expectIndexLF <<-EOF &&
> >
> > I'll squash in this fix, thanks.
>
> Thanks, all. In the meantime, what I've pushed out has this
> applied immediately on top. Unless there is anything else, I could
> squash it in in my next pushout I plan to do tonight, before getting
> ready to tag -rc1 tomorrow.
I've just sent a v4 with this squashed in. Whether you want to pick that
up or squash this into v3 is up to you.
--
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
2019-02-12 0:53 ` brian m. carlson
@ 2019-02-12 2:43 ` Junio C Hamano
0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2019-02-12 2:43 UTC (permalink / raw)
To: brian m. carlson
Cc: Kevin Daudt, git, Lars Schneider, Eric Sunshine,
Torsten Bögershausen, Rich Felker
"brian m. carlson" <sandals@crustytoothpaste.net> writes:
> I've just sent a v4 with this squashed in. Whether you want to pick that
> up or squash this into v3 is up to you.
Let's take yours, as there is no point doing an eval for these two;
for that matter, braces around ${i} are also pointless, but I'll let
them pass.
Thanks.
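To illustrate the point about eval (a minimal sketch; the two functions below are stand-ins for the write_utf16/write_utf32 helpers in the patch, not the real ones): the shell expands $i before it looks up the command name, so the function can be called directly.

```shell
# Stand-ins for the helpers from the patch; they only report which one ran.
write_utf16 () { echo "called write_utf16"; }
write_utf32 () { echo "called write_utf32"; }

for i in 16 32
do
	# Parameter expansion happens before command lookup, so this runs
	# write_utf16 and then write_utf32 without any eval.
	write_utf$i
done
```

The braces in ${i} matter only when the next character could be part of a variable name; in "lf.utf$i.raw" the following "." already ends the name.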
* [PATCH v4] utf8: handle systems that don't write BOM for UTF-16
2019-02-09 20:08 ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
` (3 preceding siblings ...)
2019-02-11 1:26 ` [PATCH v3] " brian m. carlson
@ 2019-02-12 0:52 ` brian m. carlson
4 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-12 0:52 UTC (permalink / raw)
To: git
Cc: Lars Schneider, Junio C Hamano, Eric Sunshine,
Torsten Bögershausen, Rich Felker, Kevin Daudt
When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.
Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.
However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.
Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.
Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.
Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
Makefile | 7 +++++++
t/t0028-working-tree-encoding.sh | 34 +++++++++++++++++++++++++++-----
utf8.c | 14 +++++++++++++
3 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/Makefile b/Makefile
index 571160a2c4..c6172450af 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,10 @@ all::
# Define OLD_ICONV if your library has an old iconv(), where the second
# (input buffer pointer) parameter is declared with type (const char **).
#
+# Define ICONV_OMITS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
+# big-endian format.
+#
# Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
#
# Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1419,9 @@ ifndef NO_ICONV
EXTLIBS += $(ICONV_LINK) -liconv
endif
endif
+ifdef ICONV_OMITS_BOM
+ BASIC_CFLAGS += -DICONV_OMITS_BOM
+endif
ifdef NEEDS_LIBGEN
EXTLIBS += -lgen
endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..500229a9bd 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+test_lazy_prereq NO_UTF16_BOM '
+ test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+ test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+ if test_have_prereq NO_UTF16_BOM
+ then
+ printf '\xfe\xff'
+ fi &&
+ iconv -f UTF-8 -t UTF-16
+}
+
+write_utf32 () {
+ if test_have_prereq NO_UTF32_BOM
+ then
+ printf '\x00\x00\xfe\xff'
+ fi &&
+ iconv -f UTF-8 -t UTF-32
+}
+
test_expect_success 'setup test files' '
git config core.eol lf &&
@@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
printf "$text" >test.utf8.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
- printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+ printf "$text" | write_utf16 >test.utf16.raw &&
+ printf "$text" | write_utf32 >test.utf32.raw &&
printf "\377\376" >test.utf16lebom.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
@@ -124,8 +148,8 @@ do
test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
test_when_finished "git reset --hard HEAD^" &&
- cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
- cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+ cat lf.utf8.raw | write_utf${i} >lf.utf${i}.raw &&
+ cat crlf.utf8.raw | write_utf${i} >crlf.utf${i}.raw &&
cp crlf.utf${i}.raw eol.utf${i} &&
cat >expectIndexLF <<-EOF &&
@@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
text="hallo there!\nroundtrip test here!" &&
printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
- printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+ printf "$text" | write_utf16 >roundtrip.utf16 &&
echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..3b42fadffd 100644
--- a/utf8.c
+++ b/utf8.c
@@ -559,6 +559,10 @@ char *reencode_string_len(const char *in, size_t insz,
/*
* For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
* Some users under Windows want the little endian version
+ *
+ * We handle UTF-16 and UTF-32 ourselves only if the platform does not
+ * provide a BOM (which we require), since we want to match the behavior
+ * of the system tools and libc as much as possible.
*/
if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
bom_str = utf16_le_bom;
@@ -568,6 +572,16 @@ char *reencode_string_len(const char *in, size_t insz,
bom_str = utf16_be_bom;
bom_len = sizeof(utf16_be_bom);
out_encoding = "UTF-16BE";
+#ifdef ICONV_OMITS_BOM
+ } else if (same_utf_encoding("UTF-16", out_encoding)) {
+ bom_str = utf16_be_bom;
+ bom_len = sizeof(utf16_be_bom);
+ out_encoding = "UTF-16BE";
+ } else if (same_utf_encoding("UTF-32", out_encoding)) {
+ bom_str = utf32_be_bom;
+ bom_len = sizeof(utf32_be_bom);
+ out_encoding = "UTF-32BE";
+#endif
}
conv = iconv_open(out_encoding, in_encoding);
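The NO_UTF16_BOM prerequisite above infers a missing BOM from the byte count: "abc" is three 2-byte code units, so 6 bytes means no BOM and 8 bytes means a BOM was written (12 versus 16 for UTF-32). An alternative, sketched here with a hypothetical helper that is not part of the patch, is to look at the first two bytes directly. It is meant for UTF-16 only, since a little-endian UTF-32 BOM also begins with ff fe.

```shell
# Hypothetical helper: report which UTF-16 BOM, if any, starts stdin.
utf16_bom_kind () {
	# od prints the first two bytes in hex; tr strips spacing and newline.
	case "$(od -An -tx1 -N2 | tr -d ' \n')" in
	feff) echo big-endian ;;
	fffe) echo little-endian ;;
	*)    echo none ;;
	esac
}

printf '\376\377\000a' | utf16_bom_kind	# BOM FE FF, then U+0061
printf '\000a\000b'    | utf16_bom_kind	# BOM-less big-endian, as musl writes it
```

The first call should report big-endian and the second none. The second case is the kind of stream that Git refuses to read and that ICONV_OMITS_BOM is meant to avoid producing.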