git@vger.kernel.org mailing list mirror (one of many)
* t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
@ 2019-02-07 21:59 Kevin Daudt
  2019-02-08  0:17 ` brian m. carlson
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Daudt @ 2019-02-07 21:59 UTC (permalink / raw)
  To: git; +Cc: larsxschneider

I'm trying to get the git test suite passing on Alpine Linux, which is
based on musl libc.

All tests in t0028-working-tree-encoding.sh are currently failing,
because musl's iconv does not support stateful output for UTF-16/32
(e.g., it does not output a BOM), while git expects one to be present:

> hint: The file 'test.utf16' is missing a byte order mark (BOM). Please
> use UTF-16BE or UTF-16LE (depending on the byte order) as
> working-tree-encoding.
> fatal: BOM is required in 'test.utf16' if encoded as utf-16

Because adding the file to git fails, all the other tests fail as well,
since they expect the file to be present in the repository.
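
For reference, a quick way to see the difference is to count the bytes
iconv writes for a short ASCII string (an iconv that emits a BOM gives
8 bytes here, one that omits it gives 6):

    printf abc | iconv -f UTF-8 -t UTF-16 | wc -c
    printf abc | iconv -f UTF-8 -t UTF-16 | od -An -tx1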

Any idea how to get around this?

Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-07 21:59 t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Kevin Daudt
@ 2019-02-08  0:17 ` brian m. carlson
  2019-02-08  6:04   ` Rich Felker
  0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-08  0:17 UTC (permalink / raw)
  To: Kevin Daudt; +Cc: git, larsxschneider

[-- Attachment #1: Type: text/plain, Size: 1953 bytes --]

[Please skip using Reply-To and instead use Mail-Followup-To so that
responses also go to the list.]

On Thu, Feb 07, 2019 at 10:59:35PM +0100, Kevin Daudt wrote:
> I'm trying to get the git test suite passing on Alpine Linux, which is
> based on musl libc.
> 
> All tests in t0028-working-tree-encoding.sh are currently failing,
> because musl's iconv does not support stateful output for UTF-16/32
> (e.g., it does not output a BOM), while git expects one to be present:
> 
> > hint: The file 'test.utf16' is missing a byte order mark (BOM). Please
> > use UTF-16BE or UTF-16LE (depending on the byte order) as
> > working-tree-encoding.
> > fatal: BOM is required in 'test.utf16' if encoded as utf-16
> 
> Because adding the file to git fails, all the other tests fail as well,
> since they expect the file to be present in the repository.
> 
> Any idea how to get around this?

I think musl needs to patch their libc. RFC 2781 says that if there's no
BOM in UTF-16, then "the text SHOULD be interpreted as being
big-endian."

Unfortunately for all of us, many Windows-based programs have chosen to
ignore that advice (technically, it's only a SHOULD) and interpret it as
little-endian instead. Git can't safely assume anything about the
endianness of a UTF-16 stream that doesn't contain a BOM. Technically,
since the RFC doesn't specify a MUST requirement, musl can't, either.

Even if Git were to produce a BOM to work around this issue, then we'd
still have the problem that any program using musl will write data in
UTF-16 without a BOM. Moreover, because musl, in violation of the RFC,
doesn't read and process BOMs, someone using little-endian UTF-16 (with
a proper BOM) with musl and Git will have their data corrupted,
according to my reading of the musl website.

In other words, I believe this test is failing legitimately.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08  0:17 ` brian m. carlson
@ 2019-02-08  6:04   ` Rich Felker
  2019-02-08 11:45     ` brian m. carlson
  2019-02-09  8:09     ` Torsten Bögershausen
  0 siblings, 2 replies; 30+ messages in thread
From: Rich Felker @ 2019-02-08  6:04 UTC (permalink / raw)
  To: brian m. carlson, Kevin Daudt, git, larsxschneider

On Fri, Feb 08, 2019 at 12:17:05AM +0000, brian m. carlson wrote:
> [Please skip using Reply-To and instead use Mail-Followup-To so that
> responses also go to the list.]
> 
> On Thu, Feb 07, 2019 at 10:59:35PM +0100, Kevin Daudt wrote:
> > I'm trying to get the git test suite passing on Alpine Linux, which is
> > based on musl libc.
> > 
> > All tests in t0028-working-tree-encoding.sh are currently failing,
> > because musl's iconv does not support stateful output for UTF-16/32
> > (e.g., it does not output a BOM), while git expects one to be present:
> > 
> > > hint: The file 'test.utf16' is missing a byte order mark (BOM). Please
> > > use UTF-16BE or UTF-16LE (depending on the byte order) as
> > > working-tree-encoding.
> > > fatal: BOM is required in 'test.utf16' if encoded as utf-16
> > 
> > Because adding the file to git fails, all the other tests fail as well,
> > since they expect the file to be present in the repository.
> > 
> > Any idea how to get around this?
> 
> I think musl needs to patch their libc. RFC 2781 says that if there's no
> BOM in UTF-16, then "the text SHOULD be interpreted as being
> big-endian."
> 
> Unfortunately for all of us, many Windows-based programs have chosen to
> ignore that advice (technically, it's only a SHOULD) and interpret it as
> little-endian instead. Git can't safely assume anything about the
> endianness of a UTF-16 stream that doesn't contain a BOM. Technically,
> since the RFC doesn't specify a MUST requirement, musl can't, either.
> 
> Even if Git were to produce a BOM to work around this issue, then we'd
> still have the problem that any program using musl will write data in
> UTF-16 without a BOM. Moreover, because musl, in violation of the RFC,
> doesn't read and process BOMs, someone using little-endian UTF-16 (with
> a proper BOM) with musl and Git will have their data corrupted,
> according to my reading of the musl website.

That information is outdated and someone from our side should update
it; since 1.1.19, musl treats "UTF-16" input as ambiguous endianness
determined by BOM, defaulting to big if there's no BOM. However output
is always big endian, such that processes conforming to the Unicode
SHOULD clause will interpret it correctly.

The portable way to get little endian with a BOM is to open a
conversion descriptor for "UTF-16LE" (which should not add any BOM)
and write a BOM manually.
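
In shell terms, a minimal sketch of that approach (0xFF 0xFE is the
little-endian BOM; the file name is just an example):

    printf '\377\376' >file.utf16le
    printf 'hello\n' | iconv -f UTF-8 -t UTF-16LE >>file.utf16le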

In any case, this test seems mainly relevant to Windows users wanting
to store source files in UTF-16LE with BOM. This doesn't really make
sense to do on a Linux/musl system, so I'm not sure any action is
needed here from either side.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08  6:04   ` Rich Felker
@ 2019-02-08 11:45     ` brian m. carlson
  2019-02-08 11:55       ` Kevin Daudt
  2019-02-09  8:09     ` Torsten Bögershausen
  1 sibling, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-08 11:45 UTC (permalink / raw)
  To: Rich Felker; +Cc: Kevin Daudt, git, larsxschneider

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

On Fri, Feb 08, 2019 at 01:04:03AM -0500, Rich Felker wrote:
> That information is outdated and someone from our side should update
> it; since 1.1.19, musl treats "UTF-16" input as ambiguous endianness
> determined by BOM, defaulting to big if there's no BOM. However output
> is always big endian, such that processes conforming to the Unicode
> SHOULD clause will interpret it correctly.

It's good to hear that musl now supports parsing UTF-16 BOMs.

> The portable way to get little endian with a BOM is to open a
> conversion descriptor for "UTF-16LE" (which should not add any BOM)
> and write a BOM manually.

Right, I think my point is that we have existing systems which we know
ignore the SHOULD and assume something different. Perhaps in retrospect,
it would have been better to use MUST to specify areas where
interoperability is a concern.

> In any case, this test seems mainly relevant to Windows users wanting
> to store source files in UTF-16LE with BOM. This doesn't really make
> sense to do on a Linux/musl system, so I'm not sure any action is
> needed here from either side.

I do know that some people use CIFS or the like to share repositories
between Unix and Windows. However, I agree that such people aren't
likely to use UTF-16 on Unix systems. The working tree encoding
functionality also supports other encodings which musl may or may not
support.
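
(For context, the feature is driven by gitattributes entries along
these lines; the patterns are only hypothetical examples:)

    echo "*.vcxproj text working-tree-encoding=UTF-16" >>.gitattributes
    echo "*.ps1 text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes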

If you and your users are comfortable with the fact that the test (and
the corresponding functionality) won't work as expected with UTF-16,
then I agree that no action is needed.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 11:45     ` brian m. carlson
@ 2019-02-08 11:55       ` Kevin Daudt
  2019-02-08 13:51         ` brian m. carlson
  2019-02-08 16:13         ` t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Rich Felker
  0 siblings, 2 replies; 30+ messages in thread
From: Kevin Daudt @ 2019-02-08 11:55 UTC (permalink / raw)
  To: brian m. carlson, Rich Felker, Kevin Daudt, git, larsxschneider

On Fri, Feb 08, 2019 at 11:45:02AM +0000, brian m. carlson wrote:
> On Fri, Feb 08, 2019 at 01:04:03AM -0500, Rich Felker wrote:
> [..]
> > In any case, this test seems mainly relevant to Windows users wanting
> > to store source files in UTF-16LE with BOM. This doesn't really make
> > sense to do on a Linux/musl system, so I'm not sure any action is
> > needed here from either side.
> 
> I do know that some people use CIFS or the like to share repositories
> between Unix and Windows. However, I agree that such people aren't
> likely to use UTF-16 on Unix systems. The working tree encoding
> functionality also supports other encodings which musl may or may not
> support.
> 
> If you and your users are comfortable with the fact that the test (and
> the corresponding functionality) won't work as expected with UTF-16,
> then I agree that no action is needed.

So would you suggest that we just skip this test on Alpine Linux?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 11:55       ` Kevin Daudt
@ 2019-02-08 13:51         ` brian m. carlson
  2019-02-08 17:50           ` Junio C Hamano
  2019-02-08 16:13         ` t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Rich Felker
  1 sibling, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-08 13:51 UTC (permalink / raw)
  To: Kevin Daudt; +Cc: Rich Felker, Kevin Daudt, git, larsxschneider

[-- Attachment #1: Type: text/plain, Size: 2085 bytes --]

On Fri, Feb 08, 2019 at 12:55:11PM +0100, Kevin Daudt wrote:
> On Fri, Feb 08, 2019 at 11:45:02AM +0000, brian m. carlson wrote:
> > On Fri, Feb 08, 2019 at 01:04:03AM -0500, Rich Felker wrote:
> > [..]
> > > In any case, this test seems mainly relevant to Windows users wanting
> > > to store source files in UTF-16LE with BOM. This doesn't really make
> > > sense to do on a Linux/musl system, so I'm not sure any action is
> > > needed here from either side.
> > 
> > I do know that some people use CIFS or the like to share repositories
> > between Unix and Windows. However, I agree that such people aren't
> > likely to use UTF-16 on Unix systems. The working tree encoding
> > functionality also supports other encodings which musl may or may not
> > support.
> > 
> > If you and your users are comfortable with the fact that the test (and
> > the corresponding functionality) won't work as expected with UTF-16,
> > then I agree that no action is needed.
> 
> So would you suggest that we just skip this test on Alpine Linux?

That's not exactly what I said. If Alpine Linux users are never going to
use this functionality and don't care that it's broken, then that's a
fine solution.

As originally mentioned, musl could change its libiconv to write a BOM,
which would make it compatible with other known iconv implementations.

There's also the possibility of defining NO_ICONV. That basically means
that your system won't support encodings, and then this test shouldn't
matter.
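
(For reference, that is just a build-time switch, e.g.:)

    make NO_ICONV=YesPlease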

Finally, you could try applying a patch to the test to make it write the
BOM for UTF-16 since your iconv doesn't. I expect that the test will
fail again later on once you've done that, though.

I don't have musl so I can't test a patch, but theoretically, you could
use a Makefile variable and lazy test prerequisite to control the
behavior of the code and test, and we could apply a patch. I'll try to
work something up later today which might work and which you could test.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 11:55       ` Kevin Daudt
  2019-02-08 13:51         ` brian m. carlson
@ 2019-02-08 16:13         ` Rich Felker
  1 sibling, 0 replies; 30+ messages in thread
From: Rich Felker @ 2019-02-08 16:13 UTC (permalink / raw)
  To: Kevin Daudt; +Cc: brian m. carlson, Kevin Daudt, git, larsxschneider

On Fri, Feb 08, 2019 at 12:55:11PM +0100, Kevin Daudt wrote:
> On Fri, Feb 08, 2019 at 11:45:02AM +0000, brian m. carlson wrote:
> > On Fri, Feb 08, 2019 at 01:04:03AM -0500, Rich Felker wrote:
> > [..]
> > > In any case, this test seems mainly relevant to Windows users wanting
> > > to store source files in UTF-16LE with BOM. This doesn't really make
> > > sense to do on a Linux/musl system, so I'm not sure any action is
> > > needed here from either side.
> > 
> > I do know that some people use CIFS or the like to share repositories
> > between Unix and Windows. However, I agree that such people aren't
> > likely to use UTF-16 on Unix systems. The working tree encoding
> > functionality also supports other encodings which musl may or may not
> > support.
> > 
> > If you and your users are comfortable with the fact that the test (and
> > the corresponding functionality) won't work as expected with UTF-16,
> > then I agree that no action is needed.
> 
> So would you suggest that we just skip this test on Alpine Linux?

I'm fine with that as an outcome here. Admittedly I'd rather see git
do this in a way that doesn't make assumptions about what an ambiguous
"UTF-16" encoding argument to iconv_open does, but nothing is actually
breaking because of this.

Rich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 13:51         ` brian m. carlson
@ 2019-02-08 17:50           ` Junio C Hamano
  2019-02-08 20:23             ` Kevin Daudt
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2019-02-08 17:50 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Kevin Daudt, Rich Felker, Kevin Daudt, git, larsxschneider

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>> So would you suggest that we just skip this test on Alpine Linux?
>
> That's not exactly what I said. If Alpine Linux users are never going to
> use this functionality and don't care that it's broken, then that's a
> fine solution.
>
> As originally mentioned, musl could change its libiconv to write a BOM,
> which would make it compatible with other known iconv implementations.
>
> There's also the possibility of defining NO_ICONV. That basically means
> that your system won't support encodings, and then this test shouldn't
> matter.
>
> Finally, you could try applying a patch to the test to make it write the
> BOM for UTF-16 since your iconv doesn't. I expect that the test will
> fail again later on once you've done that, though.

Sorry for being late to the party, but is the crux of the issue this
piece early in the test?

    printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
    ...
    cp test.utf16.raw test.utf16 &&
    ...
    git add .gitattributes test.utf16 test.utf16lebom &&

where we expect "iconv -t UTF-16" to mean "write UTF16 in whatever
byte order of your choice, but do write a BOM", and iconv
implementations we have seen so far are in line with that
expectation, but the one on Alpine writes UTF16 in big endian
without a BOM?

If that is the case, I think it is our expectation that is at fault
here, as I think the most natural interpretation of "UTF-16"
without any modifiers (like "BE") ought to be "UTF16 stream
expressed in any way of the writer's choice, as long as it is readable
by standard-compliant readers", in other words, "write UTF16 in
whatever byte order you choose, with or without a BOM, but if you
omit the BOM, you SHOULD write in big endian".  So

 - If our later test assumes that test.utf16 is UTF16 with BOM, that
   already assumes too much;

 - If our later test assumes that test.utf16 is UTF16 in big endian,
   that assumes too much, too.

As suggested earlier in the thread, the easiest workaround would be
to update the preparation of test.utf16.raw to force big endian with
a BOM by prepending the BE BOM by hand before the "iconv -t UTF-16BE"
output (I am assuming that UTF-16BE will stay "big endian without
BOM" in the future).  That would make sure that the assumption later
tests make about test.utf16 holds true.
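
Something along these lines in the setup step, I would imagine (just a
sketch; "\376\377" is the big-endian BOM written in octal):

    printf "\376\377"                           >test.utf16.raw &&
    printf "$text" | iconv -f UTF-8 -t UTF-16BE >>test.utf16.raw &&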

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 17:50           ` Junio C Hamano
@ 2019-02-08 20:23             ` Kevin Daudt
  2019-02-08 20:42               ` brian m. carlson
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Daudt @ 2019-02-08 20:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Rich Felker, git, larsxschneider

On Fri, Feb 08, 2019 at 09:50:07AM -0800, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >> So would you suggest that we just skip this test on Alpine Linux?
> >
> > That's not exactly what I said. If Alpine Linux users are never going to
> > use this functionality and don't care that it's broken, then that's a
> > fine solution.
> >
> > As originally mentioned, musl could change its libiconv to write a BOM,
> > which would make it compatible with other known iconv implementations.
> >
> > There's also the possibility of defining NO_ICONV. That basically means
> > that your system won't support encodings, and then this test shouldn't
> > matter.
> >
> > Finally, you could try applying a patch to the test to make it write the
> > BOM for UTF-16 since your iconv doesn't. I expect that the test will
> > fail again later on once you've done that, though.
> 
> Sorry for being late to the party, but is the crux of the issue this
> piece early in the test?
> 
>     printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
>     ...
>     cp test.utf16.raw test.utf16 &&
>     ...
>     git add .gitattributes test.utf16 test.utf16lebom &&
> 
> where we expect "iconv -t UTF-16" to mean "write UTF16 in whatever
> byte order of your choice, but do write a BOM", and iconv
> implementations we have seen so far are in line with that
> expectation, but the one on Alpine writes UTF16 in big endian
> without a BOM?

Firstly, the tests expect iconv -t UTF-16 to output a BOM, which it
indeed does not do on Alpine. Secondly, git itself also expects the BOM
to be present when the encoding is set to UTF-16, otherwise it will
complain.

> 
> If that is the case, I think it is our expectation that is at fault
> here, as I think the most natural interpretation of "UTF-16"
> without any modifiers (like "BE") ought to be "UTF16 stream
> expressed in any way of the writer's choice, as long as it is readable
> by standard-compliant readers", in other words, "write UTF16 in
> whatever byte order you choose, with or without a BOM, but if you
> omit the BOM, you SHOULD write in big endian".  So
> 
>  - If our later test assumes that test.utf16 is UTF16 with BOM, that
>    already assumes too much;
> 
>  - If our later test assumes that test.utf16 is UTF16 in big endian,
>    that assumes too much, too.
> 
> As suggested earlier in the thread, the easiest workaround would be
> to update the preparation of test.utf16.raw to force big endian with
> a BOM by prepending the BE BOM by hand before the "iconv -t UTF-16BE"
> output (I am assuming that UTF-16BE will stay "big endian without
> BOM" in the future).  That would make sure that the assumption later
> tests make about test.utf16 holds true.

I tried changing the test to manually inject a BOM into the file (and
setting iconv to UTF-16LE / UTF-16BE), which lets the first test go
through, but test 3 then fails, because git itself outputs the file
without a BOM, presumably because it's passed through iconv.

So I'm not sure it's just a matter of fixing the tests.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 20:23             ` Kevin Daudt
@ 2019-02-08 20:42               ` brian m. carlson
  2019-02-08 23:12                 ` Junio C Hamano
  2019-02-09 14:57                 ` Kevin Daudt
  0 siblings, 2 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-08 20:42 UTC (permalink / raw)
  To: Junio C Hamano, Rich Felker, git, larsxschneider

[-- Attachment #1: Type: text/plain, Size: 3398 bytes --]

On Fri, Feb 08, 2019 at 09:23:36PM +0100, Kevin Daudt wrote:
> Firstly, the tests expect iconv -t UTF-16 to output a BOM, which it
> indeed does not do on Alpine. Secondly, git itself also expects the BOM
> to be present when the encoding is set to UTF-16, otherwise it will
> complain.

Yeah, we definitely want to require a BOM for UTF-16. As previously
mentioned, it isn't safe for us to assume big-endian when it's missing.

> I tried changing the test to manually inject a BOM into the file (and
> setting iconv to UTF-16LE / UTF-16BE), which lets the first test go
> through, but test 3 then fails, because git itself outputs the file
> without a BOM, presumably because it's passed through iconv.
> 
> So I'm not sure it's just a matter of fixing the tests.

I think something like the following will likely work in this scenario:

------ %< ---------
From: "brian m. carlson" <sandals@crustytoothpaste.net>
Date: Fri, 8 Feb 2019 12:58:11 +0000
Subject: [PATCH] WIP: utf8: handle missing musl UTF-16 BOM

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 t/t0028-working-tree-encoding.sh | 20 ++++++++++++++++++--
 utf8.c                           |  4 ++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..ff02d03bad 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,22 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
 
+test_lazy_prereq NO_BOM '
+	printf abc | iconv -f UTF-8 -t UTF-16 &&
+	test $(wc -c) = 6
+'
+
+write_utf16 () {
+	test_have_prereq NO_BOM && printf '\xfe\xff'
+	iconv -f UTF-8 -t UTF-16
+
+}
+
+write_utf32 () {
+	test_have_prereq NO_BOM && printf '\x00\x00\xfe\xff'
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
@@ -13,8 +29,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..4aa69cd65b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,10 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);
------ %< ---------

This passes for me on glibc, but only on a little-endian system. If this
works for musl folks, then I'll add a config option for those people who
have UTF-16 without BOM.
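
(To give it a spin, something like the following from a checkout with
the patch applied should do; -v and -i are the usual verbose and
stop-on-first-failure switches for the test scripts:)

    cd t && ./t0028-working-tree-encoding.sh -v -i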
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 20:42               ` brian m. carlson
@ 2019-02-08 23:12                 ` Junio C Hamano
  2019-02-09  0:24                   ` brian m. carlson
  2019-02-09 14:57                 ` Kevin Daudt
  1 sibling, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2019-02-08 23:12 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Rich Felker, git, larsxschneider

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> +test_lazy_prereq NO_BOM '
> +	printf abc | iconv -f UTF-8 -t UTF-16 &&
> +	test $(wc -c) = 6
> +'

This must be a "just for illustration of the idea" patch?  The pipeline
output goes to standard output, and nobody feeds "wc".

But I think I got the idea.

In the real implementation, it probably is a good idea to allow
NO_BOM16 and NO_BOM32 to be orthogonal.


> +
> +write_utf16 () {
> +	test_have_prereq NO_BOM && printf '\xfe\xff'
> +	iconv -f UTF-8 -t UTF-16

This assumes "iconv -t UTF-16" on the platform gives little endian
(with or without BOM), which may not be a good assumption.

If you are forcing the world to be one where UTF-16 (with no other
specification) means LE with BOM, then perhaps doing

	printf '\xfe\xff'; iconv -f UTF-8 -t UTF-16LE

without any lazy prereq may be more explicit and in line with what
you did in utf8.c::reencode_string_len() below.

> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
> +	printf "$text" | write_utf16 >test.utf16.raw &&
> +	printf "$text" | write_utf32 >test.utf32.raw &&

> diff --git a/utf8.c b/utf8.c
> index 83824dc2f4..4aa69cd65b 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -568,6 +568,10 @@ char *reencode_string_len(const char *in, size_t insz,
>  		bom_str = utf16_be_bom;
>  		bom_len = sizeof(utf16_be_bom);
>  		out_encoding = "UTF-16BE";
> +	} else if (same_utf_encoding("UTF-16", out_encoding)) {
> +		bom_str = utf16_le_bom;
> +		bom_len = sizeof(utf16_le_bom);
> +		out_encoding = "UTF-16LE";
>  	}

I am not sure what is going on here.  When the caller asks for
"UTF-16", we do not let the platform implementation of iconv()
pick one of the allowed ones (i.e. BE with BOM, LE with BOM, or BE
without BOM) but instead force LE with BOM?

>  	conv = iconv_open(out_encoding, in_encoding);
> ------ %< ---------
>
> This passes for me on glibc, but only on a little-endian system. If this
> works for musl folks, then I'll add a config option for those people who
> have UTF-16 without BOM.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 23:12                 ` Junio C Hamano
@ 2019-02-09  0:24                   ` brian m. carlson
  0 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-09  0:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Rich Felker, git, larsxschneider

[-- Attachment #1: Type: text/plain, Size: 3591 bytes --]

On Fri, Feb 08, 2019 at 03:12:22PM -0800, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > +test_lazy_prereq NO_BOM '
> > +	printf abc | iconv -f UTF-8 -t UTF-16 &&
> > +	test $(wc -c) = 6
> > +'
> 
> This must be a "just for illustration of the idea" patch?  The pipeline
> output goes to standard output, and nobody feeds "wc".

Well, as I said, I have no way to test this. This code path works fine
on glibc because of course we would never exercise the prerequisite.

But I do appreciate you pointing it out. I'll fix it and send another
test patch.

> But I think I got the idea.
> 
> In the real implementation, it probably is a good idea to allow
> NO_BOM16 and NO_BOM32 to be orthogonal.

Sure.

> > +
> > +write_utf16 () {
> > +	test_have_prereq NO_BOM && printf '\xfe\xff'
> > +	iconv -f UTF-8 -t UTF-16
> 
> This assumes "iconv -t UTF-16" on the platform gives little endian
> (with or without BOM), which may not be a good assumption.
> 
> > If you are forcing the world to be one where UTF-16 (with no other
> > specification) means LE with BOM, then perhaps doing
> 
> 	printf '\xfe\xff'; iconv -f UTF-8 -t UTF-16LE
> 
> without any lazy prereq may be more explicit and in line with what
> you did in utf8.c::reencode_string_len() below.

No, I believe it assumes big-endian (the BOM is U+FEFF). The only
justifiable endianness for a stream that doesn't include a BOM is
big-endian, because that's the only one supported by the RFC and the
Unicode standard.
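
(For reference, U+FEFF serializes as fe ff in UTF-16BE and ff fe in
UTF-16LE; the octal string below is U+FEFF encoded as UTF-8:)

    printf '\357\273\277' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1
    printf '\357\273\277' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1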

I can't actually write it without the prerequisite because I don't plan
to trigger the code I wrote below unless required. The endianness we
pick in the code and the tests has to be the same. If we're on glibc,
the code will use the glibc implementation, which is little-endian, and
therefore will need the tests to be little-endian as well.

I will explain this thoroughly in the commit message, because it is
indeed quite subtle.

> > diff --git a/utf8.c b/utf8.c
> > index 83824dc2f4..4aa69cd65b 100644
> > --- a/utf8.c
> > +++ b/utf8.c
> > @@ -568,6 +568,10 @@ char *reencode_string_len(const char *in, size_t insz,
> >  		bom_str = utf16_be_bom;
> >  		bom_len = sizeof(utf16_be_bom);
> >  		out_encoding = "UTF-16BE";
> > +	} else if (same_utf_encoding("UTF-16", out_encoding)) {
> > +		bom_str = utf16_le_bom;
> > +		bom_len = sizeof(utf16_le_bom);
> > +		out_encoding = "UTF-16LE";
> >  	}
> 
> I am not sure what is going on here.  When the caller asks for
> "UTF-16", we do not let the platform implementation of iconv()
> pick one of the allowed ones (i.e. BE with BOM, LE with BOM, or BE
> without BOM) but instead force LE with BOM?

This is more of a test scenario to see how it works for musl. My
proposal is that if this is sufficient to fix problems for musl, then we
wrap it inside a #define (and Makefile) knob, and let users with an
iconv that doesn't write the BOM turn it on. Now that I'm thinking about
it, it will probably need to be big-endian for compatibility with the
tests.

I plan to treat this as a platform-specific wart much like
FREAD_READS_DIRECTORIES. Inspecting the stream would be complicated and
not performant, and I'm not aware of any other iconv implementations
that have this behavior (because it causes unhappiness with Windows,
which is the primary consumer of UTF-16), so I think a compile-time
option is the way to go.

I'll try to reroll with a formal test patch this evening or tomorrow.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08  6:04   ` Rich Felker
  2019-02-08 11:45     ` brian m. carlson
@ 2019-02-09  8:09     ` Torsten Bögershausen
  1 sibling, 0 replies; 30+ messages in thread
From: Torsten Bögershausen @ 2019-02-09  8:09 UTC (permalink / raw)
  To: Rich Felker, brian m. carlson, Kevin Daudt, git, larsxschneider

On 08.02.19 07:04, Rich Felker wrote:
> On Fri, Feb 08, 2019 at 12:17:05AM +0000, brian m. carlson wrote:

[]
>> Even if Git were to produce a BOM to work around this issue, then we'd
>> still have the problem that any program using musl will write data in
>> UTF-16 without a BOM. Moreover, because musl, in violation of the RFC,
>> doesn't read and process BOMs, someone using little-endian UTF-16 (with
>> a proper BOM) with musl and Git will have their data corrupted,
>> according to my reading of the musl website.
>
> That information is outdated and someone from our side should update
> it; since 1.1.19, musl treats "UTF-16" input as ambiguous endianness
> determined by BOM, defaulting to big if there's no BOM. However output
> is always big endian, such that processes conforming to the Unicode
> SHOULD clause will interpret it correctly.
>
> The portable way to get little endian with a BOM is to open a
> conversion descriptor for "UTF-16LE" (which should not add any BOM)
> and write a BOM manually.
>

That is possible in the next upcoming version of Git:

commit 0fa3cc77ee9fb3b6bb53c73688c9b7500f996b83
Merge: cfd9167c15 aab2a1ae48
Author: Junio C Hamano <gitster@pobox.com>
Date:   Wed Feb 6 22:05:21 2019 -0800

    Merge branch 'tb/utf-16-le-with-explicit-bom'

    A new encoding UTF-16LE-BOM has been invented to force encoding to
    UTF-16 with BOM in little endian byte order, which cannot be directly
    generated by using iconv.

    * tb/utf-16-le-with-explicit-bom:
      Support working-tree-encoding "UTF-16LE-BOM"



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux)
  2019-02-08 20:42               ` brian m. carlson
  2019-02-08 23:12                 ` Junio C Hamano
@ 2019-02-09 14:57                 ` Kevin Daudt
  2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
  1 sibling, 1 reply; 30+ messages in thread
From: Kevin Daudt @ 2019-02-09 14:57 UTC (permalink / raw)
  To: brian m. carlson, Junio C Hamano, Rich Felker, git,
	larsxschneider

On Fri, Feb 08, 2019 at 08:42:19PM +0000, brian m. carlson wrote:
> On Fri, Feb 08, 2019 at 09:23:36PM +0100, Kevin Daudt wrote:
> > Firstly, the tests expect iconv -t UTF-16 to output a BOM, which it
> > indeed does not do on Alpine. Secondly, git itself also expects the BOM
> > to be present when the encoding is set to UTF-16, otherwise it will
> > complain.
> 
> Yeah, we definitely want to require a BOM for UTF-16. As previously
> mentioned, it isn't safe for us to assume big-endian when it's missing.
> 
> > I tried changing the test to manually inject a BOM into the file (and
> > setting iconv to UTF-16LE / UTF-16BE), which lets the first test go
> > through, but test 3 then fails, because git itself outputs the file
> > without a BOM, presumably because it's passed through iconv.
> > 
> > So I'm not sure it's just a matter of fixing the tests.
> 
> I think something like the following will likely work in this scenario:
> 
> [..]
> 
> This passes for me on glibc, but only on a little-endian system. If this
> works for musl folks, then I'll add a config option for those people who
> have UTF-16 without BOM.
> -- 
> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204

I tried your patch. The prerequisite is broken in its current form;
this would fix it:

    diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
    index ff02d03bad..734c5a7dbb 100755
    --- a/t/t0028-working-tree-encoding.sh
    +++ b/t/t0028-working-tree-encoding.sh
    @@ -7,8 +7,7 @@ test_description='working-tree-encoding conversion via gitattributes'
     GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING

     test_lazy_prereq NO_BOM '
    -       printf abc | iconv -f UTF-8 -t UTF-16 &&
    -       test $(wc -c) = 6
    +       test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
     '

     write_utf16 () {

But test 3 still fails, because now the output from git is converted to
UTF-16LE, which is different from the UTF-16BE input.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH] utf8: handle systems that don't write BOM for UTF-16
  2019-02-09 14:57                 ` Kevin Daudt
@ 2019-02-09 20:08                   ` brian m. carlson
  2019-02-10  1:45                     ` Eric Sunshine
                                       ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-09 20:08 UTC (permalink / raw)
  To: git; +Cc: larsxschneider, Rich Felker, Junio C Hamano, Kevin Daudt

When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.

Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.

Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.

Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.

Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, to gain improved performance, which
can be significant with large checkouts.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile                         |  6 ++++++
 t/t0028-working-tree-encoding.sh | 25 ++++++++++++++++++++++---
 utf8.c                           | 10 ++++++++++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 571160a2c4..b2a4765e5f 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,9 @@ all::
 # Define OLD_ICONV if your library has an old iconv(), where the second
 # (input buffer pointer) parameter is declared with type (const char **).
 #
+# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32.
+#
 # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
 #
 # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1418,9 @@ ifndef NO_ICONV
 		EXTLIBS += $(ICONV_LINK) -liconv
 	endif
 endif
+ifdef ICONV_NEEDS_BOM
+	BASIC_CFLAGS += -DICONV_NEEDS_BOM
+endif
 ifdef NEEDS_LIBGEN
 	EXTLIBS += -lgen
 endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..bfc4a9d4dd 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
 
+test_lazy_prereq NO_UTF16_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+	test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
+	iconv -f UTF-8 -t UTF-16
+
+}
+
+write_utf32 () {
+	test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
@@ -13,8 +32,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
@@ -223,7 +242,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
 
 	text="hallo there!\nroundtrip test here!" &&
 	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	printf "$text" | write_utf16 >roundtrip.utf16 &&
 	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
 
 	# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..133199de0e 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+#ifdef ICONV_NEEDS_BOM
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-32", out_encoding)) {
+		bom_str = utf32_be_bom;
+		bom_len = sizeof(utf32_be_bom);
+		out_encoding = "UTF-32BE";
+#endif
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
  2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
@ 2019-02-10  1:45                     ` Eric Sunshine
  2019-02-10 18:14                       ` brian m. carlson
  2019-02-10  8:04                     ` Torsten Bögershausen
                                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Eric Sunshine @ 2019-02-10  1:45 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Git List, Lars Schneider, Rich Felker, Junio C Hamano,
	Kevin Daudt

On Sat, Feb 9, 2019 at 3:08 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> [...]
> Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.

The name ICONV_NEEDS_BOM makes it sound as if we must feed a BOM
_into_ 'iconv', which is quite confusing since the actual intention is
that 'iconv' doesn't emit a BOM and we need to make up for the
deficiency. Using a name such as ICONV_OMITS_BOM or ICONV_NEGLECTS_BOM
makes it somewhat clearer that there is some deficiency with which we
need to deal.

> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/Makefile b/Makefile
> @@ -259,6 +259,9 @@ all::
> +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32.

Not a big deal, but I wonder if it would be helpful to tack on "...,
in which case it outputs big-endian unconditionally." or something.

> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
> +test_lazy_prereq NO_UTF16_BOM '
> +       test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> +       test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> +       test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
> +       iconv -f UTF-8 -t UTF-16
> +
> +}

Stray blank line before the closing brace.

> +
> +write_utf32 () {
> +       test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
> +       iconv -f UTF-8 -t UTF-32
> +}

It probably doesn't matter much with these two tiny functions, but I
was wondering if it would make sense to maintain the &&-chain, perhaps
like this:

    if test_have_prereq NO_UTF32_BOM
    then
        printf '\x00\x00\xfe\xff'
    fi &&
    iconv -f UTF-8 -t UTF-32

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
  2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
  2019-02-10  1:45                     ` Eric Sunshine
@ 2019-02-10  8:04                     ` Torsten Bögershausen
  2019-02-10 18:55                       ` brian m. carlson
  2019-02-11  0:23                     ` [PATCH v2] " brian m. carlson
                                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Torsten Bögershausen @ 2019-02-10  8:04 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, larsxschneider, Rich Felker, Junio C Hamano, Kevin Daudt

On Sat, Feb 09, 2019 at 08:08:01PM +0000, brian m. carlson wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
>
> Most systems' iconv implementations choose to write it with a BOM in
> some endianness, since this is the most foolproof, and it is resistant
> to misinterpretation on Windows, where UTF-16 and the little-endian
> serialization are very common. For compatibility with Windows and to
> avoid accidental misuse there, Git always wants to write UTF-16 with a
> BOM, and will refuse to read UTF-16 without it.
>
> However, musl's iconv implementation writes UTF-16 without a BOM,
> relying on the user to interpret it as big-endian. This causes t0028 and
> the related functionality to fail, since Git won't read the file without
> a BOM.
>
> Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.
>
> Update the tests to detect this case and write test data with an added
> BOM if necessary. Always write the BOM in the tests in big-endian
> format, since all iconv implementations that omit a BOM must use
> big-endian serialization according to the Unicode standard.
>
> Preserve the existing behavior for systems which do not have this knob
> enabled, since they may use optimized implementations, including
> defaulting to the native endianness, to gain improved performance, which
> can be significant with large checkouts.

Is this based on measurements from a real system?

I think we agree that Git will write UTF-16 always as big endian with BOM,
following the tradition of iconv/libiconv.
If yes, we can reduce the lines of code/#ifdefs somewhat, have the knob always on,
and reduce the maintenance burden a little bit, giving a simpler patch.

What do you think ?


diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..ef19c98e67 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -13,7 +13,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	printf "\376\377"                         >test.utf16.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16BE >>test.utf16.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..d3731273be 100644
--- a/utf8.c
+++ b/utf8.c
@@ -564,7 +564,8 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_le_bom;
 		bom_len = sizeof(utf16_le_bom);
 		out_encoding = "UTF-16LE";
-	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
+	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding) ||
+		   same_utf_encoding("UTF-16", out_encoding)) {
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";


>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  Makefile                         |  6 ++++++
>  t/t0028-working-tree-encoding.sh | 25 ++++++++++++++++++++++---
>  utf8.c                           | 10 ++++++++++
>  3 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 571160a2c4..b2a4765e5f 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -259,6 +259,9 @@ all::
>  # Define OLD_ICONV if your library has an old iconv(), where the second
>  # (input buffer pointer) parameter is declared with type (const char **).
>  #
> +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32.
> +#
>  # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
>  #
>  # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
> @@ -1415,6 +1418,9 @@ ifndef NO_ICONV
>  		EXTLIBS += $(ICONV_LINK) -liconv
>  	endif
>  endif
> +ifdef ICONV_NEEDS_BOM
> +	BASIC_CFLAGS += -DICONV_NEEDS_BOM
> +endif
>  ifdef NEEDS_LIBGEN
>  	EXTLIBS += -lgen
>  endif
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index e58ecbfc44..bfc4a9d4dd 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
>
>  GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
>
> +test_lazy_prereq NO_UTF16_BOM '
> +	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> +	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> +	test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
> +	iconv -f UTF-8 -t UTF-16
> +
> +}
> +
> +write_utf32 () {
> +	test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
> +	iconv -f UTF-8 -t UTF-32
> +}
> +
>  test_expect_success 'setup test files' '
>  	git config core.eol lf &&
>
> @@ -13,8 +32,8 @@ test_expect_success 'setup test files' '
>  	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
>  	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
>  	printf "$text" >test.utf8.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
> +	printf "$text" | write_utf16 >test.utf16.raw &&
> +	printf "$text" | write_utf32 >test.utf32.raw &&
>  	printf "\377\376"                         >test.utf16lebom.raw &&
>  	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
>
> @@ -223,7 +242,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
>
>  	text="hallo there!\nroundtrip test here!" &&
>  	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> +	printf "$text" | write_utf16 >roundtrip.utf16 &&
>  	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
>
>  	# SHIFT-JIS encoded files are round-trip checked by default...
> diff --git a/utf8.c b/utf8.c
> index 83824dc2f4..133199de0e 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
>  		bom_str = utf16_be_bom;
>  		bom_len = sizeof(utf16_be_bom);
>  		out_encoding = "UTF-16BE";
> +#ifdef ICONV_NEEDS_BOM
> +	} else if (same_utf_encoding("UTF-16", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";
> +	} else if (same_utf_encoding("UTF-32", out_encoding)) {
> +		bom_str = utf32_be_bom;
> +		bom_len = sizeof(utf32_be_bom);
> +		out_encoding = "UTF-32BE";
> +#endif
>  	}
>
>  	conv = iconv_open(out_encoding, in_encoding);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
  2019-02-10  1:45                     ` Eric Sunshine
@ 2019-02-10 18:14                       ` brian m. carlson
  0 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-10 18:14 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Lars Schneider, Rich Felker, Junio C Hamano,
	Kevin Daudt

[-- Attachment #1: Type: text/plain, Size: 1957 bytes --]

On Sat, Feb 09, 2019 at 08:45:16PM -0500, Eric Sunshine wrote:
> On Sat, Feb 9, 2019 at 3:08 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > [...]
> > Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> > iconv implementation has this behavior. When set, Git will write a BOM
> > manually for UTF-16 and UTF-32 and then force the data to be written in
> > UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> > tests use the raw "UTF-16" encoding, which will be big-endian when the
> > implementation requires this knob to be set.
> 
> The name ICONV_NEEDS_BOM makes it sound as if we must feed a BOM
> _into_ 'iconv', which is quite confusing since the actual intention is
> that 'iconv' doesn't emit a BOM and we need to make up for the
> deficiency. Using a name such as ICONV_OMITS_BOM or ICONV_NEGLECTS_BOM
> makes it somewhat clearer that there is some deficiency with which we
> need to deal.

That does sound like a better name.

> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> > ---
> > diff --git a/Makefile b/Makefile
> > @@ -259,6 +259,9 @@ all::
> > +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> > +# byte-order mark (BOM) when writing UTF-16 or UTF-32.
> 
> Not a big deal, but I wonder if it would be helpful to tack on "...,
> in which case it outputs big-endian unconditionally." or something.

Sure, I can add that.

> Stray blank line before the closing brace.

Will fix.

> It probably doesn't matter much with these two tiny functions, but I
> was wondering if it would make sense to maintain the &&-chain, perhaps
> like this:
> 
>     if test_have_prereq NO_UTF32_BOM
>     then
>         printf '\x00\x00\xfe\xff'
>     fi &&
>     iconv -f UTF-8 -t UTF-32

Yeah, that sounds like a good idea.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
  2019-02-10  8:04                     ` Torsten Bögershausen
@ 2019-02-10 18:55                       ` brian m. carlson
  2019-02-11 17:14                         ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-10 18:55 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: git, larsxschneider, Rich Felker, Junio C Hamano, Kevin Daudt

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]

On Sun, Feb 10, 2019 at 08:04:13AM +0000, Torsten Bögershausen wrote:
> On Sat, Feb 09, 2019 at 08:08:01PM +0000, brian m. carlson wrote:
> > Preserve the existing behavior for systems which do not have this knob
> > enabled, since they may use optimized implementations, including
> > defaulting to the native endianness, to gain improved performance, which
> > can be significant with large checkouts.
> 
> Is this based on measurements on a real system?

No, I haven't done any performance measurements. However, swapping bytes
is (IIRC) a one-cycle instruction on x86, and it would be executed on each
iteration of the conversion loop. My intuition tells me that will add up to
a significant expense when there are a lot of files, but I can drop that
phrase since I haven't measured it.
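
If anyone wants actual numbers, a rough comparison along these lines would
show what forcing big-endian output costs over the platform's native UTF-16
path. This is only a sketch: the file name and size are made up, and it
measures iconv alone, not a full checkout:

    # Rough benchmark sketch, not a measurement from a real system:
    # generate a large-ish UTF-8 file, then compare the native UTF-16
    # conversion against forced big-endian output.
    dd if=/dev/urandom bs=1048576 count=64 2>/dev/null | base64 >big.utf8
    time iconv -f UTF-8 -t UTF-16   <big.utf8 >/dev/null
    time iconv -f UTF-8 -t UTF-16BE <big.utf8 >/dev/null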

> I think we agree that Git will write UTF-16 always as big endian with BOM,
> following the tradition of iconv/libiconv.
>> If yes, we can reduce the lines of code/#ifdefs somewhat, have the knob always on,
> and reduce the maintenance burden a little bit, giving a simpler patch.

No, I don't think it will. libiconv will always write big-endian, but
glibc has a separate iconv implementation which writes the native
endianness. (I believe FreeBSD's does the same thing as glibc's.) I
think it's useful for us to know that we can handle UTF-16 using the
system behavior where possible, since that's what the system is going to
produce.
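
For anyone who wants to see what their own iconv does, a quick probe (plain
POSIX shell with od; not part of the patch) makes the difference visible:

    # The first two bytes of the output tell the story:
    #   fe ff -> BOM, big-endian      ff fe -> BOM, little-endian
    #   00 61 -> no BOM, big-endian "a" (what musl produces)
    printf a | iconv -f UTF-8 -t UTF-16 | od -An -tx1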

> What do you think?

While I like the simplicity of that approach, and I did consider it
originally, as I mentioned above I'd rather match the behavior of the
system we're operating on, provided it's suitable for our needs.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2] utf8: handle systems that don't write BOM for UTF-16
  2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
  2019-02-10  1:45                     ` Eric Sunshine
  2019-02-10  8:04                     ` Torsten Bögershausen
@ 2019-02-11  0:23                     ` brian m. carlson
  2019-02-11  1:16                       ` Eric Sunshine
  2019-02-11  1:26                     ` [PATCH v3] " brian m. carlson
  2019-02-12  0:52                     ` [PATCH v4] " brian m. carlson
  4 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-11  0:23 UTC (permalink / raw)
  To: git
  Cc: Lars Schneider, Junio C Hamano, Eric Sunshine,
	Torsten Bögershausen, Rich Felker, Kevin Daudt

When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.

Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.

Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.

Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.
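
For illustration, the new prereqs boil down to simple byte counting (this is
just the arithmetic behind the checks in the patch, not additional code):

    # "abc" is three code points: 6 bytes as UTF-16 without a BOM, 8 with
    # one; 12 bytes as UTF-32 without a BOM, 16 with one.
    printf abc | iconv -f UTF-8 -t UTF-16 | wc -c    # 6  => iconv omits the BOM
    printf abc | iconv -f UTF-8 -t UTF-32 | wc -c    # 12 => iconv omits the BOM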

Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile                         |  6 ++++++
 t/t0028-working-tree-encoding.sh | 25 ++++++++++++++++++++++---
 utf8.c                           | 10 ++++++++++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 571160a2c4..b2a4765e5f 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,9 @@ all::
 # Define OLD_ICONV if your library has an old iconv(), where the second
 # (input buffer pointer) parameter is declared with type (const char **).
 #
+# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32.
+#
 # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
 #
 # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1418,9 @@ ifndef NO_ICONV
 		EXTLIBS += $(ICONV_LINK) -liconv
 	endif
 endif
+ifdef ICONV_NEEDS_BOM
+	BASIC_CFLAGS += -DICONV_NEEDS_BOM
+endif
 ifdef NEEDS_LIBGEN
 	EXTLIBS += -lgen
 endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..bfc4a9d4dd 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
 
+test_lazy_prereq NO_UTF16_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+	test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
+	iconv -f UTF-8 -t UTF-16
+
+}
+
+write_utf32 () {
+	test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
@@ -13,8 +32,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
@@ -223,7 +242,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
 
 	text="hallo there!\nroundtrip test here!" &&
 	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	printf "$text" | write_utf16 >roundtrip.utf16 &&
 	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
 
 	# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..133199de0e 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+#ifdef ICONV_NEEDS_BOM
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-32", out_encoding)) {
+		bom_str = utf32_be_bom;
+		bom_len = sizeof(utf32_be_bom);
+		out_encoding = "UTF-32BE";
+#endif
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] utf8: handle systems that don't write BOM for UTF-16
  2019-02-11  0:23                     ` [PATCH v2] " brian m. carlson
@ 2019-02-11  1:16                       ` Eric Sunshine
  2019-02-11  1:20                         ` brian m. carlson
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Sunshine @ 2019-02-11  1:16 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Git List, Lars Schneider, Junio C Hamano,
	Torsten Bögershausen, Rich Felker, Kevin Daudt

On Sun, Feb 10, 2019 at 7:23 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
> [...]
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Premature git-send-email invocation? The commit message of v2 seems to
be a bit different from v1, but the patch itself is identical.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] utf8: handle systems that don't write BOM for UTF-16
  2019-02-11  1:16                       ` Eric Sunshine
@ 2019-02-11  1:20                         ` brian m. carlson
  0 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-11  1:20 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Lars Schneider, Junio C Hamano,
	Torsten Bögershausen, Rich Felker, Kevin Daudt

[-- Attachment #1: Type: text/plain, Size: 843 bytes --]

On Sun, Feb 10, 2019 at 08:16:26PM -0500, Eric Sunshine wrote:
> On Sun, Feb 10, 2019 at 7:23 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > When serializing UTF-16 (and UTF-32), there are three possible ways to
> > write the stream. One can write the data with a BOM in either big-endian
> > or little-endian format, or one can write the data without a BOM in
> > big-endian format.
> > [...]
> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> 
> Premature git-send-email invocation? The commit message of v2 seems to
> be a bit different from v1, but the patch itself is identical.

Oof, I forgot to run "git add -u" before running "git commit --amend".
Thanks for catching this; I'll send out a v3 in a second.
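
(For the record, the recovery sequence is the obvious one; the reroll count
and recipient below are assumptions, not taken from this thread:)

    git add -u
    git commit --amend --no-edit
    git format-patch -v3 -1
    git send-email --to=git@vger.kernel.org v3-0001-*.patch
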
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
  2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
                                       ` (2 preceding siblings ...)
  2019-02-11  0:23                     ` [PATCH v2] " brian m. carlson
@ 2019-02-11  1:26                     ` brian m. carlson
  2019-02-11 21:43                       ` Kevin Daudt
  2019-02-12  0:52                     ` [PATCH v4] " brian m. carlson
  4 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-11  1:26 UTC (permalink / raw)
  To: git
  Cc: Lars Schneider, Junio C Hamano, Eric Sunshine,
	Torsten Bögershausen, Rich Felker, Kevin Daudt

When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.

Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.

Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.

Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.

Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile                         |  7 +++++++
 t/t0028-working-tree-encoding.sh | 30 +++++++++++++++++++++++++++---
 utf8.c                           | 10 ++++++++++
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/Makefile b/Makefile
index 571160a2c4..c6172450af 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,10 @@ all::
 # Define OLD_ICONV if your library has an old iconv(), where the second
 # (input buffer pointer) parameter is declared with type (const char **).
 #
+# Define ICONV_OMITS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
+# big-endian format.
+#
 # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
 #
 # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1419,9 @@ ifndef NO_ICONV
 		EXTLIBS += $(ICONV_LINK) -liconv
 	endif
 endif
+ifdef ICONV_OMITS_BOM
+	BASIC_CFLAGS += -DICONV_OMITS_BOM
+endif
 ifdef NEEDS_LIBGEN
 	EXTLIBS += -lgen
 endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..8936ba6757 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
 
+test_lazy_prereq NO_UTF16_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+	if test_have_prereq NO_UTF16_BOM
+	then
+		printf '\xfe\xff'
+	fi &&
+	iconv -f UTF-8 -t UTF-16
+}
+
+write_utf32 () {
+	if test_have_prereq NO_UTF32_BOM
+	then
+		printf '\x00\x00\xfe\xff'
+	fi &&
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
@@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
@@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
 
 	text="hallo there!\nroundtrip test here!" &&
 	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	printf "$text" | write_utf16 >roundtrip.utf16 &&
 	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
 
 	# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..5d9a917bc8 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+#ifdef ICONV_OMITS_BOM
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-32", out_encoding)) {
+		bom_str = utf32_be_bom;
+		bom_len = sizeof(utf32_be_bom);
+		out_encoding = "UTF-32BE";
+#endif
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
  2019-02-10 18:55                       ` brian m. carlson
@ 2019-02-11 17:14                         ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2019-02-11 17:14 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Torsten Bögershausen, git, larsxschneider, Rich Felker,
	Kevin Daudt

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On Sun, Feb 10, 2019 at 08:04:13AM +0000, Torsten Bögershausen wrote:
>
>> I think we agree that Git will write UTF-16 always as big endian with BOM,
>> following the tradition of iconv/libiconv.
>> If yes, we can reduce the lines of code/#ifdefs somewhat, have the knob always on,
>> and reduce the maintenance burden a little bit, giving a simpler patch.
>
> No, I don't think it will. libiconv will always write big-endian, but
> glibc has a separate iconv implementation which writes the native
> endianness. (I believe FreeBSD's does the same thing as glibc's.) I
> think it's useful for us to know that we can handle UTF-16 using the
> system behavior where possible, since that's what the system is going to
> produce.
>
>> What do you think?
>
> While I like the simplicity of the approach, as I mentioned above, and I
> did consider this originally, I'd rather test the behavior of the system
> we're operating on, provided it's suitable for our needs.

I see both sides of the argument, and each has its merit.

Let's go with the "follow the platform" approach and make sure the decision
is documented somewhere in the resulting code.

Thanks, all.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
  2019-02-11  1:26                     ` [PATCH v3] " brian m. carlson
@ 2019-02-11 21:43                       ` Kevin Daudt
  2019-02-11 23:58                         ` brian m. carlson
  0 siblings, 1 reply; 30+ messages in thread
From: Kevin Daudt @ 2019-02-11 21:43 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Lars Schneider, Junio C Hamano, Eric Sunshine,
	Torsten Bögershausen, Rich Felker

On Mon, Feb 11, 2019 at 01:26:39AM +0000, brian m. carlson wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
> 
> Most systems' iconv implementations choose to write it with a BOM in
> some endianness, since this is the most foolproof, and it is resistant
> to misinterpretation on Windows, where UTF-16 and the little-endian
> serialization are very common. For compatibility with Windows and to
> avoid accidental misuse there, Git always wants to write UTF-16 with a
> BOM, and will refuse to read UTF-16 without it.
> 
> However, musl's iconv implementation writes UTF-16 without a BOM,
> relying on the user to interpret it as big-endian. This causes t0028 and
> the related functionality to fail, since Git won't read the file without
> a BOM.
> 
> Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.
> 
> Update the tests to detect this case and write test data with an added
> BOM if necessary. Always write the BOM in the tests in big-endian
> format, since all iconv implementations that omit a BOM must use
> big-endian serialization according to the Unicode standard.
> 
> Preserve the existing behavior for systems which do not have this knob
> enabled, since they may use optimized implementations, including
> defaulting to the native endianness, which may improve performance.
> 
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  Makefile                         |  7 +++++++
>  t/t0028-working-tree-encoding.sh | 30 +++++++++++++++++++++++++++---
>  utf8.c                           | 10 ++++++++++
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 571160a2c4..c6172450af 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -259,6 +259,10 @@ all::
>  # Define OLD_ICONV if your library has an old iconv(), where the second
>  # (input buffer pointer) parameter is declared with type (const char **).
>  #
> +# Define ICONV_OMITS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
> +# big-endian format.
> +#
>  # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
>  #
>  # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
> @@ -1415,6 +1419,9 @@ ifndef NO_ICONV
>  		EXTLIBS += $(ICONV_LINK) -liconv
>  	endif
>  endif
> +ifdef ICONV_OMITS_BOM
> +	BASIC_CFLAGS += -DICONV_OMITS_BOM
> +endif
>  ifdef NEEDS_LIBGEN
>  	EXTLIBS += -lgen
>  endif
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index e58ecbfc44..8936ba6757 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
>  
>  GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
>  
> +test_lazy_prereq NO_UTF16_BOM '
> +	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> +	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> +	if test_have_prereq NO_UTF16_BOM
> +	then
> +		printf '\xfe\xff'
> +	fi &&
> +	iconv -f UTF-8 -t UTF-16
> +}
> +
> +write_utf32 () {
> +	if test_have_prereq NO_UTF32_BOM
> +	then
> +		printf '\x00\x00\xfe\xff'
> +	fi &&
> +	iconv -f UTF-8 -t UTF-32
> +}
> +
>  test_expect_success 'setup test files' '
>  	git config core.eol lf &&
>  
> @@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
>  	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
>  	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
>  	printf "$text" >test.utf8.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
> +	printf "$text" | write_utf16 >test.utf16.raw &&
> +	printf "$text" | write_utf32 >test.utf32.raw &&
>  	printf "\377\376"                         >test.utf16lebom.raw &&
>  	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
>  
> @@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
>  
>  	text="hallo there!\nroundtrip test here!" &&
>  	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> +	printf "$text" | write_utf16 >roundtrip.utf16 &&
>  	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
>  
>  	# SHIFT-JIS encoded files are round-trip checked by default...
> diff --git a/utf8.c b/utf8.c
> index 83824dc2f4..5d9a917bc8 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
>  		bom_str = utf16_be_bom;
>  		bom_len = sizeof(utf16_be_bom);
>  		out_encoding = "UTF-16BE";
> +#ifdef ICONV_OMITS_BOM
> +	} else if (same_utf_encoding("UTF-16", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";
> +	} else if (same_utf_encoding("UTF-32", out_encoding)) {
> +		bom_str = utf32_be_bom;
> +		bom_len = sizeof(utf32_be_bom);
> +		out_encoding = "UTF-32BE";
> +#endif
>  	}
>  
>  	conv = iconv_open(out_encoding, in_encoding);

With some additional fixes, this indeed does solve the issue for Alpine
Linux, thanks.

I had to fix the following as well:

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 8936ba6757..8491f216aa 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -148,8 +148,8 @@ do
        test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
        test_when_finished "git reset --hard HEAD^" &&

-       cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
-       cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+       cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
+       cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
        cp crlf.utf${i}.raw eol.utf${i} &&

        cat >expectIndexLF <<-EOF &&

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
  2019-02-11 21:43                       ` Kevin Daudt
@ 2019-02-11 23:58                         ` brian m. carlson
  2019-02-12  0:31                           ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-11 23:58 UTC (permalink / raw)
  To: Kevin Daudt, git, Lars Schneider, Junio C Hamano, Eric Sunshine,
	Torsten Bögershausen, Rich Felker

[-- Attachment #1: Type: text/plain, Size: 1070 bytes --]

On Mon, Feb 11, 2019 at 10:43:06PM +0100, Kevin Daudt wrote:
> With some additional fixes, this indeed does solve the issue for Alpine
> Linux, thanks.
> 
> I had to fix the following as well:
> 
> iff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index 8936ba6757..8491f216aa 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -148,8 +148,8 @@ do
>         test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
>         test_when_finished "git reset --hard HEAD^" &&
> 
> -       cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
> -       cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
> +       cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
> +       cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
>         cp crlf.utf${i}.raw eol.utf${i} &&
> 
>         cat >expectIndexLF <<-EOF &&

I'll squash in this fix, thanks.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
  2019-02-11 23:58                         ` brian m. carlson
@ 2019-02-12  0:31                           ` Junio C Hamano
  2019-02-12  0:53                             ` brian m. carlson
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2019-02-12  0:31 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Kevin Daudt, git, Lars Schneider, Eric Sunshine,
	Torsten Bögershausen, Rich Felker

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>> -       cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
>> -       cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
>> +       cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
>> +       cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
>>         cp crlf.utf${i}.raw eol.utf${i} &&
>> 
>>         cat >expectIndexLF <<-EOF &&
>
> I'll squash in this fix, thanks.

Thanks, all.  In the meantime, what I've pushed out has this
applied immediately on top.  Unless there is anything else, I could
squash it in with my next pushout, which I plan to do tonight before
getting ready to tag -rc1 tomorrow.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v4] utf8: handle systems that don't write BOM for UTF-16
  2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
                                       ` (3 preceding siblings ...)
  2019-02-11  1:26                     ` [PATCH v3] " brian m. carlson
@ 2019-02-12  0:52                     ` brian m. carlson
  4 siblings, 0 replies; 30+ messages in thread
From: brian m. carlson @ 2019-02-12  0:52 UTC (permalink / raw)
  To: git
  Cc: Lars Schneider, Junio C Hamano, Eric Sunshine,
	Torsten Bögershausen, Rich Felker, Kevin Daudt

When serializing UTF-16 (and UTF-32), there are three possible ways to
write the stream. One can write the data with a BOM in either big-endian
or little-endian format, or one can write the data without a BOM in
big-endian format.

Most systems' iconv implementations choose to write it with a BOM in
some endianness, since this is the most foolproof, and it is resistant
to misinterpretation on Windows, where UTF-16 and the little-endian
serialization are very common. For compatibility with Windows and to
avoid accidental misuse there, Git always wants to write UTF-16 with a
BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM,
relying on the user to interpret it as big-endian. This causes t0028 and
the related functionality to fail, since Git won't read the file without
a BOM.

Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
iconv implementation has this behavior. When set, Git will write a BOM
manually for UTF-16 and UTF-32 and then force the data to be written in
UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
tests use the raw "UTF-16" encoding, which will be big-endian when the
implementation requires this knob to be set.
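
For illustration, enabling the knob just means giving the make variable any
non-empty value (the config.mak route is the usual way to make it stick):

    # One-off build:
    make ICONV_OMITS_BOM=YesPlease
    # Or persistently, since the Makefile reads config.mak if present:
    echo 'ICONV_OMITS_BOM = YesPlease' >>config.mak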

Update the tests to detect this case and write test data with an added
BOM if necessary. Always write the BOM in the tests in big-endian
format, since all iconv implementations that omit a BOM must use
big-endian serialization according to the Unicode standard.

Preserve the existing behavior for systems which do not have this knob
enabled, since they may use optimized implementations, including
defaulting to the native endianness, which may improve performance.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Makefile                         |  7 +++++++
 t/t0028-working-tree-encoding.sh | 34 +++++++++++++++++++++++++++-----
 utf8.c                           | 14 +++++++++++++
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/Makefile b/Makefile
index 571160a2c4..c6172450af 100644
--- a/Makefile
+++ b/Makefile
@@ -259,6 +259,10 @@ all::
 # Define OLD_ICONV if your library has an old iconv(), where the second
 # (input buffer pointer) parameter is declared with type (const char **).
 #
+# Define ICONV_OMITS_BOM if your iconv implementation does not write a
+# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
+# big-endian format.
+#
 # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
 #
 # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
@@ -1415,6 +1419,9 @@ ifndef NO_ICONV
 		EXTLIBS += $(ICONV_LINK) -liconv
 	endif
 endif
+ifdef ICONV_OMITS_BOM
+	BASIC_CFLAGS += -DICONV_OMITS_BOM
+endif
 ifdef NEEDS_LIBGEN
 	EXTLIBS += -lgen
 endif
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..500229a9bd 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
 
+test_lazy_prereq NO_UTF16_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
+'
+
+test_lazy_prereq NO_UTF32_BOM '
+	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
+'
+
+write_utf16 () {
+	if test_have_prereq NO_UTF16_BOM
+	then
+		printf '\xfe\xff'
+	fi &&
+	iconv -f UTF-8 -t UTF-16
+}
+
+write_utf32 () {
+	if test_have_prereq NO_UTF32_BOM
+	then
+		printf '\x00\x00\xfe\xff'
+	fi &&
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
@@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
@@ -124,8 +148,8 @@ do
 		test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
 		test_when_finished "git reset --hard HEAD^" &&
 
-		cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
-		cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+		cat lf.utf8.raw | write_utf${i} >lf.utf${i}.raw &&
+		cat crlf.utf8.raw | write_utf${i} >crlf.utf${i}.raw &&
 		cp crlf.utf${i}.raw eol.utf${i} &&
 
 		cat >expectIndexLF <<-EOF &&
@@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
 
 	text="hallo there!\nroundtrip test here!" &&
 	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	printf "$text" | write_utf16 >roundtrip.utf16 &&
 	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
 
 	# SHIFT-JIS encoded files are round-trip checked by default...
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..3b42fadffd 100644
--- a/utf8.c
+++ b/utf8.c
@@ -559,6 +559,10 @@ char *reencode_string_len(const char *in, size_t insz,
 	/*
 	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
 	 * Some users under Windows want the little endian version
+	 *
+	 * We handle UTF-16 and UTF-32 ourselves only if the platform does not
+	 * provide a BOM (which we require), since we want to match the behavior
+	 * of the system tools and libc as much as possible.
 	 */
 	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
 		bom_str = utf16_le_bom;
@@ -568,6 +572,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+#ifdef ICONV_OMITS_BOM
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-32", out_encoding)) {
+		bom_str = utf32_be_bom;
+		bom_len = sizeof(utf32_be_bom);
+		out_encoding = "UTF-32BE";
+#endif
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
  2019-02-12  0:31                           ` Junio C Hamano
@ 2019-02-12  0:53                             ` brian m. carlson
  2019-02-12  2:43                               ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-12  0:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Kevin Daudt, git, Lars Schneider, Eric Sunshine,
	Torsten Bögershausen, Rich Felker

[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]

On Mon, Feb 11, 2019 at 04:31:00PM -0800, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >> -       cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
> >> -       cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
> >> +       cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
> >> +       cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
> >>         cp crlf.utf${i}.raw eol.utf${i} &&
> >> 
> >>         cat >expectIndexLF <<-EOF &&
> >
> > I'll squash in this fix, thanks.
> 
> Thanks, all.  In the meantime, what I've pushed out has this
> applied immediately on top.  Unless there is anything else, I could
> squash it in in my next pushout I plan to do tonight, before getting
> ready to tag -rc1 tomorrow.

I've just sent a v4 with this squashed in. Whether you want to pick that
up or squash this into v3 is up to you.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
  2019-02-12  0:53                             ` brian m. carlson
@ 2019-02-12  2:43                               ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2019-02-12  2:43 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Kevin Daudt, git, Lars Schneider, Eric Sunshine,
	Torsten Bögershausen, Rich Felker

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I've just sent a v4 with this squashed in. Whether you want to pick that
> up or squash this into v3 is up to you.

Let's take yours, as there is no point doing an eval for these two;
for that matter, braces around ${i} are also pointless, but I'll let
them pass.
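
(A tiny illustration of why, with made-up values: the shell expands the
variable before it looks up the command name, so the helper is found
without any eval:)

    i=16
    write_utf16 () { iconv -f UTF-8 -t UTF-16; }
    printf abc | write_utf${i} | wc -c   # runs write_utf16 directly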

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-02-12  2:43 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-07 21:59 t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Kevin Daudt
2019-02-08  0:17 ` brian m. carlson
2019-02-08  6:04   ` Rich Felker
2019-02-08 11:45     ` brian m. carlson
2019-02-08 11:55       ` Kevin Daudt
2019-02-08 13:51         ` brian m. carlson
2019-02-08 17:50           ` Junio C Hamano
2019-02-08 20:23             ` Kevin Daudt
2019-02-08 20:42               ` brian m. carlson
2019-02-08 23:12                 ` Junio C Hamano
2019-02-09  0:24                   ` brian m. carlson
2019-02-09 14:57                 ` Kevin Daudt
2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
2019-02-10  1:45                     ` Eric Sunshine
2019-02-10 18:14                       ` brian m. carlson
2019-02-10  8:04                     ` Torsten Bögershausen
2019-02-10 18:55                       ` brian m. carlson
2019-02-11 17:14                         ` Junio C Hamano
2019-02-11  0:23                     ` [PATCH v2] " brian m. carlson
2019-02-11  1:16                       ` Eric Sunshine
2019-02-11  1:20                         ` brian m. carlson
2019-02-11  1:26                     ` [PATCH v3] " brian m. carlson
2019-02-11 21:43                       ` Kevin Daudt
2019-02-11 23:58                         ` brian m. carlson
2019-02-12  0:31                           ` Junio C Hamano
2019-02-12  0:53                             ` brian m. carlson
2019-02-12  2:43                               ` Junio C Hamano
2019-02-12  0:52                     ` [PATCH v4] " brian m. carlson
2019-02-08 16:13         ` t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Rich Felker
2019-02-09  8:09     ` Torsten Bögershausen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).