On Fri, Feb 08, 2019 at 09:23:36PM +0100, Kevin Daudt wrote:
> Firstly, the tests expect iconv -t UTF-16 to output a BOM, which it
> indeed does not do on Alpine. Secondly, git itself also expects the BOM
> to be present when the encoding is set to UTF-16, otherwise it will
> complain.

Yeah, we definitely want to require a BOM for UTF-16. As previously
mentioned, it isn't safe for us to assume big-endian when it's missing.
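
For anyone who wants to check what their own iconv does, here is a rough
sketch that classifies the first two bytes of a stream. The helper name
bom_kind is made up for illustration and is not part of git or its test
suite.

```shell
# bom_kind: hypothetical helper, not part of git.
# Reads stdin and reports which UTF-16 BOM, if any, starts the stream.
bom_kind () {
	# First two bytes as lowercase hex with all whitespace removed.
	head=$(od -An -tx1 -N2 | tr -d ' \n')
	case "$head" in
	feff) echo big-endian ;;
	fffe) echo little-endian ;;
	*)    echo none ;;
	esac
}
```

Running "printf abc | iconv -f UTF-8 -t UTF-16 | bom_kind" should then
print "none" on a system like the one described above, where iconv emits
no BOM.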

> I tried changing the test to manually inject a BOM into the file (and
> setting iconv to UTF-16LE / UTF-16BE), which lets the first test go
> through, but test 3 then fails, because git itself outputs the file
> without a BOM, presumably because it's passed through iconv.
> 
> So I'm not sure if it's a matter of just fixing the tests.

I think something like the following will likely work in this scenario:

------ %< ---------
From: "brian m. carlson" <sandals@crustytoothpaste.net>
Date: Fri, 8 Feb 2019 12:58:11 +0000
Subject: [PATCH] WIP: utf8: handle missing musl UTF-16 BOM

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 t/t0028-working-tree-encoding.sh | 20 ++++++++++++++++++--
 utf8.c                           |  4 ++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..ff02d03bad 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,22 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
 
+test_lazy_prereq NO_BOM '
+	printf abc | iconv -f UTF-8 -t UTF-16 >bom-probe &&
+	test $(wc -c <bom-probe) = 6
+'
+
+write_utf16 () {
+	test_have_prereq NO_BOM && printf '\376\377'
+	iconv -f UTF-8 -t UTF-16
+
+}
+
+write_utf32 () {
+	test_have_prereq NO_BOM && printf '\000\000\376\377'
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&
 
@@ -13,8 +29,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376"                         >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
diff --git a/utf8.c b/utf8.c
index 83824dc2f4..4aa69cd65b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,10 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);
------ %< ---------

This passes for me on glibc, but only on a little-endian system. If this
works for musl folks, then I'll add a config option for those people who
have UTF-16 without BOM.
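
As I read the utf8.c hunk, a request for plain "UTF-16" ends up as a
little-endian BOM followed by UTF-16LE data. A minimal shell sketch of
that byte layout, assuming the local iconv knows the name UTF-16LE (the
function name is made up for illustration):

```shell
# emit_utf16le_with_bom: hypothetical helper, not part of git.
# Mimics the byte layout the utf8.c hunk appears to produce for "UTF-16":
# a little-endian BOM followed by the UTF-16LE conversion of stdin.
emit_utf16le_with_bom () {
	printf '\377\376' &&
	iconv -f UTF-8 -t UTF-16LE
}
```

That fixed byte order is presumably also why the test only passes on
little-endian: the raw files are produced by the system iconv, and they
only match git's output when iconv picks the same order.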
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204