On Fri, Feb 08, 2019 at 09:23:36PM +0100, Kevin Daudt wrote:
> Firstly, the tests expect iconv -t UTF-16 to output a BOM, which it
> indeed does not do on Alpine. Secondly, git itself also expects the BOM
> to be present when the encoding is set to UTF-16, otherwise it will
> complain.

Yeah, we definitely want to require a BOM for UTF-16. As previously
mentioned, it isn't safe for us to assume big-endian when it's missing.

> I tried changing the test to manually inject a BOM into the file (and
> setting iconv to UTF-16LE / UTF-16BE), which lets the first test go
> through, but test 3 then fails, because git itself outputs the file
> without a BOM, presumably because it's passed through iconv.
>
> So I'm not sure if it's a matter of just fixing the tests.

I think something like the following will likely work in this scenario:

------ %< ---------
From: "brian m. carlson"
Date: Fri, 8 Feb 2019 12:58:11 +0000
Subject: [PATCH] WIP: utf8: handle missing musl UTF-16 BOM

Signed-off-by: brian m. carlson
---
 t/t0028-working-tree-encoding.sh | 20 ++++++++++++++++++--
 utf8.c                           |  4 ++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e58ecbfc44..ff02d03bad 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -6,6 +6,22 @@ test_description='working-tree-encoding conversion via gitattributes'

 GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING

+test_lazy_prereq NO_BOM '
+	printf abc | iconv -f UTF-8 -t UTF-16 &&
+	test $(wc -c) = 6
+'
+
+write_utf16 () {
+	test_have_prereq NO_BOM && printf '\xfe\xff'
+	iconv -f UTF-8 -t UTF-16
+
+}
+
+write_utf32 () {
+	test_have_prereq NO_BOM && printf '\x00\x00\xfe\xff'
+	iconv -f UTF-8 -t UTF-32
+}
+
 test_expect_success 'setup test files' '
 	git config core.eol lf &&

@@ -13,8 +29,8 @@ test_expect_success 'setup test files' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
-	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "$text" | write_utf16 >test.utf16.raw &&
+	printf "$text" | write_utf32 >test.utf32.raw &&
 	printf "\377\376" >test.utf16lebom.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&

diff --git a/utf8.c b/utf8.c
index 83824dc2f4..4aa69cd65b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -568,6 +568,10 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
 	}

 	conv = iconv_open(out_encoding, in_encoding);
------ %< ---------

This passes for me on glibc, but only on a little-endian system. If this
works for musl folks, then I'll add a config option for those people who
have UTF-16 without BOM.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204