git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Kevin Daudt <me@ikke.info>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org, "Lars Schneider" <larsxschneider@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Rich Felker" <dalias@libc.org>
Subject: Re: [PATCH v3] utf8: handle systems that don't write BOM for UTF-16
Date: Mon, 11 Feb 2019 22:43:06 +0100	[thread overview]
Message-ID: <20190211214306.GB14229@alpha> (raw)
In-Reply-To: <20190211012639.579489-1-sandals@crustytoothpaste.net>

On Mon, Feb 11, 2019 at 01:26:39AM +0000, brian m. carlson wrote:
> When serializing UTF-16 (and UTF-32), there are three possible ways to
> write the stream. One can write the data with a BOM in either big-endian
> or little-endian format, or one can write the data without a BOM in
> big-endian format.
> 
> Most systems' iconv implementations choose to write it with a BOM in
> some endianness, since this is the most foolproof, and it is resistant
> to misinterpretation on Windows, where UTF-16 and the little-endian
> serialization are very common. For compatibility with Windows and to
> avoid accidental misuse there, Git always wants to write UTF-16 with a
> BOM, and will refuse to read UTF-16 without it.
> 
> However, musl's iconv implementation writes UTF-16 without a BOM,
> relying on the user to interpret it as big-endian. This causes t0028 and
> the related functionality to fail, since Git won't read the file without
> a BOM.
> 
> Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.
> 
> Update the tests to detect this case and write test data with an added
> BOM if necessary. Always write the BOM in the tests in big-endian
> format, since all iconv implementations that omit a BOM must use
> big-endian serialization according to the Unicode standard.
> 
> Preserve the existing behavior for systems which do not have this knob
> enabled, since they may use optimized implementations, including
> defaulting to the native endianness, which may improve performance.
> 
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  Makefile                         |  7 +++++++
>  t/t0028-working-tree-encoding.sh | 30 +++++++++++++++++++++++++++---
>  utf8.c                           | 10 ++++++++++
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 571160a2c4..c6172450af 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -259,6 +259,10 @@ all::
>  # Define OLD_ICONV if your library has an old iconv(), where the second
>  # (input buffer pointer) parameter is declared with type (const char **).
>  #
> +# Define ICONV_OMITS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
> +# big-endian format.
> +#
>  # Define NO_DEFLATE_BOUND if your zlib does not have deflateBound.
>  #
>  # Define NO_R_TO_GCC_LINKER if your gcc does not like "-R/path/lib"
> @@ -1415,6 +1419,9 @@ ifndef NO_ICONV
>  		EXTLIBS += $(ICONV_LINK) -liconv
>  	endif
>  endif
> +ifdef ICONV_OMITS_BOM
> +	BASIC_CFLAGS += -DICONV_OMITS_BOM
> +endif
>  ifdef NEEDS_LIBGEN
>  	EXTLIBS += -lgen
>  endif
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index e58ecbfc44..8936ba6757 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,30 @@ test_description='working-tree-encoding conversion via gitattributes'
>  
>  GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
>  
> +test_lazy_prereq NO_UTF16_BOM '
> +	test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> +	test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> +	if test_have_prereq NO_UTF16_BOM
> +	then
> +		printf '\xfe\xff'
> +	fi &&
> +	iconv -f UTF-8 -t UTF-16
> +}
> +
> +write_utf32 () {
> +	if test_have_prereq NO_UTF32_BOM
> +	then
> +		printf '\x00\x00\xfe\xff'
> +	fi &&
> +	iconv -f UTF-8 -t UTF-32
> +}
> +
>  test_expect_success 'setup test files' '
>  	git config core.eol lf &&
>  
> @@ -13,8 +37,8 @@ test_expect_success 'setup test files' '
>  	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
>  	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
>  	printf "$text" >test.utf8.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
> +	printf "$text" | write_utf16 >test.utf16.raw &&
> +	printf "$text" | write_utf32 >test.utf32.raw &&
>  	printf "\377\376"                         >test.utf16lebom.raw &&
>  	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
>  
> @@ -223,7 +247,7 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
>  
>  	text="hallo there!\nroundtrip test here!" &&
>  	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> -	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> +	printf "$text" | write_utf16 >roundtrip.utf16 &&
>  	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
>  
>  	# SHIFT-JIS encoded files are round-trip checked by default...
> diff --git a/utf8.c b/utf8.c
> index 83824dc2f4..5d9a917bc8 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -568,6 +568,16 @@ char *reencode_string_len(const char *in, size_t insz,
>  		bom_str = utf16_be_bom;
>  		bom_len = sizeof(utf16_be_bom);
>  		out_encoding = "UTF-16BE";
> +#ifdef ICONV_OMITS_BOM
> +	} else if (same_utf_encoding("UTF-16", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";
> +	} else if (same_utf_encoding("UTF-32", out_encoding)) {
> +		bom_str = utf32_be_bom;
> +		bom_len = sizeof(utf32_be_bom);
> +		out_encoding = "UTF-32BE";
> +#endif
>  	}
>  
>  	conv = iconv_open(out_encoding, in_encoding);

With some additional fixes, this indeed does solve the issue for Alpine
Linux, thanks.

I had to fix the following as well:

iff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 8936ba6757..8491f216aa 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -148,8 +148,8 @@ do
        test_when_finished "rm -f crlf.utf${i}.raw lf.utf${i}.raw" &&
        test_when_finished "git reset --hard HEAD^" &&

-       cat lf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >lf.utf${i}.raw &&
-       cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-${i} >crlf.utf${i}.raw &&
+       cat lf.utf8.raw | eval "write_utf${i}" >lf.utf${i}.raw &&
+       cat crlf.utf8.raw | eval "write_utf${i}" >crlf.utf${i}.raw &&
        cp crlf.utf${i}.raw eol.utf${i} &&

        cat >expectIndexLF <<-EOF &&

  reply	other threads:[~2019-02-11 21:43 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-07 21:59 t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Kevin Daudt
2019-02-08  0:17 ` brian m. carlson
2019-02-08  6:04   ` Rich Felker
2019-02-08 11:45     ` brian m. carlson
2019-02-08 11:55       ` Kevin Daudt
2019-02-08 13:51         ` brian m. carlson
2019-02-08 17:50           ` Junio C Hamano
2019-02-08 20:23             ` Kevin Daudt
2019-02-08 20:42               ` brian m. carlson
2019-02-08 23:12                 ` Junio C Hamano
2019-02-09  0:24                   ` brian m. carlson
2019-02-09 14:57                 ` Kevin Daudt
2019-02-09 20:08                   ` [PATCH] utf8: handle systems that don't write BOM for UTF-16 brian m. carlson
2019-02-10  1:45                     ` Eric Sunshine
2019-02-10 18:14                       ` brian m. carlson
2019-02-10  8:04                     ` Torsten Bögershausen
2019-02-10 18:55                       ` brian m. carlson
2019-02-11 17:14                         ` Junio C Hamano
2019-02-11  0:23                     ` [PATCH v2] " brian m. carlson
2019-02-11  1:16                       ` Eric Sunshine
2019-02-11  1:20                         ` brian m. carlson
2019-02-11  1:26                     ` [PATCH v3] " brian m. carlson
2019-02-11 21:43                       ` Kevin Daudt [this message]
2019-02-11 23:58                         ` brian m. carlson
2019-02-12  0:31                           ` Junio C Hamano
2019-02-12  0:53                             ` brian m. carlson
2019-02-12  2:43                               ` Junio C Hamano
2019-02-12  0:52                     ` [PATCH v4] " brian m. carlson
2019-02-08 16:13         ` t0028-working-tree-encoding.sh failing on musl based systems (Alpine Linux) Rich Felker
2019-02-09  8:09     ` Torsten Bögershausen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190211214306.GB14229@alpha \
    --to=me@ikke.info \
    --cc=dalias@libc.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=larsxschneider@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=sunshine@sunshineco.com \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).