git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: tboegi@web.de
Cc: git@vger.kernel.org, adrigibal@gmail.com
Subject: Re: [PATCH v2 1/1] Support working-tree-encoding "UTF-16LE-BOM"
Date: Tue, 22 Jan 2019 12:13:48 -0800	[thread overview]
Message-ID: <xmqqk1iwfo1v.fsf@gitster-ct.c.googlers.com> (raw)
In-Reply-To: <20190120164327.3234-1-tboegi@web.de> (tboegi's message of "Sun, 20 Jan 2019 17:43:27 +0100")

tboegi@web.de writes:

> The unicode standard itself defines 3 possible ways how to encode UTF-16.
> a) UTF-16, without BOM, big endian:
> b) UTF-16, with BOM, little endian:
> c) UTF-16, with BOM, big endian:

Is it OK to interpret "possible" as "allowed" above?

> iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
>
> d) UTF-16
> $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> 0000000  376 377  \0   g  \0   i  \0   t

So among three, encoder can only do "big endian with BOM" (c).

Lack of (a) "big endian without BOM" in the encoder is not a problem
in practice, as you can ask UTF-16BE to produce the stream, tell the
decoder that you have UTF-16 and the lack of the BOM would make the
decoder take it as (a).

But lack of (b) "little endian with BOM" is a problem.

So the proposal is to invent UTF-16-[BL]E-BOM that prepends BOM in
front of UTF-16-[BL]E output to allow those who want (b).

Which makes sense, I guess.  I do find it a bit ugly in the sense
that it is something iconv should learn to do, as the issue is
shared with all applications that want to use libiconv and convert
into UTF-16.

Do you add UTF-16-BE-BOM for consistency?  It would be identical to
telling iconv to encode to UTF-16, if I understood your problem
description correctly.

> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index b8392fc330..4a88ab8be7 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
>  ------------------------
>
>  Use the following attributes if your '*.ps1' files are UTF-16 little
> -endian encoded without BOM and you want Git to use Windows line endings
> +endian encoded with BOM and you want Git to use Windows line endings
>  in the working directory. Please note, it is highly recommended to
>  explicitly define the line endings with `eol` if the `working-tree-encoding`
>  attribute is used to avoid ambiguity.
>
>  ------------------------
> -*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
> +*.ps1		text working-tree-encoding=UTF-16LE-BOM eol=CRLF
>  ------------------------

This change is robbing from those who do want a file without BOM to
give to those who do want a file with BOM.  Are the latter class of
people the majority of the intended readers (read: Windows folks)?

I wonder if the following, instead of the above hunk, would work better:

 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory. Please note, it is highly recommended to
+in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+you want UTF-16 little endian with BOM).
+Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`

> @@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
>  {
>  	iconv_t conv;
>  	char *out;
> +	const char *bom_str = NULL;
> +	size_t bom_len = 0;
>
>  	if (!in_encoding)
>  		return NULL;
>
> +	/* UTF-16LE-BOM is the same as UTF-16 for reading */
> +	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
> +		in_encoding = "UTF-16";
> +
> +	/*
> +	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
> +	 * Some users under Windows want the little endian version
> +	 */
> +	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
> +		bom_str = utf16_le_bom;
> +		bom_len = sizeof(utf16_le_bom);
> +		out_encoding = "UTF-16LE";
> +	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";

OK, you do allow BE-BOM and the code does not rely on the fact that
iconv happens to produce it with "UTF-16", because the library is
free to switch between the three possible output (a)-(c) and we do
not want to get affected by such a switch.  Makes sense.


  reply	other threads:[~2019-01-22 20:13 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
2018-11-04 15:47 ` brian m. carlson
2018-11-04 16:37   ` Adrián Gimeno Balaguer
2018-11-04 18:38     ` brian m. carlson
2018-11-04 17:07 ` Torsten Bögershausen
2018-11-05  4:24   ` Adrián Gimeno Balaguer
2018-11-05 18:10     ` Torsten Bögershausen
2018-11-06 20:16       ` Torsten Bögershausen
2018-11-07  4:38         ` Adrián Gimeno Balaguer
2018-11-08 17:02           ` Torsten Bögershausen
2018-12-26  0:56             ` Alexandre Grigoriev
2018-12-26 19:25               ` brian m. carlson
2018-12-27  2:52                 ` Alexandre Grigoriev
2018-12-27 14:45                   ` Torsten Bögershausen
2018-12-23 14:46   ` Alexandre Grigoriev
2018-12-29 11:09 ` [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM" tboegi
     [not found]   ` <CADN+U_OccLuLN7_0rjikDgLT+Zvt8hka-=xsnVVLJORjYzP78Q@mail.gmail.com>
2018-12-29 15:48     ` Adrián Gimeno Balaguer
2018-12-29 17:54       ` Philip Oakley
2019-01-20 16:43 ` [PATCH v2 " tboegi
2019-01-22 20:13   ` Junio C Hamano [this message]
2019-01-30 15:01 ` [PATCH v3 " tboegi
2019-01-30 15:24   ` Jason Pyeron
2019-01-30 17:49     ` Torsten Bögershausen
2019-03-06  5:23 ` [PATCH v1 1/1] gitattributes.txt: fix typo tboegi
2019-03-07  0:24   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqk1iwfo1v.fsf@gitster-ct.c.googlers.com \
    --to=gitster@pobox.com \
    --cc=adrigibal@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).