git@vger.kernel.org mailing list mirror (one of many)
* git-rebase is ignoring working-tree-encoding
@ 2018-11-02  2:30 Adrián Gimeno Balaguer
  2018-11-04 15:47 ` brian m. carlson
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Adrián Gimeno Balaguer @ 2018-11-02  2:30 UTC (permalink / raw)
  To: git

I’m attempting to perform fixups via git-rebase of UTF-16 LE files
(the project I’m working on requires that exact encoding on certain
files). When the rebase is complete, Git changes that file’s encoding
to UTF-16 BE. I have been using the newer working-tree-encoding
attribute in .gitattributes. I’m using Git for Windows.

$ git version
git version 2.19.1.windows.1

Here is a sample UTF-16 LE file (with BOM and LF endings) with the
following attributes in .gitattributes:

test.txt eol=lf -text working-tree-encoding=UTF-16

I put eol=lf and -text to tell Git not to change the encoding of the
file on checkout, but that doesn’t help either. Besides, the newer
working-tree-encoding attribute lets me view human-readable diffs of
that file (in GitHub Desktop and Git Bash). Note that making, for
example, consecutive commits to the same file does not affect the
UTF-16 LE encoding. Before I discovered this attribute, things were
even worse when squash/fixup rebasing: Git would fill the file with
Chinese characters (when I manually set it as text via
.gitattributes).

So, again, the problem with the .gitattributes line shown above is that
after a fixup rebase, the encoding of UTF-16 LE files changes to UTF-16 BE.

For a long time, I have been working with the affected UTF-16 LE files
set as binary via .gitattributes (e.g. “test.txt binary”) so that Git
would not modify the file encoding, but this doesn’t let me view diffs
of changes in GitHub Desktop (or via git diff), which I want.


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
@ 2018-11-04 15:47 ` brian m. carlson
  2018-11-04 16:37   ` Adrián Gimeno Balaguer
  2018-11-04 17:07 ` Torsten Bögershausen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: brian m. carlson @ 2018-11-04 15:47 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer; +Cc: git


On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
> (the project I’m working on requires that exact encoding on certain
> files). When the rebase is complete, Git changes that file’s encoding
> to UTF-16 BE. I have been using the newer working-tree-encoding
> attribute in .gitattributes. I’m using Git for Windows.
> 
> $ git version
> git version 2.19.1.windows.1
> 
> Here is a sample UTF-16 LE file (with BOM and LF endings) with the
> following attributes in .gitattributes:
> 
> test.txt eol=lf -text working-tree-encoding=UTF-16

Do things work for you if you write this as "UTF-16LE"?  When you use
working-tree-encoding, the file is stored internally as UTF-8, but it's
serialized to the specified encoding when written out.

Asking for "UTF-16" is ambiguous: there are two endiannesses, and so as
long as you get a BOM in the output, either one is an acceptable option.
Which one you get is dependent on what the underlying code thinks is the
default, and traditionally for Unix systems and Unix tools that's been
big-endian.  If you want a particular endianness, you should specify it.
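
A minimal sketch of that ambiguity, in Python purely for illustration (this is not Git's own conversion code). It assumes only the standard `codecs` module; the sample text and the variable names are made up for the example.

```python
import codecs

text = "hi"

# Explicit endianness: Python's codecs write no BOM for these labels.
le = text.encode("utf-16-le")   # bytes: 68 00 69 00
be = text.encode("utf-16-be")   # bytes: 00 68 00 69

# "UTF-16 LE with BOM" therefore has to be assembled by hand.
le_with_bom = codecs.BOM_UTF16_LE + le   # FF FE 68 00 69 00
be_with_bom = codecs.BOM_UTF16_BE + be   # FE FF 00 68 00 69

# Plain "utf-16" leaves the byte order to the encoder, which announces
# its choice with a BOM. Either result is valid "UTF-16".
ambiguous = text.encode("utf-16")
assert ambiguous in (le_with_bom, be_with_bom)
```

With an explicit label the byte order is fixed but no BOM is written; with plain "utf-16" a BOM is written but the byte order is the encoder's choice.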
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-04 15:47 ` brian m. carlson
@ 2018-11-04 16:37   ` Adrián Gimeno Balaguer
  2018-11-04 18:38     ` brian m. carlson
  0 siblings, 1 reply; 25+ messages in thread
From: Adrián Gimeno Balaguer @ 2018-11-04 16:37 UTC (permalink / raw)
  To: sandals, git

On Sun, 4 Nov 2018 at 16:48, brian m. carlson
(<sandals@crustytoothpaste.net>) wrote:
> Do things work for you if you write this as "UTF-16LE"?  When you use
> working-tree-encoding, the file is stored internally as UTF-8, but it's
> serialized to the specified encoding when written out.

When I use UTF-16LE or UTF-16BE, I can't commit or view diffs of the
specified files, because Git prohibits a BOM in these cases and shows
an error when I attempt to commit. But the project also requires the
BOM. I even experimented with fixing this within Git's source. It turns
out that Git follows a Unicode rule which says that a BOM is not
permitted when the exact UTF-16BE/UTF-16LE (and UTF-32 variant) MIME
encoding types are declared:

https://github.com/git/git/blob/master/utf8.h#L87

> Asking for "UTF-16" is ambiguous: there are two endiannesses, and so as
> long as you get a BOM in the output, either one is an acceptable option.
> Which one you get is dependent on what the underlying code thinks is the
> default, and traditionally for Unix systems and Unix tools that's been
> big-endian.  If you want a particular endianness, you should specify it.

I wrote an easy "counterpart" fix which instead only prohibits a BOM of
the opposite endianness (for example, with
working-tree-encoding=UTF-16LE, finding a UTF-16BE BOM in the file
would make Git signal the error right before committing, diffing,
etc.). That way the user would be encouraged to change the file's
encoding to match the one specified in working-tree-encoding before
these actions are allowed, which prevents Git from encoding to the
wrong endianness when the file is written out. In a few repository
tests, this new behaviour worked as expected. But then I realized this
solution might be unacceptable for Git's source code, as it would
violate that Unicode standard. Anyway, here is a PR in my Git fork
with the changes I made, for reference:

https://github.com/AdRiAnIlloO/git/pull/1
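
The difference between the two rules can be sketched as follows. This is an illustration in Python, not the C code from the PR; the function names `strict_rule_rejects` and `relaxed_rule_rejects` are hypothetical, and only the UTF-16 labels are modelled.

```python
import codecs

# BOM expected for each explicitly-ordered label (UTF-32 is left out).
BOMS = {
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-16BE": codecs.BOM_UTF16_BE,
}


def has_any_utf16_bom(data: bytes) -> bool:
    """True if the data starts with either UTF-16 BOM."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def strict_rule_rejects(declared: str, data: bytes) -> bool:
    """Behaviour described above: any BOM is refused for UTF-16LE/BE."""
    return declared in BOMS and has_any_utf16_bom(data)


def relaxed_rule_rejects(declared: str, data: bytes) -> bool:
    """Proposed behaviour: refuse only a BOM of the opposite endianness."""
    if declared not in BOMS:
        return False
    return has_any_utf16_bom(data) and not data.startswith(BOMS[declared])
```

Under the relaxed rule, a little-endian file that starts with FF FE passes when UTF-16LE is declared, while the same file starting with FE FF is still refused.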

At this point, the solution I recently came up with for my project was
to write some shell code that fixes the endianness of the files Git
re-encodes to UTF-16BE during its write-out process (a "working tree
refresh", in my own words), within the same script that I use to pack
assets, including the localization files.
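
The shell code itself is not shown in this message. A rough equivalent of such a fixup step might look like the sketch below, written in Python for illustration; the function name `to_utf16le_with_bom` is hypothetical, and the input is assumed to be UTF-16 with a BOM in either byte order.

```python
import codecs


def to_utf16le_with_bom(data: bytes) -> bytes:
    """Re-encode BOM-prefixed UTF-16 data as UTF-16LE with a BOM."""
    if not data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError("expected a UTF-16 BOM")
    # The "utf-16" decoder reads the BOM, picks the byte order from it,
    # and drops it from the decoded text.
    text = data.decode("utf-16")
    # "utf-16-le" writes no BOM, so it is prepended explicitly.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

In a real script the bytes would be read from and written back to each affected file after checkout; applying the function to a file that is already little-endian leaves it unchanged.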

> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204



-- 
Adrián


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
  2018-11-04 15:47 ` brian m. carlson
@ 2018-11-04 17:07 ` Torsten Bögershausen
  2018-11-05  4:24   ` Adrián Gimeno Balaguer
  2018-12-23 14:46   ` Alexandre Grigoriev
  2018-12-29 11:09 ` [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM" tboegi
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Torsten Bögershausen @ 2018-11-04 17:07 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer; +Cc: git

On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
> (the project I’m working on requires that exact encoding on certain
> files). When the rebase is complete, Git changes that file’s encoding
> to UTF-16 BE. I have been using the newer working-tree-encoding
> attribute in .gitattributes. I’m using Git for Windows.
> 
> $ git version
> git version 2.19.1.windows.1
> 
> Here is a sample UTF-16 LE file (with BOM and LF endings) with the
> following attributes in .gitattributes:
> 
> test.txt eol=lf -text working-tree-encoding=UTF-16
> 
> I put eol=lf and -text to tell Git not to change the encoding of the
> file on checkout, but that doesn’t help either. Besides, the newer
> working-tree-encoding attribute lets me view human-readable diffs of
> that file (in GitHub Desktop and Git Bash). Note that making, for
> example, consecutive commits to the same file does not affect the
> UTF-16 LE encoding. Before I discovered this attribute, things were
> even worse when squash/fixup rebasing: Git would fill the file with
> Chinese characters (when I manually set it as text via
> .gitattributes).
> 
> So, again, the problem with the .gitattributes line shown above is that
> after a fixup rebase, the encoding of UTF-16 LE files changes to UTF-16 BE.
> 
> For a long time, I have been working with the affected UTF-16 LE files
> set as binary via .gitattributes (e.g. “test.txt binary”) so that Git
> would not modify the file encoding, but this doesn’t let me view diffs
> of changes in GitHub Desktop (or via git diff), which I want.

Thanks for the report.
I have tried to follow the problem from your description
(and the PR), but I must admit that I don't fully understand the
problem (yet).

Could you try to write some instructions on how to reproduce it?
A number of shell instructions would be great,
ideally as some kind of "test case", like the tests in
the t/ directory in Git.

It would be nice to be able to reproduce it
and, if there is a bug, to get it fixed.


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-04 16:37   ` Adrián Gimeno Balaguer
@ 2018-11-04 18:38     ` brian m. carlson
  0 siblings, 0 replies; 25+ messages in thread
From: brian m. carlson @ 2018-11-04 18:38 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer; +Cc: git


On Sun, Nov 04, 2018 at 05:37:09PM +0100, Adrián Gimeno Balaguer wrote:
> I wrote an easy "counterpart" fix which instead only prohibits a BOM of
> the opposite endianness (for example, with
> working-tree-encoding=UTF-16LE, finding a UTF-16BE BOM in the file
> would make Git signal the error right before committing, diffing,
> etc.). That way the user would be encouraged to change the file's
> encoding to match the one specified in working-tree-encoding before
> these actions are allowed, which prevents Git from encoding to the
> wrong endianness when the file is written out. In a few repository
> tests, this new behaviour worked as expected. But then I realized this
> solution might be unacceptable for Git's source code, as it would
> violate that Unicode standard. Anyway, here is a PR in my Git fork
> with the changes I made, for reference:

I actually think such a solution (although I haven't looked at your
patch) would be fine, and I would encourage you to send it to the list.
It's my understanding that many people on Windows want to write things
in UTF-16 encoding but only little-endian with a BOM.  Allowing them to
write that, even if Git won't be able to guarantee producing that, would
be fine, as long as the data is what we expect.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-04 17:07 ` Torsten Bögershausen
@ 2018-11-05  4:24   ` Adrián Gimeno Balaguer
  2018-11-05 18:10     ` Torsten Bögershausen
  2018-12-23 14:46   ` Alexandre Grigoriev
  1 sibling, 1 reply; 25+ messages in thread
From: Adrián Gimeno Balaguer @ 2018-11-05  4:24 UTC (permalink / raw)
  To: tboegi; +Cc: git

On Sun, 4 Nov 2018 at 18:07, Torsten Bögershausen
(<tboegi@web.de>) wrote:
>
> Thanks for the report.
> I have tried to follow the problem from your verbal descriptions
> (and the PR) but I need to admit that I don't fully understand the
> problem (yet).

I have created a PR in Git's repository. You can read an updated
description there:

https://github.com/git/git/pull/550

> Could you try to write some instructions on how to reproduce it?
> A number of shell instructions would be great,
> ideally as some kind of "test case", like the tests in
> the t/ directory in Git.
>
> It would be nice to be able to re-produce it.
> And if there is a bug, to get it fixed.

This is covered in the PR mentioned above. Thanks for the feedback.

-- 
Adrián


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-05  4:24   ` Adrián Gimeno Balaguer
@ 2018-11-05 18:10     ` Torsten Bögershausen
  2018-11-06 20:16       ` Torsten Bögershausen
  0 siblings, 1 reply; 25+ messages in thread
From: Torsten Bögershausen @ 2018-11-05 18:10 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer; +Cc: git

On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:

[]

> https://github.com/git/git/pull/550
 
[]
 
> This is covered in the mentioned PR above. Thanks for feedback.

Thanks for the code.
I will have a look in the next few days.

> 
> -- 
> Adrián


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-05 18:10     ` Torsten Bögershausen
@ 2018-11-06 20:16       ` Torsten Bögershausen
  2018-11-07  4:38         ` Adrián Gimeno Balaguer
  0 siblings, 1 reply; 25+ messages in thread
From: Torsten Bögershausen @ 2018-11-06 20:16 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer; +Cc: git

On Mon, Nov 05, 2018 at 07:10:14PM +0100, Torsten Bögershausen wrote:
> On Mon, Nov 05, 2018 at 05:24:39AM +0100, Adrián Gimeno Balaguer wrote:
> 
> []
> 
> > https://github.com/git/git/pull/550
>  
> []
>  
> > This is covered in the mentioned PR above. Thanks for feedback.
> 
> Thanks for the code.
> I will have a look in the next few days.
> 
> > 
> > -- 
> > Adrián

Hi Adrián,

I still haven't managed to fully understand your problem.
I tried to rewrite your test according to my understanding.
It can be fetched here (or copied from this message, see below):

https://github.com/tboegi/git/tree/tb.181106_UTF16LE_commit

Committing an empty file seems to work for me. The initial
report mentioned a "rebase", which is not in the test case?

Is the following what you intended to test?

#!/bin/sh
test_description='UTF-16 LE/BE file encoding using working-tree-encoding'


. ./test-lib.sh

# We specify the UTF-16LE BOM manually, to not depend on programs such as iconv.
utf16leBOM=$(printf '\377\376')

test_expect_success 'Stage empty UTF-16LE file as binary' '
	>empty_0.txt &&
	echo "empty_0.txt binary" >>.gitattributes &&
	git add empty_0.txt
'


test_expect_success 'Stage empty file with enc=UTF-16BE' '
	>utf16le_0.txt &&
	echo "utf16le_0.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
	git add utf16le_0.txt
'


test_expect_success 'Create and stage UTF-16LE file with only BOM' '
	printf "$utf16leBOM" >utf16le_1.txt &&
	echo "utf16le_1.txt text working-tree-encoding=UTF-16" >>.gitattributes &&
	git add utf16le_1.txt
'

test_expect_success 'Do not stage UTF-16LE file with only BOM with enc=UTF-16BE' '
	printf "$utf16leBOM" >utf16le_2.txt &&
	echo "utf16le_2.txt text working-tree-encoding=UTF-16BE" >>.gitattributes &&
	test_must_fail git add utf16le_2.txt
'

test_expect_success 'commit all files' '
	test_tick &&
	git commit -m "Commit all 3 files"
'

test_expect_success 'All committed files have the same sha' '
	git ls-files -s --eol >tmp1 &&
	sed -e "s!	i/none.*!!" <tmp1 | uniq -u >actual &&
	>expect &&
	test_cmp expect actual
'

test_done


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-06 20:16       ` Torsten Bögershausen
@ 2018-11-07  4:38         ` Adrián Gimeno Balaguer
  2018-11-08 17:02           ` Torsten Bögershausen
  0 siblings, 1 reply; 25+ messages in thread
From: Adrián Gimeno Balaguer @ 2018-11-07  4:38 UTC (permalink / raw)
  To: tboegi; +Cc: git

Hello Torsten,

Thanks for answering.

To answer your question: I removed the comments mentioning "rebase"
because the encoding issue I reported also happens with simpler
operations (described in the PR), and the problem is not directly
related to rebasing, so I thought it better to avoid unrelated
confusion.

Let's get back to the problem. Each system has a default endianness.
Also, with working-tree-encoding in .gitattributes, Git behaves
differently depending on the attribute's value and the contents of the
referenced file. When I set the value to "UTF-16", the file must have
a BOM, or Git complains. If instead I set the value to "UTF-16BE" or
"UTF-16LE", Git prohibits operations when the file has a BOM of that
base encoding (UTF-16 here), whichever endianness the BOM indicates.

My initial goal was, given a UTF-16LE file, to be able to view
human-readable diffs whenever I change it (and yes, it must be
Little Endian). In addition, this file has a BOM. Now, what are the
options with Git currently (considering only working-tree-encoding)?
If I set working-tree-encoding=UTF-16, I can view readable diffs and
commit the file, but here is the main problem: Git loses the
information about which endianness the file originally had. Therefore,
after staging/committing, it re-encodes the file from UTF-8 (as stored
internally) to UTF-16 with the default system endianness. In my case
that was Big Endian, which breaks the project's requirement. That is
why I ended up writing a fixup script to change the encoding back to
UTF-16LE.
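
The loss of endianness in that round trip can be sketched as follows. This is a simplified model in Python, not Git's implementation: `stage` and `check_out` are hypothetical stand-ins for the conversions Git performs, and the big-endian default in `check_out` is an assumption that mirrors the behaviour reported here.

```python
import codecs


def stage(working_tree_bytes: bytes) -> bytes:
    """Hypothetical stand-in for staging: UTF-16 with BOM in, UTF-8 out."""
    return working_tree_bytes.decode("utf-16").encode("utf-8")


def check_out(blob: bytes) -> bytes:
    """Hypothetical stand-in for checkout with a converter that
    defaults to big-endian when asked for plain "UTF-16"."""
    return codecs.BOM_UTF16_BE + blob.decode("utf-8").encode("utf-16-be")


original = codecs.BOM_UTF16_LE + "hola".encode("utf-16-le")
blob = stage(original)        # the stored UTF-8 keeps no trace of the byte order
restored = check_out(blob)    # same text, but now big-endian on disk
```

The text survives the round trip, but the byte order does not, because nothing in the stored UTF-8 records which endianness the working-tree file had.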

On the other hand, once I set working-tree-encoding=UTF-16LE, Git
prohibited me from committing the file and even from viewing
human-readable diffs (the output simply says it's a binary file). The
internal location of these errors is the function in utf8.c that I
changed in the PR. I hope this is clearer!

Finally, Git's behaviour here is based on Unicode standards, which is
why I acknowledged that my changes violate them, after following a
link which is present in the utf8.h file.



-- 
Adrián


* Re: git-rebase is ignoring working-tree-encoding
  2018-11-07  4:38         ` Adrián Gimeno Balaguer
@ 2018-11-08 17:02           ` Torsten Bögershausen
  2018-12-26  0:56             ` Alexandre Grigoriev
  0 siblings, 1 reply; 25+ messages in thread
From: Torsten Bögershausen @ 2018-11-08 17:02 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer; +Cc: git

On Wed, Nov 07, 2018 at 05:38:18AM +0100, Adrián Gimeno Balaguer wrote:
> Hello Torsten,
> 
> Thanks for answering.
> 
> To answer your question: I removed the comments mentioning "rebase"
> because the encoding issue I reported also happens with simpler
> operations (described in the PR), and the problem is not directly
> related to rebasing, so I thought it better to avoid unrelated
> confusion.
> 
> Let's get back to the problem. Each system has a default endianness.
> Also, with working-tree-encoding in .gitattributes, Git behaves
> differently depending on the attribute's value and the contents of the
> referenced file. When I set the value to "UTF-16", the file must have
> a BOM, or Git complains. If instead I set the value to "UTF-16BE" or
> "UTF-16LE", Git prohibits operations when the file has a BOM of that
> base encoding (UTF-16 here), whichever endianness the BOM indicates.
> 
> My initial goal was, given a UTF-16LE file, to be able to view
> human-readable diffs whenever I change it (and yes, it must be
> Little Endian). In addition, this file has a BOM. Now, what are the
> options with Git currently (considering only working-tree-encoding)?
> If I set working-tree-encoding=UTF-16, I can view readable diffs and
> commit the file, but here is the main problem: Git loses the
> information about which endianness the file originally had. Therefore,
> after staging/committing, it re-encodes the file from UTF-8 (as stored
> internally) to UTF-16 with the default system endianness. In my case
> that was Big Endian, which breaks the project's requirement. That is
> why I ended up writing a fixup script to change the encoding back to
> UTF-16LE.

OK, I think I understand your problem now.
The file format you are asking for could be named "UTF-16-BOM-LE",
but that does not exist in reality.
If you use UTF-16, then there must be a BOM, and if there is a BOM,
then a Unicode-aware application -should- be able to handle it.

Why does your project require such a format?

> 
> On the other hand, once I set working-tree-encoding=UTF-16LE, Git
> prohibited me from committing the file and even from viewing
> human-readable diffs (the output simply says it's a binary file). The
> internal location of these errors is the function in utf8.c that I
> changed in the PR. I hope this is clearer!
> 
> Finally, Git's behaviour here is based on Unicode standards, which is
> why I acknowledged that my changes violate them, after following a
> link which is present in the utf8.h file.

[]


* RE: git-rebase is ignoring working-tree-encoding
  2018-11-04 17:07 ` Torsten Bögershausen
  2018-11-05  4:24   ` Adrián Gimeno Balaguer
@ 2018-12-23 14:46   ` Alexandre Grigoriev
  1 sibling, 0 replies; 25+ messages in thread
From: Alexandre Grigoriev @ 2018-12-23 14:46 UTC (permalink / raw)
  To: 'Torsten Bögershausen',
	'Adrián Gimeno Balaguer'
  Cc: git



>On Fri, Nov 02, 2018 at 03:30:17AM +0100, Adrián Gimeno Balaguer wrote:
>> I’m attempting to perform fixups via git-rebase of UTF-16 LE files
>> (the project I’m working on requires that exact encoding on certain
>> files). When the rebase is complete, Git changes that file’s encoding
>> to UTF-16 BE. I have been using the newer working-tree-encoding
>> attribute in .gitattributes. I’m using Git for Windows.
>> 
>> $ git version
>> git version 2.19.1.windows.1
>> 

>Thanks for the report.
>I have tried to follow the problem from your verbal descriptions
>(and the PR) but I need to admit that I don't fully understand the
>problem (yet).
>Could you try to write some instructions on how to reproduce it?
>A number of shell instructions would be great,
>in best case some kind of "test case", like the tests in
>the t/ directory in Git.
>It would be nice to be able to re-produce it.
>And if there is a bug, to get it fixed.

This is not so much a Git issue (and not a rebase issue at all) as a
libiconv issue.

The iconv program exhibits the same behavior. If you ask it to convert
to UTF-16, it will produce UTF-16BE with a BOM.

That said, it appears that the CentOS developers (tested on 7.4) saw
the problem and patched libiconv to produce UTF-16LE with a BOM.
Git on CentOS does check out files as UTF-16LE, and iconv produces
such files as well.
We just need to find out which patch they applied to libiconv.





* RE: git-rebase is ignoring working-tree-encoding
  2018-11-08 17:02           ` Torsten Bögershausen
@ 2018-12-26  0:56             ` Alexandre Grigoriev
  2018-12-26 19:25               ` brian m. carlson
  0 siblings, 1 reply; 25+ messages in thread
From: Alexandre Grigoriev @ 2018-12-26  0:56 UTC (permalink / raw)
  To: 'Torsten Bögershausen',
	'Adrián Gimeno Balaguer'
  Cc: git

> -----Original Message-----
> From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On
> Behalf Of Torsten Bogershausen
> Sent: Thursday, November 8, 2018 9:03 AM
> To: Adrián Gimeno Balaguer
> Cc: git@vger.kernel.org
> Subject: Re: git-rebase is ignoring working-tree-encoding
> 
> On Wed, Nov 07, 2018 at 05:38:18AM +0100, Adrián Gimeno Balaguer wrote:
> > Hello Torsten,
> >
> > Thanks for answering.
> >
> > To answer your question: I removed the comments mentioning "rebase"
> > because the encoding issue I reported also happens with simpler
> > operations (described in the PR), and the problem is not directly
> > related to rebasing, so I thought it better to avoid unrelated
> > confusion.
> >

> OK, I think I understand your problem now.
> The file format you are asking for could be named "UTF-16-BOM-LE",
> but that does not exist in reality.
> If you use UTF-16, then there must be a BOM, and if there is a BOM,
> then a Unicode-aware application -should- be able to handle it.
> 
> Why does your project require such a format?
> 

Many tools on Windows still do not understand UTF-8, although it's getting
better. I think Windows is about the only OS where tools still require
UTF-16 for full internationalization.
Many tools written in C use the MSVC runtime library (RTL), whose fopen(),
unfortunately, doesn't understand UTF-16BE (though even a program as
rudimentary as Notepad does).

For this reason, it's very reasonable to ask that programming tools
produce UTF-16 files with a particular endianness, the one natural for
the platform they're running on.

The iconv programmers' boneheaded decision to always produce UTF-16BE with
a BOM for UTF-16 output doesn't make sense.
Again, git and iconv/libiconv in CentOS on x86 do the right thing and
produce UTF-16LE with a BOM in this case.

Also, iconv/libiconv should not reject files with a BOM when the input
encoding is UTF-16BE or UTF-16LE.
The BOM is not some magic tag. It's just a zero-width no-break space
(U+FEFF), with the unique property that its 8-bit and 16-bit encoded
forms can be told apart from one another. It can appear anywhere in a
file. If it's the first character in the file, then the file encoding
can be reliably detected. But it's just a character, and iconv should
accept such files as valid.
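
The detection property described above can be sketched in a few lines. This is an illustration in Python using the standard `codecs` constants; the helper name `sniff_bom` is hypothetical, and UTF-32 (whose little-endian BOM also starts with FF FE) is deliberately ignored.

```python
import codecs
from typing import Optional

# U+FEFF encodes to a different byte sequence in each encoding form.
BOM_TABLE = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading U+FEFF, or return None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None


# The constants are nothing more than U+FEFF run through each encoder.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8
assert "\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE
assert "\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE
```

A file without a leading U+FEFF yields `None` here, in which case the encoding has to come from somewhere else, such as a declared attribute.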



* Re: git-rebase is ignoring working-tree-encoding
  2018-12-26  0:56             ` Alexandre Grigoriev
@ 2018-12-26 19:25               ` brian m. carlson
  2018-12-27  2:52                 ` Alexandre Grigoriev
  0 siblings, 1 reply; 25+ messages in thread
From: brian m. carlson @ 2018-12-26 19:25 UTC (permalink / raw)
  To: Alexandre Grigoriev
  Cc: 'Torsten Bögershausen',
	'Adrián Gimeno Balaguer', git


On Tue, Dec 25, 2018 at 04:56:11PM -0800, Alexandre Grigoriev wrote:
> Many tools in Windows still do not understand UTF-8, although it's getting
> better. I think Windows is about the only OS where tools still require
> UTF-16 for full internationalization.
> Many tools written in C use MSVC RTL, where fopen(), unfortunately, doesn't
> understand UTF-16BE (though such a rudimentary program as Notepad does).
> 
> For this reason, it's very reasonable to ask that the programming tools
> produce UTF-16 files with particular endianness, natural for the platform
> they're running on.
> 
> The iconv programmers' boneheaded decision to always produce UTF-16BE with
> BOM for UTF-16 output doesn't make sense.
> Again, git and iconv/libiconv in Centos on x86 do the right thing and
> produce UTF-16LE with BOM in this case.

A program which claims to support "UTF-16" must support both
endiannesses, according to RFC 2781. A program writing UTF-16LE must
not write a BOM at the beginning. I realize this is inconvenient, but
the bad behavior of some Windows programs doesn't mean that Git should
ignore interoperability with non-Windows systems using UTF-16 correctly
in favor of Windows.
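
As an illustration of those two rules, Python's standard codecs happen to behave this way. The sketch says nothing about how Git or iconv implement them, and the sample data is made up.

```python
import codecs

le = codecs.BOM_UTF16_LE + "ok".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "ok".encode("utf-16-be")

# A reader for the "UTF-16" label accepts either byte order and
# consumes the BOM.
from_le = le.decode("utf-16")
from_be = be.decode("utf-16")

# A writer for the "UTF-16LE" label emits no BOM.
written = "ok".encode("utf-16-le")

# A reader for the "UTF-16LE" label does not treat FF FE as a marker:
# it decodes it as an ordinary U+FEFF character at the start of the text.
kept = le.decode("utf-16-le")
```

Here `from_le` and `from_be` are both "ok", `written` has no BOM, and `kept` begins with U+FEFF.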
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


* RE: git-rebase is ignoring working-tree-encoding
  2018-12-26 19:25               ` brian m. carlson
@ 2018-12-27  2:52                 ` Alexandre Grigoriev
  2018-12-27 14:45                   ` Torsten Bögershausen
  0 siblings, 1 reply; 25+ messages in thread
From: Alexandre Grigoriev @ 2018-12-27  2:52 UTC (permalink / raw)
  To: 'brian m. carlson'
  Cc: 'Torsten Bögershausen',
	'Adrián Gimeno Balaguer', git


> -----Original Message-----
> From: brian m. carlson [mailto:sandals@crustytoothpaste.net]
> Sent: Wednesday, December 26, 2018 11:25 AM
> To: Alexandre Grigoriev
> Cc: 'Torsten Bögershausen'; 'Adrián Gimeno Balaguer'; git@vger.kernel.org
> Subject: Re: git-rebase is ignoring working-tree-encoding
> 
> On Tue, Dec 25, 2018 at 04:56:11PM -0800, Alexandre Grigoriev wrote:
> > Many tools in Windows still do not understand UTF-8, although it's
> > getting better. I think Windows is about the only OS where tools still
> > require
> > UTF-16 for full internationalization.
> > Many tools written in C use MSVC RTL, where fopen(), unfortunately,
> > doesn't understand UTF-16BE (though such a rudimentary program as
> Notepad does).
> >
> > For this reason, it's very reasonable to ask that the programming
> > tools produce UTF-16 files with particular endianness, natural for the
> > platform they're running on.
> >
> > The iconv programmers' boneheaded decision to always produce UTF-16BE
> > with BOM for UTF-16 output doesn't make sense.
> > Again, git and iconv/libiconv in Centos on x86 do the right thing and
> > produce UTF-16LE with BOM in this case.
> 
> A program which claims to support "UTF-16" must support both
> endiannesses, according to RFC 2781. A program writing UTF-16-LE must not
> write a BOM at the beginning. I realize this is inconvenient, but the bad
> behavior of some Windows programs doesn't mean that Git should ignore
> interoperability with non-Windows systems using UTF-16 correctly in favor of
> Windows.

OK, we have a choice, either:
a) to live in that corner of the real world where you have to use the
   available tools, some of which for historical reasons only support
   UTF-16LE with a BOM, because nobody ever throws a different flavor
   of UTF-16 at them; or
b) to live in an ivory tower where you don't really need to use UTF-16
   LE or BE or any other flavor, because everything is just UTF-8, and
   tell all those other people using that lame OS to shut up and wait
   until their tools start to support the formats you don't really
   have to care about.

> behavior of some Windows programs doesn't mean that Git should ignore
> interoperability with non-Windows systems using UTF-16 correctly in favor of
> Windows.

Yes, Git (actually libiconv) should not ignore interoperability.
This means it should check out files on a *Windows* system in a format which *Windows* tools
can understand.
And, by the way, Centos (or RedHat?) developers understood that.
There, on an x86 installation, when you ask for UTF-16, it produces UTF-16LE with BOM.
Just as every user there would want.




* Re: git-rebase is ignoring working-tree-encoding
  2018-12-27  2:52                 ` Alexandre Grigoriev
@ 2018-12-27 14:45                   ` Torsten Bögershausen
  0 siblings, 0 replies; 25+ messages in thread
From: Torsten Bögershausen @ 2018-12-27 14:45 UTC (permalink / raw)
  To: Alexandre Grigoriev
  Cc: 'brian m. carlson', 'Adrián Gimeno Balaguer',
	git

On Wed, Dec 26, 2018 at 06:52:56PM -0800, Alexandre Grigoriev wrote:
> 
> > -----Original Message-----
> > From: brian m. carlson [mailto:sandals@crustytoothpaste.net]
> > Sent: Wednesday, December 26, 2018 11:25 AM
> > To: Alexandre Grigoriev
> > Cc: 'Torsten Bögershausen'; 'Adrián Gimeno Balaguer'; git@vger.kernel.org
> > Subject: Re: git-rebase is ignoring working-tree-encoding
> > 
> > On Tue, Dec 25, 2018 at 04:56:11PM -0800, Alexandre Grigoriev wrote:
> > > Many tools in Windows still do not understand UTF-8, although it's
> > > getting better. I think Windows is about the only OS where tools still
> > > require
> > > UTF-16 for full internationalization.
> > > Many tools written in C use MSVC RTL, where fopen(), unfortunately,
> > > doesn't understand UTF-16BE (though such a rudimentary program as
> > Notepad does).
> > >
> > > For this reason, it's very reasonable to ask that the programming
> > > tools produce UTF-16 files with particular endianness, natural for the
> > > platform they're running on.
> > >
> > > The iconv programmers' boneheaded decision to always produce UTF-16BE
> > > with BOM for UTF-16 output doesn't make sense.
> > > Again, git and iconv/libiconv in Centos on x86 do the right thing and
> > > produce UTF-16LE with BOM in this case.
> > 
> > A program which claims to support "UTF-16" must support both
> > endiannesses, according to RFC 2781. A program writing UTF-16-LE must not
> > write a BOM at the beginning. I realize this is inconvenient, but the bad
> > behavior of some Windows programs doesn't mean that Git should ignore
> > interoperability with non-Windows systems using UTF-16 correctly in favor of
> > Windows.
> 
> OK, we have a choice either:
> a) to live in that corner of the real world where you have to use available tools, some of which have historical reasons
> to only support UTF-16LE with BOM, because nobody ever throws a different flavor of UTF-16 at them;
> Or b) to live in an ivory tower where you don't really need to use UTF-16 LE or BE or any other flavor,
> because everything is just UTF-8, and tell all those other people using that lame OS to shut up and wait until their tools start to support
> the formats you don't really have to care about;
> 
> > behavior of some Windows programs doesn't mean that Git should ignore
> > interoperability with non-Windows systems using UTF-16 correctly in favor of
> > Windows.
> 
> Yes, Git (actually libiconv) should not ignore interoperability.
> This means it should check out files on a *Windows* system in a format which *Windows* tools
> can understand.
> And, by the way, Centos (or RedHat?) developers understood that.
> There, on an x86 installation, when you ask for UTF-16, it produces UTF-16LE with BOM.
> Just as every user there would want.
> 
> 

Sorry if I am confused here: does the problem still exist?
If so, does the following patch help?



diff --git a/utf8.c b/utf8.c
index eb78587504..2facef84d4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -9,6 +9,23 @@ struct interval {
 	ucs_char_t last;
 };
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
+static inline uint16_t default_swab16(uint16_t val)
+{
+	return (((val & 0xff00) >>  8) |
+		((val & 0x00ff) <<  8));
+}
+
 size_t display_mode_esc_sequence_len(const char *s)
 {
 	const char *p = s;
@@ -556,21 +573,19 @@ char *reencode_string_len(const char *in, size_t insz,
 
 	out = reencode_string_iconv(in, insz, conv, outsz);
 	iconv_close(conv);
+	if (has_bom_prefix(out, *outsz, utf16_be_bom, sizeof(utf16_be_bom))) {
+		/* UTF-16 should be little endian under Git */
+		size_t    num_points = *outsz / sizeof(uint16_t);
+		uint16_t *point = (uint16_t*) out;
+		while (num_points--) {
+			*point = default_swab16(*point);
+			point++;
+		}
+	}
 	return out;
 }
 #endif
 
-static int has_bom_prefix(const char *data, size_t len,
-			  const char *bom, size_t bom_len)
-{
-	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
-}
-
-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
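
For readers following the patch above, here is a minimal standalone sketch
of the byte-swapping idea. It is not the code from the patch:
`starts_with_utf16_be_bom` and `swap_utf16_bytes` are hypothetical names,
and the sketch swaps bytes through `unsigned char` rather than casting the
buffer to `uint16_t *`, which avoids relying on the buffer's alignment.

```c
#include <stddef.h>

/* Hypothetical helper: nonzero if buf starts with the UTF-16 BE BOM (FE FF). */
int starts_with_utf16_be_bom(const unsigned char *buf, size_t len)
{
	return len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF;
}

/*
 * Hypothetical helper: swap the two bytes of every 16-bit code unit in
 * place. A trailing odd byte, if any, is left untouched.
 */
void swap_utf16_bytes(unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		unsigned char tmp = buf[i];

		buf[i] = buf[i + 1];
		buf[i + 1] = tmp;
	}
}
```

Swapping every code unit also turns the BOM FE FF into FF FE, so a
big-endian stream with BOM becomes a little-endian stream with BOM in a
single pass.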




* [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
  2018-11-04 15:47 ` brian m. carlson
  2018-11-04 17:07 ` Torsten Bögershausen
@ 2018-12-29 11:09 ` tboegi
       [not found]   ` <CADN+U_OccLuLN7_0rjikDgLT+Zvt8hka-=xsnVVLJORjYzP78Q@mail.gmail.com>
  2019-01-20 16:43 ` [PATCH v2 " tboegi
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: tboegi @ 2018-12-29 11:09 UTC (permalink / raw)
  To: git, adrigibal; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

After a checkout, the resulting file has a BOM and is encoded in "UTF-16".
The Unicode standard allows both little and big endianness (LE/BE) for
those files; the BOM tells which one is used inside the file.
iconv seems to prefer the BE version.
Not all users under Windows are happy with this, because some tools are
not fully Unicode aware and don't digest the BE version at all.

Today there is no name for "UTF-16 with BOM, little endian please".
Introduce "UTF-16LE-BOM".

Reported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

This feels like an RFC at the moment - please comment.
An alternative would be to treat plain "UTF-16" the way this patch
treats "UTF-16LE-BOM", i.e. simply produce the LE version of UTF-16
under Git. That could make people using Git happy as well.

Documentation/gitattributes.txt  |  4 +--
 compat/precompose_utf8.c         |  2 +-
 t/t0028-working-tree-encoding.sh | 12 ++++++++-
 utf8.c                           | 42 ++++++++++++++++++++++++--------
 utf8.h                           |  2 +-
 5 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..4a88ab8be7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
 ------------------------
 
 Use the following attributes if your '*.ps1' files are UTF-16 little
-endian encoded without BOM and you want Git to use Windows line endings
+endian encoded with BOM and you want Git to use Windows line endings
 in the working directory. Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.
 
 ------------------------
-*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
+*.ps1		text working-tree-encoding=UTF-16LE-BOM eol=CRLF
 ------------------------
 
 You can get a list of all available encodings on your platform with the
diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@ void precompose_argv(int argc, const char **argv)
 		size_t namelen;
 		oldarg = argv[i];
 		if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, NULL);
+			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, 0, NULL);
 			if (newarg)
 				argv[i] = newarg;
 		}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@ test_expect_success 'setup test files' '
 
 	text="hallo there!\ncan you read me?" &&
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "\377\376"                         >test.utf16lebom.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16LE >>test.utf16lebom.raw &&
 
 	# Line ending tests
 	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
@@ -32,7 +35,8 @@ test_expect_success 'setup test files' '
 	# Add only UTF-16 file, we will add the UTF-32 file later
 	cp test.utf16.raw test.utf16 &&
 	cp test.utf32.raw test.utf32 &&
-	git add .gitattributes test.utf16 &&
+	cp test.utf16lebom.raw test.utf16lebom &&
+	git add .gitattributes test.utf16 test.utf16lebom &&
 	git commit -m initial
 '
 
@@ -51,6 +55,12 @@ test_expect_success 're-encode to UTF-16 on checkout' '
 	test_cmp_bin test.utf16.raw test.utf16
 '
 
+test_expect_success 're-encode to UTF-16-LE-BOM on checkout' '
+	rm test.utf16lebom &&
+	git checkout test.utf16lebom &&
+	test_cmp_bin test.utf16lebom.raw test.utf16lebom
+'
+
 test_expect_success 'check $GIT_DIR/info/attributes support' '
 	test_when_finished "rm -f test.utf32.git" &&
 	test_when_finished "git reset --hard HEAD" &&
diff --git a/utf8.c b/utf8.c
index eb78587504..83824dc2f4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -4,6 +4,11 @@
 
 /* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */
 
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
 struct interval {
 	ucs_char_t first;
 	ucs_char_t last;
@@ -470,16 +475,17 @@ int utf8_fprintf(FILE *stream, const char *format, ...)
 #else
 	typedef char * iconv_ibp;
 #endif
-char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, size_t *outsz_p)
+char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv,
+			    size_t bom_len, size_t *outsz_p)
 {
 	size_t outsz, outalloc;
 	char *out, *outpos;
 	iconv_ibp cp;
 
 	outsz = insz;
-	outalloc = st_add(outsz, 1); /* for terminating NUL */
+	outalloc = st_add(outsz, 1 + bom_len); /* for terminating NUL */
 	out = xmalloc(outalloc);
-	outpos = out;
+	outpos = out + bom_len;
 	cp = (iconv_ibp)in;
 
 	while (1) {
@@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
 {
 	iconv_t conv;
 	char *out;
+	const char *bom_str = NULL;
+	size_t bom_len = 0;
 
 	if (!in_encoding)
 		return NULL;
 
+	/* UTF-16LE-BOM is the same as UTF-16 for reading */
+	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
+		in_encoding = "UTF-16";
+
+	/*
+	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
+	 * Some users under Windows want the little endian version
+	 */
+	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
+	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	}
+
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1) {
 		in_encoding = fallback_encoding(in_encoding);
@@ -553,9 +579,10 @@ char *reencode_string_len(const char *in, size_t insz,
 		if (conv == (iconv_t) -1)
 			return NULL;
 	}
-
-	out = reencode_string_iconv(in, insz, conv, outsz);
+	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
 	iconv_close(conv);
+	if (out && bom_str && bom_len)
+		memcpy(out, bom_str, bom_len);
 	return out;
 }
 #endif
@@ -566,11 +593,6 @@ static int has_bom_prefix(const char *data, size_t len,
 	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }
 
-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
diff --git a/utf8.h b/utf8.h
index edea55e093..84efbfcb1f 100644
--- a/utf8.h
+++ b/utf8.h
@@ -27,7 +27,7 @@ void strbuf_utf8_replace(struct strbuf *sb, int pos, int width,
 
 #ifndef NO_ICONV
 char *reencode_string_iconv(const char *in, size_t insz,
-			    iconv_t conv, size_t *outsz);
+			    iconv_t conv, size_t bom_len, size_t *outsz);
 char *reencode_string_len(const char *in, size_t insz,
 			  const char *out_encoding,
 			  const char *in_encoding,
-- 
2.20.1.2.gb21ebb671b



* Re: [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM"
       [not found]   ` <CADN+U_OccLuLN7_0rjikDgLT+Zvt8hka-=xsnVVLJORjYzP78Q@mail.gmail.com>
@ 2018-12-29 15:48     ` Adrián Gimeno Balaguer
  2018-12-29 17:54       ` Philip Oakley
  0 siblings, 1 reply; 25+ messages in thread
From: Adrián Gimeno Balaguer @ 2018-12-29 15:48 UTC (permalink / raw)
  To: tboegi, git

Hello again.

I appreciate the growing interest in this issue.

Torsten, may I ask what the benefit of your code is? My PR solved it
by only tweaking utf8.c's function 'has_prohibited_utf_bom', which
is likely the shortest way:

https://github.com/git/git/pull/550/files

In order to make sure everything is clear, here is a list of cases
covering Git's current behaviour and the new one after my PR.

Current behaviour:

- Placing 'test.txt working-tree-encoding=UTF-16' for a new test.txt
file with either UTF-16 BE or LE BOM, and committing everything -> The
file gets re-encoded from UTF-8 (as stored internally) to UTF-16 with
the default system/libiconv endianness -> Problem (if the user
requires the opposite endianness for any reason in his project). As a
note, the user can still see human-readable diffs of that file.

- Placing 'test.txt working-tree-encoding=UTF-16LE' or 'test.txt
working-tree-encoding=UTF-16BE' for a new test.txt file with either
UTF-16 BE or LE BOM, and committing everything: we assume the user is
doing this because he requires that exact endianness, and so declares it
in order to preserve it -> Git prohibits committing it; also, no
human-readable diff is shown in the diff viewer/tool being used, and the
file is simply shown as binary.

New behaviour:

- Rather than repeat it all here, please read my PR description:
https://github.com/git/git/pull/550

- Git translations may need to be tweaked in order to be consistent
with the new behaviour.

Thanks for your attention.


* Re: [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2018-12-29 15:48     ` Adrián Gimeno Balaguer
@ 2018-12-29 17:54       ` Philip Oakley
  0 siblings, 0 replies; 25+ messages in thread
From: Philip Oakley @ 2018-12-29 17:54 UTC (permalink / raw)
  To: Adrián Gimeno Balaguer, tboegi, git; +Cc: brian m. carlson

(adding Brian as cc who was in the original thread)

On 29/12/2018 15:48, Adrián Gimeno Balaguer wrote:
> Hello again.
> 
> I appreciate the growing interest in this issue.
> 
> Torsten, may I ask what the benefit of your code is? My PR solved it
> by only tweaking utf8.c's function 'has_prohibited_utf_bom', which
> is likely the shortest way:
> 
> https://github.com/git/git/pull/550/files

My main complaint with the PR would be the lack of documentation updates.

As the discussion has highlighted, whatever our solution, we will need 
to tell the users in plain and simple terms which parts of which 
standards are being used, and why we need to be somehow 'different'.

That is because a revision control system must be able to recover the
original, for use in the original software tool, not just interpret it
in some alternate form. The standards generally abdicate responsibility
for the last step ;-)

I did not fully understand the conversion process you proposed, as I 
assumed(?) that on receipt of the source file, the Git conversion to 
utf-8 would convert the 16-bit BOM to the three byte utf-8 BOM byte 
sequence `EF BB BF` which has lost any knowledge of the original BE/LE 
coding.

Or, are we saying that the 16-bit BOM is being interpreted as a)
the BE/LE indicator and b) a genuine "ZERO WIDTH NON-BREAKING SPACE"
which is stored as the three byte utf-8 character code, again losing
(once stored as a blob object) the BE/LE indication.

Or, we see the BOM, note the endianness and then lose the BOM character
when converting to utf-8. My ignorance of this step is starting to show.
Regular users are probably even more confused, hence my hope for some
documentation.

Given the above confusions, and many more when exploring the internet, 
the provision of a new, extra, clear, name for the encoding, as 
suggested by Torsten does offer an advantage in that it explicitly 
(rather than implicitly) makes plain what we are trying to do, without 
squeezing it in 'under the radar'.

That said, assuming an appropriate internal utf-8 Git coding that does 
remember the BE/LE state [if so how?] then the PR is a neat trick.

Torsten's patch also suffers from the lack of user facing documentation.

> 
> In order to make sure everything is clear, here is a case list of
> current Git behaviour and new one after my PR, regarding this issue.
> 
> Current behaviour:
> 
> - Placing 'test.txt working-tree-encoding=UTF-16' for a new test.txt
> file with either UTF-16 BE or LE BOM, and committing everything -> The
> file gets re-encoded from UTF-8 (as stored internally) to UTF-16 with
> the default system/libiconv endianness -> Problem (if the user
> requires the opposite endianness for any reason in his project). As a
> note, the user can still see human-readable diffs of that file.
> 
> - Placing 'test.txt working-tree-encoding=UTF-16LE' or 'test.txt
> working-tree-encoding=UTF-16BE' for a new test.txt file with either
> UTF-16 BE or LE BOM, and committing everything: we assume the user is
> doing this because he requires that exact endianness, and so declares it
> in order to preserve it -> Git prohibits committing it; also, no
> human-readable diff is shown in the diff viewer/tool being used, and the
> file is simply shown as binary.
> 
> New behaviour:
> 
> - Rather than repeat it all here, please read my PR description:
> https://github.com/git/git/pull/550

"In this PR: Git only prohibits the BOM opposite to the one declared in
working-tree-encoding (e.g. if LE is declared, it rejects a BE BOM
within the associated file, for the declared UTF-16/UTF-32).
This way the user can now perform Git operations which were previously
impossible, the only requirement being that the endianness of the
working-tree-encoding attribute matches that of the associated file/s."
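
As a rough illustration of the rule quoted above, the check could look
something like the following. This is a sketch under stated assumptions,
not the code in the PR and not Git's has_prohibited_utf_bom():
`has_opposite_utf16_bom` is a hypothetical name, only UTF-16 is covered,
and encoding names are compared exactly for brevity.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical check: return nonzero only when the data starts with the
 * BOM of the endianness opposite to the declared one.
 */
int has_opposite_utf16_bom(const char *enc, const unsigned char *data,
			   size_t len)
{
	if (!enc || !data || len < 2)
		return 0;
	if (!strcmp(enc, "UTF-16LE"))
		return data[0] == 0xFE && data[1] == 0xFF; /* BE BOM, LE declared */
	if (!strcmp(enc, "UTF-16BE"))
		return data[0] == 0xFF && data[1] == 0xFE; /* LE BOM, BE declared */
	return 0; /* plain "UTF-16" and anything else: either BOM is accepted */
}
```

With such a rule a file declared as UTF-16LE may carry an LE BOM, while
the same file with a BE BOM would still be rejected.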

> 
> - Git translations may need to be tweaked in order to be consistent
> with the new behaviour.
> 
> Thanks for your attention.
> 
-- 
Philip


* [PATCH v2 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
                   ` (2 preceding siblings ...)
  2018-12-29 11:09 ` [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM" tboegi
@ 2019-01-20 16:43 ` tboegi
  2019-01-22 20:13   ` Junio C Hamano
  2019-01-30 15:01 ` [PATCH v3 " tboegi
  2019-03-06  5:23 ` [PATCH v1 1/1] gitattributes.txt: fix typo tboegi
  5 siblings, 1 reply; 25+ messages in thread
From: tboegi @ 2019-01-20 16:43 UTC (permalink / raw)
  To: git, adrigibal; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

The Unicode standard itself defines 3 possible ways to encode UTF-16.
All 3 of the following versions convert back to 'g' 'i' 't' in UTF-8:

a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t
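
The three cases above can be told apart from the first two bytes alone.
The following is a minimal sketch of that decision, assuming (as case (a)
describes) that a stream without BOM is read as big endian;
`sniff_utf16_order` and the enum are hypothetical names, not part of Git
or iconv.

```c
#include <stddef.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/*
 * Hypothetical helper: decide the byte order of a stream labelled
 * "UTF-16". *bom_len receives the number of BOM bytes to skip (0 or 2).
 */
enum utf16_order sniff_utf16_order(const unsigned char *data, size_t len,
				   size_t *bom_len)
{
	*bom_len = 0;
	if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
		*bom_len = 2;
		return UTF16_LITTLE_ENDIAN;	/* case (b) */
	}
	if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
		*bom_len = 2;			/* case (c) */
	return UTF16_BIG_ENDIAN;		/* cases (a) and (c) */
}
```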

Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
in version (c) above.
This is what iconv generates; more details follow below.

iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:

d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
0000000  376 377  \0   g  \0   i  \0   t

e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
0000000    g  \0   i  \0   t  \0

f)  UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
0000000   \0   g  \0   i  \0   t

There is no way to generate version (b) from above in a Git working tree,
but that is what some applications need.
(All fully Unicode-aware applications should be able to read all 3 variants,
but in practice we are not there yet).

When producing UTF-16 as an output, iconv generates the big endian version
with a BOM (big endian was probably chosen for historical reasons).

iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
as the encoding, but that file does not have a BOM.

Not all users (especially under Windows) are happy with this.
Some tools are not fully Unicode aware and can only handle version (b).

Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will
be used in all future iconv versions (for compatibility reasons).

Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv cannot handle this encoding, so Git picks it up, handles the BOM
itself, and uses libiconv to convert the rest of the stream.
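
The idea can be sketched outside of Git as follows: write the BOM
yourself, then let iconv produce BOM-less "UTF-16LE" behind it. This is a
simplified illustration, not the code in the diff below;
`utf8_to_utf16le_bom` is a hypothetical name, the POSIX iconv() prototype
(`char **` input pointer) is assumed, and overflow of the size
computation is not handled.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert UTF-8 input to UTF-16 little endian with a
 * leading BOM. Returns a malloc()ed buffer (caller frees) and stores its
 * length in *out_len, or returns NULL on failure.
 */
char *utf8_to_utf16le_bom(const char *in, size_t in_len, size_t *out_len)
{
	static const char bom[] = {'\xFF', '\xFE'};
	/* UTF-16 output never needs more than 2 bytes per UTF-8 input byte. */
	size_t cap = sizeof(bom) + 2 * in_len;
	char *out = malloc(cap);
	char *src = (char *)in;	/* iconv() takes a non-const pointer */
	char *dst;
	size_t src_left = in_len, dst_left;
	iconv_t cd;

	if (!out)
		return NULL;
	cd = iconv_open("UTF-16LE", "UTF-8");
	if (cd == (iconv_t)-1) {
		free(out);
		return NULL;
	}
	memcpy(out, bom, sizeof(bom));	/* the BOM is written by the caller */
	dst = out + sizeof(bom);
	dst_left = cap - sizeof(bom);
	if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
		iconv_close(cd);
		free(out);
		return NULL;
	}
	iconv_close(cd);
	*out_len = (size_t)(dst - out);
	return out;
}
```

Called with the three bytes of "git", this should yield version (b)
above: the BOM FF FE followed by 'g' \0 'i' \0 't' \0.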

Reported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

I still think it makes sense to support UTF-16, little endian and
with BOM, in Git.
This v2 should make clearer which standards we follow, and why
the naming scheme of Unicode does not cover all use cases in the real world.

 Documentation/gitattributes.txt  |  4 +--
 compat/precompose_utf8.c         |  2 +-
 t/t0028-working-tree-encoding.sh | 12 ++++++++-
 utf8.c                           | 42 ++++++++++++++++++++++++--------
 utf8.h                           |  2 +-
 5 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..4a88ab8be7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
 ------------------------

 Use the following attributes if your '*.ps1' files are UTF-16 little
-endian encoded without BOM and you want Git to use Windows line endings
+endian encoded with BOM and you want Git to use Windows line endings
 in the working directory. Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 ------------------------
-*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
+*.ps1		text working-tree-encoding=UTF-16LE-BOM eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@ void precompose_argv(int argc, const char **argv)
 		size_t namelen;
 		oldarg = argv[i];
 		if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, NULL);
+			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, 0, NULL);
 			if (newarg)
 				argv[i] = newarg;
 		}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@ test_expect_success 'setup test files' '

 	text="hallo there!\ncan you read me?" &&
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "\377\376"                         >test.utf16lebom.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16LE >>test.utf16lebom.raw &&

 	# Line ending tests
 	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
@@ -32,7 +35,8 @@ test_expect_success 'setup test files' '
 	# Add only UTF-16 file, we will add the UTF-32 file later
 	cp test.utf16.raw test.utf16 &&
 	cp test.utf32.raw test.utf32 &&
-	git add .gitattributes test.utf16 &&
+	cp test.utf16lebom.raw test.utf16lebom &&
+	git add .gitattributes test.utf16 test.utf16lebom &&
 	git commit -m initial
 '

@@ -51,6 +55,12 @@ test_expect_success 're-encode to UTF-16 on checkout' '
 	test_cmp_bin test.utf16.raw test.utf16
 '

+test_expect_success 're-encode to UTF-16-LE-BOM on checkout' '
+	rm test.utf16lebom &&
+	git checkout test.utf16lebom &&
+	test_cmp_bin test.utf16lebom.raw test.utf16lebom
+'
+
 test_expect_success 'check $GIT_DIR/info/attributes support' '
 	test_when_finished "rm -f test.utf32.git" &&
 	test_when_finished "git reset --hard HEAD" &&
diff --git a/utf8.c b/utf8.c
index eb78587504..83824dc2f4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -4,6 +4,11 @@

 /* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */

+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
 struct interval {
 	ucs_char_t first;
 	ucs_char_t last;
@@ -470,16 +475,17 @@ int utf8_fprintf(FILE *stream, const char *format, ...)
 #else
 	typedef char * iconv_ibp;
 #endif
-char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, size_t *outsz_p)
+char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv,
+			    size_t bom_len, size_t *outsz_p)
 {
 	size_t outsz, outalloc;
 	char *out, *outpos;
 	iconv_ibp cp;

 	outsz = insz;
-	outalloc = st_add(outsz, 1); /* for terminating NUL */
+	outalloc = st_add(outsz, 1 + bom_len); /* for terminating NUL */
 	out = xmalloc(outalloc);
-	outpos = out;
+	outpos = out + bom_len;
 	cp = (iconv_ibp)in;

 	while (1) {
@@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
 {
 	iconv_t conv;
 	char *out;
+	const char *bom_str = NULL;
+	size_t bom_len = 0;

 	if (!in_encoding)
 		return NULL;

+	/* UTF-16LE-BOM is the same as UTF-16 for reading */
+	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
+		in_encoding = "UTF-16";
+
+	/*
+	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
+	 * Some users under Windows want the little endian version
+	 */
+	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
+	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	}
+
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1) {
 		in_encoding = fallback_encoding(in_encoding);
@@ -553,9 +579,10 @@ char *reencode_string_len(const char *in, size_t insz,
 		if (conv == (iconv_t) -1)
 			return NULL;
 	}
-
-	out = reencode_string_iconv(in, insz, conv, outsz);
+	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
 	iconv_close(conv);
+	if (out && bom_str && bom_len)
+		memcpy(out, bom_str, bom_len);
 	return out;
 }
 #endif
@@ -566,11 +593,6 @@ static int has_bom_prefix(const char *data, size_t len,
 	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
diff --git a/utf8.h b/utf8.h
index edea55e093..84efbfcb1f 100644
--- a/utf8.h
+++ b/utf8.h
@@ -27,7 +27,7 @@ void strbuf_utf8_replace(struct strbuf *sb, int pos, int width,

 #ifndef NO_ICONV
 char *reencode_string_iconv(const char *in, size_t insz,
-			    iconv_t conv, size_t *outsz);
+			    iconv_t conv, size_t bom_len, size_t *outsz);
 char *reencode_string_len(const char *in, size_t insz,
 			  const char *out_encoding,
 			  const char *in_encoding,
--
2.20.1.2.gb21ebb671



* Re: [PATCH v2 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2019-01-20 16:43 ` [PATCH v2 " tboegi
@ 2019-01-22 20:13   ` Junio C Hamano
  0 siblings, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2019-01-22 20:13 UTC (permalink / raw)
  To: tboegi; +Cc: git, adrigibal

tboegi@web.de writes:

> The unicode standard itself defines 3 possible ways how to encode UTF-16.
> a) UTF-16, without BOM, big endian:
> b) UTF-16, with BOM, little endian:
> c) UTF-16, with BOM, big endian:

Is it OK to interpret "possible" as "allowed" above?

> iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
>
> d) UTF-16
> $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> 0000000  376 377  \0   g  \0   i  \0   t

So among three, encoder can only do "big endian with BOM" (c).

Lack of (a) "big endian without BOM" in the encoder is not a problem
in practice, as you can ask UTF-16BE to produce the stream, tell the
decoder that you have UTF-16 and the lack of the BOM would make the
decoder take it as (a).

But lack of (b) "little endian with BOM" is a problem.

So the proposal is to invent UTF-16-[BL]E-BOM that prepends BOM in
front of UTF-16-[BL]E output to allow those who want (b).

Which makes sense, I guess.  I do find it a bit ugly in the sense
that it is something iconv should learn to do, as the issue is
shared with all applications that want to use libiconv and convert
into UTF-16.

Do you add UTF-16-BE-BOM for consistency?  It would be identical to
telling iconv to encode to UTF-16, if I understood your problem
description correctly.

> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index b8392fc330..4a88ab8be7 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -343,13 +343,13 @@ automatic line ending conversion based on your platform.
>  ------------------------
>
>  Use the following attributes if your '*.ps1' files are UTF-16 little
> -endian encoded without BOM and you want Git to use Windows line endings
> +endian encoded with BOM and you want Git to use Windows line endings
>  in the working directory. Please note, it is highly recommended to
>  explicitly define the line endings with `eol` if the `working-tree-encoding`
>  attribute is used to avoid ambiguity.
>
>  ------------------------
> -*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
> +*.ps1		text working-tree-encoding=UTF-16LE-BOM eol=CRLF
>  ------------------------

This change is robbing from those who do want a file without BOM to
give to those who do want a file with BOM.  Are the latter class of
people the majority of the intended readers (read: Windows folks)?

I wonder if the following, instead of the above hunk, would work better:

 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory. Please note, it is highly recommended to
+in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+you want UTF-16 little endian with BOM).
+Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`

> @@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
>  {
>  	iconv_t conv;
>  	char *out;
> +	const char *bom_str = NULL;
> +	size_t bom_len = 0;
>
>  	if (!in_encoding)
>  		return NULL;
>
> +	/* UTF-16LE-BOM is the same as UTF-16 for reading */
> +	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
> +		in_encoding = "UTF-16";
> +
> +	/*
> +	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
> +	 * Some users under Windows want the little endian version
> +	 */
> +	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
> +		bom_str = utf16_le_bom;
> +		bom_len = sizeof(utf16_le_bom);
> +		out_encoding = "UTF-16LE";
> +	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
> +		bom_str = utf16_be_bom;
> +		bom_len = sizeof(utf16_be_bom);
> +		out_encoding = "UTF-16BE";

OK, you do allow BE-BOM and the code does not rely on the fact that
iconv happens to produce it with "UTF-16", because the library is
free to switch between the three possible output (a)-(c) and we do
not want to get affected by such a switch.  Makes sense.
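
To make the point about not depending on what iconv happens to emit for
"UTF-16" concrete, here is a minimal table-driven sketch of the name
mapping. It is only an illustration, not the code in the patch: the struct,
the table and lookup_bom_encoding() are hypothetical names, and it uses an
exact strcmp() where the patch uses same_utf_encoding().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one Git-specific encoding name with an explicit BOM. */
struct bom_encoding {
	const char *git_name;   /* name written in .gitattributes */
	const char *iconv_name; /* BOM-less name handed to iconv_open() */
	const char *bom;        /* bytes Git itself puts in front */
	size_t bom_len;
};

static const struct bom_encoding bom_encodings[] = {
	{ "UTF-16LE-BOM", "UTF-16LE", "\xFF\xFE", 2 },
	{ "UTF-16BE-BOM", "UTF-16BE", "\xFE\xFF", 2 },
};

/*
 * Return the table entry for a Git-specific "-BOM" name, or NULL when
 * the name should be passed to iconv unchanged.
 * Simplification: exact, case-sensitive match only.
 */
const struct bom_encoding *lookup_bom_encoding(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(bom_encodings) / sizeof(bom_encodings[0]); i++)
		if (!strcmp(name, bom_encodings[i].git_name))
			return &bom_encodings[i];
	return NULL;
}
```

Because both endian variants name their BOM explicitly, the result stays
the same even if a library changes what plain "UTF-16" produces.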


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
                   ` (3 preceding siblings ...)
  2019-01-20 16:43 ` [PATCH v2 " tboegi
@ 2019-01-30 15:01 ` tboegi
  2019-01-30 15:24   ` Jason Pyeron
  2019-03-06  5:23 ` [PATCH v1 1/1] gitattributes.txt: fix typo tboegi
  5 siblings, 1 reply; 25+ messages in thread
From: tboegi @ 2019-01-30 15:01 UTC (permalink / raw)
  To: git, adrigibal; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

The unicode standard itself defines 3 allowed ways how to encode UTF-16.
The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:

a) UTF-16, without BOM, big endian:
$ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

b) UTF-16, with BOM, little endian:
$ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

c) UTF-16, with BOM, big endian:
$ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
0000000    g   i   t

Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
working tree.
After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
in the version (c) above.
This is what iconv generates, more details follow below.

iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:

d) UTF-16
$ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
0000000  376 377  \0   g  \0   i  \0   t

e) UTF-16LE
$ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
0000000    g  \0   i  \0   t  \0

f)  UTF-16BE
$ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
0000000   \0   g  \0   i  \0   t

There is no way to generate version (b) from above in a Git working tree,
but that is what some applications need.
(All fully unicode aware applications should be able to read all 3 variants,
but in practice we are not there yet).

When producing UTF-16 as an output, iconv generates the big endian version
with a BOM. (big endian is probably chosen for historical reasons).

iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
as encoding, and that file does not have a BOM.

Not all users (especially under Windows) are happy with this.
Some tools are not fully unicode aware and can only handle version (b).

Today there is no way to produce version (b) with iconv (or libiconv).
Looking into the history of iconv, it seems as if version (c) will
be used in all future iconv versions (for compatibility reasons).

Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM".
libiconv cannot handle this encoding, so Git picks it up, handles the BOM
and uses libiconv to convert the rest of the stream.
(UTF-16BE-BOM is added for consistency)

Reported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

Changes since v2:
  Update the commit message (s/possible/allowed/)
  Update the documentation, as suggested by Junio:
  ...wonder if the following,
     instead of the above hunk, would work better..
  Yes, it does.
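
For reviewers, the buffer handling in the utf8.c hunks below can be
sketched in isolation: reserve bom_len bytes at the front of the output,
let the converter write after them, then copy the BOM in. This is a
simplified sketch, not the patch itself. ascii_to_utf16le() is a
hypothetical stand-in for the iconv call and only handles 7-bit ASCII;
encode_utf16le_bom() and sketch_utf16_le_bom are likewise made-up names.

```c
#include <stdlib.h>
#include <string.h>

static const char sketch_utf16_le_bom[] = {'\xFF', '\xFE'};

/*
 * Hypothetical stand-in for iconv: widen 7-bit ASCII to UTF-16LE code
 * units. Returns the number of bytes written to out.
 */
static size_t ascii_to_utf16le(const char *in, size_t insz, char *out)
{
	size_t i;

	for (i = 0; i < insz; i++) {
		out[2 * i] = in[i];     /* low byte first: little endian */
		out[2 * i + 1] = '\0';  /* high byte is zero for ASCII */
	}
	return 2 * insz;
}

/*
 * Produce variant (b): BOM followed by little endian code units.
 * The caller frees the result; *outsz receives the length without the
 * terminating NUL.
 */
char *encode_utf16le_bom(const char *in, size_t insz, size_t *outsz)
{
	size_t bom_len = sizeof(sketch_utf16_le_bom);
	char *out = malloc(bom_len + 2 * insz + 1); /* BOM + payload + NUL */
	size_t n;

	if (!out)
		return NULL;
	/* The converter starts writing after the reserved BOM bytes. */
	n = ascii_to_utf16le(in, insz, out + bom_len);
	/* The BOM is filled in once the conversion has succeeded. */
	memcpy(out, sketch_utf16_le_bom, bom_len);
	out[bom_len + n] = '\0';
	if (outsz)
		*outsz = bom_len + n;
	return out;
}
```

Calling encode_utf16le_bom("git", 3, &n) yields the same eight bytes as
the printf in example (b) of the commit message.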

Documentation/gitattributes.txt  |  4 ++-
 compat/precompose_utf8.c         |  2 +-
 t/t0028-working-tree-encoding.sh | 12 ++++++++-
 utf8.c                           | 42 ++++++++++++++++++++++++--------
 utf8.h                           |  2 +-
 5 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..a2310fb920 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -344,7 +344,9 @@ automatic line ending conversion based on your platform.

 Use the following attributes if your '*.ps1' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory. Please note, it is highly recommended to
+in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+you want UTF-16 little endian with BOM).
+Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@ void precompose_argv(int argc, const char **argv)
 		size_t namelen;
 		oldarg = argv[i];
 		if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, NULL);
+			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, 0, NULL);
 			if (newarg)
 				argv[i] = newarg;
 		}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@ test_expect_success 'setup test files' '

 	text="hallo there!\ncan you read me?" &&
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "\377\376"                         >test.utf16lebom.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16LE >>test.utf16lebom.raw &&

 	# Line ending tests
 	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
@@ -32,7 +35,8 @@ test_expect_success 'setup test files' '
 	# Add only UTF-16 file, we will add the UTF-32 file later
 	cp test.utf16.raw test.utf16 &&
 	cp test.utf32.raw test.utf32 &&
-	git add .gitattributes test.utf16 &&
+	cp test.utf16lebom.raw test.utf16lebom &&
+	git add .gitattributes test.utf16 test.utf16lebom &&
 	git commit -m initial
 '

@@ -51,6 +55,12 @@ test_expect_success 're-encode to UTF-16 on checkout' '
 	test_cmp_bin test.utf16.raw test.utf16
 '

+test_expect_success 're-encode to UTF-16-LE-BOM on checkout' '
+	rm test.utf16lebom &&
+	git checkout test.utf16lebom &&
+	test_cmp_bin test.utf16lebom.raw test.utf16lebom
+'
+
 test_expect_success 'check $GIT_DIR/info/attributes support' '
 	test_when_finished "rm -f test.utf32.git" &&
 	test_when_finished "git reset --hard HEAD" &&
diff --git a/utf8.c b/utf8.c
index eb78587504..83824dc2f4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -4,6 +4,11 @@

 /* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */

+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
 struct interval {
 	ucs_char_t first;
 	ucs_char_t last;
@@ -470,16 +475,17 @@ int utf8_fprintf(FILE *stream, const char *format, ...)
 #else
 	typedef char * iconv_ibp;
 #endif
-char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, size_t *outsz_p)
+char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv,
+			    size_t bom_len, size_t *outsz_p)
 {
 	size_t outsz, outalloc;
 	char *out, *outpos;
 	iconv_ibp cp;

 	outsz = insz;
-	outalloc = st_add(outsz, 1); /* for terminating NUL */
+	outalloc = st_add(outsz, 1 + bom_len); /* for terminating NUL */
 	out = xmalloc(outalloc);
-	outpos = out;
+	outpos = out + bom_len;
 	cp = (iconv_ibp)in;

 	while (1) {
@@ -540,10 +546,30 @@ char *reencode_string_len(const char *in, size_t insz,
 {
 	iconv_t conv;
 	char *out;
+	const char *bom_str = NULL;
+	size_t bom_len = 0;

 	if (!in_encoding)
 		return NULL;

+	/* UTF-16LE-BOM is the same as UTF-16 for reading */
+	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
+		in_encoding = "UTF-16";
+
+	/*
+	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
+	 * Some users under Windows want the little endian version
+	 */
+	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
+	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	}
+
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1) {
 		in_encoding = fallback_encoding(in_encoding);
@@ -553,9 +579,10 @@ char *reencode_string_len(const char *in, size_t insz,
 		if (conv == (iconv_t) -1)
 			return NULL;
 	}
-
-	out = reencode_string_iconv(in, insz, conv, outsz);
+	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
 	iconv_close(conv);
+	if (out && bom_str && bom_len)
+		memcpy(out, bom_str, bom_len);
 	return out;
 }
 #endif
@@ -566,11 +593,6 @@ static int has_bom_prefix(const char *data, size_t len,
 	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
diff --git a/utf8.h b/utf8.h
index edea55e093..84efbfcb1f 100644
--- a/utf8.h
+++ b/utf8.h
@@ -27,7 +27,7 @@ void strbuf_utf8_replace(struct strbuf *sb, int pos, int width,

 #ifndef NO_ICONV
 char *reencode_string_iconv(const char *in, size_t insz,
-			    iconv_t conv, size_t *outsz);
+			    iconv_t conv, size_t bom_len, size_t *outsz);
 char *reencode_string_len(const char *in, size_t insz,
 			  const char *out_encoding,
 			  const char *in_encoding,
--
2.20.1.2.gb21ebb671


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* RE: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2019-01-30 15:01 ` [PATCH v3 " tboegi
@ 2019-01-30 15:24   ` Jason Pyeron
  2019-01-30 17:49     ` Torsten Bögershausen
  0 siblings, 1 reply; 25+ messages in thread
From: Jason Pyeron @ 2019-01-30 15:24 UTC (permalink / raw)
  To: tboegi, git, adrigibal

> -----Original Message-----
> From: git-owner@vger.kernel.org <git-owner@vger.kernel.org> On Behalf Of
> tboegi@web.de
> Sent: Wednesday, January 30, 2019 10:02 AM
> To: git@vger.kernel.org; adrigibal@gmail.com
> Cc: Torsten Bögershausen <tboegi@web.de>
> Subject: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
> 
> From: Torsten Bögershausen <tboegi@web.de>
> 
> Users who want UTF-16 files in the working tree set the .gitattributes
> like this:
> test.txt working-tree-encoding=UTF-16
> 
> The unicode standard itself defines 3 allowed ways how to encode UTF-16.
> The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:
> 
> a) UTF-16, without BOM, big endian:
> $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> 0000000    g   i   t
> 
> b) UTF-16, with BOM, little endian:
> $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
> 0000000    g   i   t
> 
> c) UTF-16, with BOM, big endian:
> $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> 0000000    g   i   t
> 
> Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
> working tree.
> After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
> in the version (c) above.
> This is what iconv generates, more details follow below.
> 
> iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
> 
> d) UTF-16
> $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> 0000000  376 377  \0   g  \0   i  \0   t
> 
> e) UTF-16LE
> $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
> 0000000    g  \0   i  \0   t  \0
> 
> f)  UTF-16BE
> $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
> 0000000   \0   g  \0   i  \0   t
> 
> There is no way to generate version (b) from above in a Git working tree,
> but that is what some applications need.
> (All fully unicode aware applications should be able to read all 3
> variants,
> but in practice we are not there yet).
> 
> When producing UTF-16 as an output, iconv generates the big endian version
> with a BOM. (big endian is probably chosen for historical reasons).
> 
> iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
> as encoding, and that file does not have a BOM.
> 
> Not all users (especially under Windows) are happy with this.
> Some tools are not fully unicode aware and can only handle version (b).
> 
> Today there is no way to produce version (b) with iconv (or libiconv).
> Looking into the history of iconv, it seems as if version (c) will
> be used in all future iconv versions (for compatibility reasons).


Reading the RFC 2781 section 3.3:
 
   Text in the "UTF-16BE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in big-endian order.
   Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

   Text in the "UTF-16LE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in little-endian order.
   Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

I opened a bug with libiconv... https://savannah.gnu.org/bugs/index.php?55609
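
The two MUST NOT rules quoted above can be sketched as a small check. This
is only an illustration under stated assumptions: the function name
sketch_has_prohibited_bom() is made up (it is not Git's
has_prohibited_utf_bom()), and the label comparison is an exact,
case-sensitive strcmp().

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 when data labelled "UTF-16BE" or "UTF-16LE" starts with a
 * UTF-16 BOM, which RFC 2781 section 3.3 forbids; return 0 otherwise.
 * Data labelled plain "UTF-16" may carry a BOM and is never flagged.
 */
int sketch_has_prohibited_bom(const char *enc, const char *data, size_t len)
{
	int endian_label = !strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE");

	if (!endian_label || len < 2)
		return 0;
	return !memcmp(data, "\xFE\xFF", 2) || !memcmp(data, "\xFF\xFE", 2);
}
```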




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
  2019-01-30 15:24   ` Jason Pyeron
@ 2019-01-30 17:49     ` Torsten Bögershausen
  0 siblings, 0 replies; 25+ messages in thread
From: Torsten Bögershausen @ 2019-01-30 17:49 UTC (permalink / raw)
  To: Jason Pyeron; +Cc: git, adrigibal

On Wed, Jan 30, 2019 at 10:24:44AM -0500, Jason Pyeron wrote:
> > -----Original Message-----
> > From: git-owner@vger.kernel.org <git-owner@vger.kernel.org> On Behalf Of
> > tboegi@web.de
> > Sent: Wednesday, January 30, 2019 10:02 AM
> > To: git@vger.kernel.org; adrigibal@gmail.com
> > Cc: Torsten Bögershausen <tboegi@web.de>
> > Subject: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
> >
> > From: Torsten Bögershausen <tboegi@web.de>
> >
> > Users who want UTF-16 files in the working tree set the .gitattributes
> > like this:
> > test.txt working-tree-encoding=UTF-16
> >
> > The unicode standard itself defines 3 allowed ways how to encode UTF-16.
> > The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:
> >
> > a) UTF-16, without BOM, big endian:
> > $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> > 0000000    g   i   t
> >
> > b) UTF-16, with BOM, little endian:
> > $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
> > 0000000    g   i   t
> >
> > c) UTF-16, with BOM, big endian:
> > $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> > 0000000    g   i   t
> >
> > Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
> > working tree.
> > After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
> > in the version (c) above.
> > This is what iconv generates, more details follow below.
> >
> > iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
> >
> > d) UTF-16
> > $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> > 0000000  376 377  \0   g  \0   i  \0   t
> >
> > e) UTF-16LE
> > $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
> > 0000000    g  \0   i  \0   t  \0
> >
> > f)  UTF-16BE
> > $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
> > 0000000   \0   g  \0   i  \0   t
> >
> > There is no way to generate version (b) from above in a Git working tree,
> > but that is what some applications need.
> > (All fully unicode aware applications should be able to read all 3
> > variants,
> > but in practice we are not there yet).
> >
> > When producing UTF-16 as an output, iconv generates the big endian version
> > with a BOM. (big endian is probably chosen for historical reasons).
> >
> > iconv can produce UTF-16 files with little endianness by using "UTF-16LE"
> > as encoding, and that file does not have a BOM.
> >
> > Not all users (especially under Windows) are happy with this.
> > Some tools are not fully unicode aware and can only handle version (b).
> >
> > Today there is no way to produce version (b) with iconv (or libiconv).
> > Looking into the history of iconv, it seems as if version (c) will
> > be used in all future iconv versions (for compatibility reasons).
>
>
> Reading the RFC 2781 section 3.3:
>
>    Text in the "UTF-16BE" charset MUST be serialized with the octets
>    which make up a single 16-bit UTF-16 value in big-endian order.
>    Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.
>
>    Text in the "UTF-16LE" charset MUST be serialized with the octets
>    which make up a single 16-bit UTF-16 value in little-endian order.
>    Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.
>
> I opened a bug with libiconv... https://savannah.gnu.org/bugs/index.php?55609
>

UTF-16 may be a), b) or c) from above.
Every unicode compliant system should be able to read all 3 of them.

When writing, the system/application/converter is free to choose one of those.
Probably for historical reasons, big endian is preferred (in iconv),
and to be helpful to systems/applications a BOM is written at the beginning.
This is in accordance with the RFC; why do you think that this is a bug?
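
As an illustration of what "able to read all 3 of them" means for a reader
of data labelled plain "UTF-16", here is a minimal sketch. The enum and
utf16_detect_order() are hypothetical names; the rule that BOM-less input
is treated as big endian follows variant a) above.

```c
#include <stddef.h>
#include <string.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/*
 * Decide the byte order of data labelled plain "UTF-16".
 * *skip receives the number of BOM bytes to drop before decoding.
 */
enum utf16_order utf16_detect_order(const char *data, size_t len, size_t *skip)
{
	*skip = 0;
	if (len >= 2 && !memcmp(data, "\xFF\xFE", 2)) {
		*skip = 2;
		return UTF16_LITTLE_ENDIAN; /* b) BOM, little endian */
	}
	if (len >= 2 && !memcmp(data, "\xFE\xFF", 2)) {
		*skip = 2;
		return UTF16_BIG_ENDIAN;    /* c) BOM, big endian */
	}
	return UTF16_BIG_ENDIAN;            /* a) no BOM, big endian */
}
```

A writer only has to pick one of the three forms; a reader has to accept
all of them, which is why the BOM check comes first.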




^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v1 1/1] gitattributes.txt: fix typo
  2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
                   ` (4 preceding siblings ...)
  2019-01-30 15:01 ` [PATCH v3 " tboegi
@ 2019-03-06  5:23 ` tboegi
  2019-03-07  0:24   ` Junio C Hamano
  5 siblings, 1 reply; 25+ messages in thread
From: tboegi @ 2019-03-06  5:23 UTC (permalink / raw)
  To: git, ybhatambare; +Cc: Torsten Bögershausen

From: Yash Bhatambare <ybhatambare@gmail.com>

`UTF-16-LE-BOM` to `UTF-16LE-BOM`.

this closes https://github.com/git-for-windows/git/issues/2095

Signed-off-by: Yash Bhatambare <ybhatambare@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

This patch already made it into Git for Windows,
so I send it upstream "as is".

Documentation/gitattributes.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 9b41f81c06..bdd11a2ddd 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -346,7 +346,7 @@ automatic line ending conversion based on your platform.

 Use the following attributes if your '*.ps1' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
+in the working directory (use `UTF-16LE-BOM` instead of `UTF-16LE` if
 you want UTF-16 little endian with BOM).
 Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
--
2.19.1.593.gc670b1f876


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 1/1] gitattributes.txt: fix typo
  2019-03-06  5:23 ` [PATCH v1 1/1] gitattributes.txt: fix typo tboegi
@ 2019-03-07  0:24   ` Junio C Hamano
  0 siblings, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2019-03-07  0:24 UTC (permalink / raw)
  To: tboegi; +Cc: git, ybhatambare

tboegi@web.de writes:

>  Use the following attributes if your '*.ps1' files are UTF-16 little
>  endian encoded without BOM and you want Git to use Windows line endings
> -in the working directory (use `UTF-16-LE-BOM` instead of `UTF-16LE` if
> +in the working directory (use `UTF-16LE-BOM` instead of `UTF-16LE` if

Thanks for your attention to detail ;-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-03-07  0:24 UTC | newest]

Thread overview: 25+ messages
2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
2018-11-04 15:47 ` brian m. carlson
2018-11-04 16:37   ` Adrián Gimeno Balaguer
2018-11-04 18:38     ` brian m. carlson
2018-11-04 17:07 ` Torsten Bögershausen
2018-11-05  4:24   ` Adrián Gimeno Balaguer
2018-11-05 18:10     ` Torsten Bögershausen
2018-11-06 20:16       ` Torsten Bögershausen
2018-11-07  4:38         ` Adrián Gimeno Balaguer
2018-11-08 17:02           ` Torsten Bögershausen
2018-12-26  0:56             ` Alexandre Grigoriev
2018-12-26 19:25               ` brian m. carlson
2018-12-27  2:52                 ` Alexandre Grigoriev
2018-12-27 14:45                   ` Torsten Bögershausen
2018-12-23 14:46   ` Alexandre Grigoriev
2018-12-29 11:09 ` [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM" tboegi
     [not found]   ` <CADN+U_OccLuLN7_0rjikDgLT+Zvt8hka-=xsnVVLJORjYzP78Q@mail.gmail.com>
2018-12-29 15:48     ` Adrián Gimeno Balaguer
2018-12-29 17:54       ` Philip Oakley
2019-01-20 16:43 ` [PATCH v2 " tboegi
2019-01-22 20:13   ` Junio C Hamano
2019-01-30 15:01 ` [PATCH v3 " tboegi
2019-01-30 15:24   ` Jason Pyeron
2019-01-30 17:49     ` Torsten Bögershausen
2019-03-06  5:23 ` [PATCH v1 1/1] gitattributes.txt: fix typo tboegi
2019-03-07  0:24   ` Junio C Hamano
