From: Philip Oakley <philipoakley@talktalk.net>
To: "Adrián Gimeno Balaguer" <adrigibal@gmail.com>,
	tboegi@web.de, git@vger.kernel.org
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM"
Date: Sat, 29 Dec 2018 17:54:49 +0000
Message-ID: <27bb049b-265f-7fec-06ef-5e8e29c2d7f7@talktalk.net>
In-Reply-To: <CADN+U_Mo4Ui-rmZe1+xoHOMA4koXGNpJ5XEGYoYZfYPGqP9VPQ@mail.gmail.com>

(adding Brian as cc who was in the original thread)

On 29/12/2018 15:48, Adrián Gimeno Balaguer wrote:
> Hello again.
> 
> I appreciate the growing interest in this issue.
> 
> Torsten, may I know what the benefit of your code is? My PR solved it
> by only tweaking the utf8.c's function 'has_prohibited_utf_bom', which
> is likely the shortest way:
> 
> https://github.com/git/git/pull/550/files

My main complaint with the PR would be the lack of documentation updates.

As the discussion has highlighted, whatever our solution, we will need 
to tell the users in plain and simple terms which parts of which 
standards are being used, and why we need to be somehow 'different'.

That is because a revision control system must be able to recover the 
original, for use in the original software tool, not just interpret it 
in some alternate form. The standards generally abdicate responsibility 
for the last step ;-)

I did not fully understand the conversion process you proposed, as I 
assumed(?) that on receipt of the source file, the Git conversion to 
utf-8 would convert the 16-bit BOM to the three byte utf-8 BOM byte 
sequence `EF BB BF` which has lost any knowledge of the original BE/LE 
coding.

Or, are we saying that the 16-bit BOM is being interpreted as a) 
the BE/LE indicator and b) a genuine "ZERO WIDTH NON-BREAKING SPACE", 
which is stored as the three-byte utf-8 character code, again losing 
(once stored as a blob object) the BE/LE indication?

Or, we see the BOM, note the endianness and then lose the BOM character 
when converting to utf-8. My ignorance of this step is starting to show. 
Regular users are probably even more confused, hence my hope for some 
documentation.
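
As a rough illustration of the last two readings, here is a minimal 
Python sketch. It uses Python's own codecs rather than Git's conversion 
code, so it only shows what the encodings themselves allow, not what Git 
actually does; the variable names are purely illustrative.

```python
import codecs

# The letter "A" preceded by a BOM, in both UTF-16 byte orders.
le_bytes = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # FF FE 41 00
be_bytes = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")  # FE FF 00 41

# Reading (b): decode with an explicit byte order, so the BOM is kept
# as an ordinary U+FEFF character and re-encoded to UTF-8 as EF BB BF.
utf8_kept_le = le_bytes.decode("utf-16-le").encode("utf-8")
utf8_kept_be = be_bytes.decode("utf-16-be").encode("utf-8")

# Last reading: the generic "utf-16" decoder uses the BOM to pick the
# byte order and then drops it from the decoded text.
utf8_dropped_le = le_bytes.decode("utf-16").encode("utf-8")
utf8_dropped_be = be_bytes.decode("utf-16").encode("utf-8")

print(utf8_kept_le.hex(" "), "|", utf8_kept_be.hex(" "))
print(utf8_dropped_le.hex(" "), "|", utf8_dropped_be.hex(" "))
```

Either way the UTF-8 bytes come out identical for LE and BE input, so 
the stored blob alone cannot say which byte order the original file 
used; that knowledge has to come from somewhere else, such as the 
declared working-tree-encoding.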

Given the above confusions, and many more when exploring the internet, 
the provision of a new, extra, clear name for the encoding, as 
suggested by Torsten, does offer an advantage in that it explicitly 
(rather than implicitly) makes plain what we are trying to do, without 
squeezing it in 'under the radar'.

That said, assuming an appropriate internal utf-8 Git coding that does 
remember the BE/LE state [if so how?] then the PR is a neat trick.

Torsten's patch also suffers from the lack of user facing documentation.

> 
> In order to make sure everything is clear, here is a case list of
> current Git behaviour and new one after my PR, regarding this issue.
> 
> Current behaviour:
> 
> - Placing 'test.txt working-tree-encoding=UTF-16' for a new test.txt
> file with either UTF-16 BE or LE BOM, and committing everything -> The
> file gets re-encoded from UTF-8 (as stored internally) to UTF-16 with
> the default system/libiconv endianness -> Problem (if the user
> requires the opposite endianness for any reason in his project). As a
> note, the user can still see human-readable diffs on that file.
> 
> - Placing 'test.txt working-tree-encoding=UTF-16LE' or 'test.txt
> working-tree-encoding=UTF-16BE' for a new test.txt file with either
> UTF-16 BE or LE BOM, and committing everything: we assume the user is
> doing this because he requires that exact endianness, thus he writes it
> in order to attempt preserving it -> Git prohibits committing it; also,
> no human-readable diff is shown in the diff viewer/tool being used, and
> the file is simply shown as binary.
> 
> New behaviour:
> 
> -  Just got too lazy to repeat it all over, read my PR description:
> https://github.com/git/git/pull/550

"In this PR: Git only prohibits the BOM opposite to the one declared in 
working-tree-encoding (e.g. if declared LE, then it denies BE BOM 
presence within the associated file, of the declared UTF-16/UTF-32). 
This way the user can now perform Git operations which were previously 
impossible, with the only requirement being to match the endianness of the 
working-tree-encoding attribute with the associated file/s."
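
To check my reading of the two rules (the current behaviour listed in 
the quoted mail and the one in the PR description), here is a minimal 
Python sketch. It is not the code in utf8.c; the helper names 
`prohibited_before`, `prohibited_after` and `opposite` are hypothetical 
and only model the behaviour as described in this thread.

```python
import codecs

# BOM byte sequences for the encodings that declare an endianness.
BOMS = {
    "UTF-16BE": codecs.BOM_UTF16_BE,
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-32BE": codecs.BOM_UTF32_BE,
    "UTF-32LE": codecs.BOM_UTF32_LE,
}

def opposite(enc):
    """Return the same encoding name with the other byte order."""
    return enc[:-2] + ("LE" if enc.endswith("BE") else "BE")

def prohibited_before(enc, data):
    """Current rule as described: with a declared endianness, any BOM
    of that width (matching or not) causes the file to be rejected."""
    enc = enc.upper()
    if enc not in BOMS:
        return False  # e.g. plain "UTF-16": a BOM is not rejected here
    return data.startswith(BOMS[enc]) or data.startswith(BOMS[opposite(enc)])

def prohibited_after(enc, data):
    """Rule from the PR description: only the BOM of the opposite
    byte order is rejected; a matching BOM is accepted."""
    enc = enc.upper()
    if enc not in BOMS:
        return False
    return data.startswith(BOMS[opposite(enc)])

# A UTF-16LE file that starts with a little-endian BOM.
le_file = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(prohibited_before("UTF-16LE", le_file))
print(prohibited_after("UTF-16LE", le_file))
print(prohibited_after("UTF-16BE", le_file))
```

Under these assumptions, the only case whose outcome changes is a file 
whose BOM matches the declared endianness.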

> 
> - Git translations may need to be tweaked in order to be consistent
> with the new behaviour.
> 
> Thanks for your attention.
> 
-- 
Philip
