Re: [PATCH 0/2] Improve documentation on UTF-16

From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Johannes Sixt <j6t@kdbg.org>
Cc: git@vger.kernel.org, "Lars Schneider" <larsxschneider@gmail.com>,
	"Torsten Bögershausen" <tboegi@web.de>
Subject: Re: [PATCH 0/2] Improve documentation on UTF-16
Date: Thu, 27 Dec 2018 16:43:53 +0000
Message-ID: <20181227164353.GC423984@genre.crustytoothpaste.net>
In-Reply-To: <93f0a854-9b8d-500c-b015-59c50ecdb0f3@kdbg.org>

On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
> It worries me that theoretical correctness is regarded higher than existing
> practice. I do not care a lot what some RFC tells what programs should do if
> the majority of the software does something different and that behavior has
> been proven useful in practice.

The majority of OSes produce the behavior I document here, and they are
the majority of systems on the Internet. Windows is the outlier here,
although a significant one. It is a common user of UTF-16 and its
variants, but so are Java and JavaScript, and they're present on a lot
of devices. Swallowing the U+FEFF would break compatibility with those
systems.

The issue that Windows users are seeing is that libiconv always produces
big-endian data for UTF-16, and they always want little-endian. glibc
produces native-endian data, which is what Windows users want. Git for
Windows could patch libiconv to do that (and that is the simple,
five-minute solution to this problem), but we'd still want to warn
people that they're relying on unspecified behavior, hence this series.
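
As a rough illustration of why the byte order matters (a sketch using
Python's built-in codecs as a stand-in for iconv; this is not Git or
libiconv code, and the variable names are only for illustration), the
same text has two different UTF-16 byte sequences, and only an explicit
encoding name or a BOM tells them apart:

```python
import codecs

text = "hi"

# Explicit byte order: no BOM is written, and the result is unambiguous.
le = text.encode("utf-16-le")  # 68 00 69 00
be = text.encode("utf-16-be")  # 00 68 00 69

# A plain "UTF-16" stream that starts with a BOM decodes to the same
# text whichever byte order the producer chose.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")

print(le.hex(" "), "|", be.hex(" "))
print(from_le == from_be == text)
```

With the explicit names the producer and the consumer agree by
construction; with a bare "UTF-16" the byte order is left up to the
converter, which is the situation described above.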

I would even be willing to patch Git for Windows's libiconv if somebody
could point me to the repo (although I obviously cannot test it, not
being a Windows user). I feel strongly, though, that fixing this is
outside of the scope of Git proper, and it's not a thing we should be
handling here.

> My understanding is that there is no such thing as a "byte order marker". It
> just so happens that when the first character in some UTF-16 text file
> begins with a ZWNBSP, then it is possible to derive the endianness of the
> file automatically. Other than that, that very first code point U+FEFF *is
> part of the data* and must not be removed when the data is reencoded. If Git
> does something different, it is bogus, IMO.

You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
the text, as would a second one be if we had two at the beginning of a
UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
U+FEFF, which has the wrong semantics: a reader will take it for a BOM
rather than for part of the text.
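
A small sketch of that distinction, again using Python's codecs purely
as a stand-in (this is an illustration under that assumption, not what
Git or iconv does internally):

```python
import codecs

# Bytes FF FE 61 00: a U+FEFF followed by "a", in little-endian order.
raw = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")

# Declared as plain UTF-16, the leading U+FEFF is a BOM and is consumed.
as_utf16 = raw.decode("utf-16")

# Declared as UTF-16LE, the leading U+FEFF is ordinary text and is kept.
as_utf16le = raw.decode("utf-16-le")

# Re-encoding the UTF-16LE text yields a single U+FEFF in UTF-8, which a
# BOM-aware UTF-8 reader treats as a BOM and drops.
utf8 = as_utf16le.encode("utf-8")
bom_aware = utf8.decode("utf-8-sig")

print(repr(as_utf16), repr(as_utf16le), utf8, repr(bom_aware))
```

The character that was text under UTF-16LE is gone after the trip
through UTF-8 if the reader is BOM-aware.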

To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
at the beginning of a UTF-16LE or UTF-16BE sequence and ensure we encode
an extra U+FEFF at the beginning of the UTF-8 data (one for BOM and one
for the text) and then strip it off when we decode. That's kind of ugly,
and since iconv doesn't do that itself, we'd have to do it ourselves.
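
For concreteness, the scheme described above could look roughly like the
following. This is a hypothetical sketch in Python rather than Git's C;
to_utf8 and from_utf8 are made-up helper names, and Python's codecs
stand in for iconv:

```python
BOM = "\ufeff"

def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Convert UTF-16LE/BE bytes to UTF-8, doubling a leading U+FEFF."""
    text = raw.decode(encoding)  # utf-16-le / utf-16-be keep a leading U+FEFF
    if text.startswith(BOM):
        text = BOM + text        # one for the BOM, one for the text
    return text.encode("utf-8")

def from_utf8(stored: bytes, encoding: str) -> bytes:
    """Reverse to_utf8: drop the extra leading U+FEFF, then re-encode."""
    text = stored.decode("utf-8")  # the plain utf-8 codec does not strip a BOM
    if text.startswith(BOM):
        text = text[1:]
    return text.encode(encoding)

original = "\ufeffa".encode("utf-16-le")
stored = to_utf8(original, "utf-16-le")
restored = from_utf8(stored, "utf-16-le")
print(stored, restored == original)
```

The round trip preserves the leading U+FEFF, at the cost of special
handling on both the encode and the decode side.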
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
