[PATCH 0/2] Improve documentation on UTF-16

* [PATCH 0/2] Improve documentation on UTF-16
@ 2018-12-27  2:17 brian m. carlson
  2018-12-27  2:17 ` [PATCH 1/2] Documentation: document UTF-16-related behavior brian m. carlson
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: brian m. carlson @ 2018-12-27  2:17 UTC (permalink / raw)
  To: git; +Cc: Lars Schneider, Torsten Bögershausen

We've recently fielded several reports from unhappy Windows users about
our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be
suitable for certain Windows programs.

In an effort to communicate the reasons for our behavior more
effectively, explain in the documentation that the UTF-16 variant that
people have been asking for hasn't been standardized, and therefore
hasn't been implemented in iconv(3). Mention what each of the variants
does, so that people can decide which one best meets their needs.

In addition, add a comment in the code about why we must, for
correctness reasons, reject a UTF-16LE or UTF-16BE sequence that begins
with U+FEFF, namely that such a codepoint semantically represents a
ZWNBSP, not a BOM, but that that codepoint at the beginning of a UTF-8
sequence (as encoded in the object store) would be misinterpreted as a
BOM instead.
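
The check described above can be sketched roughly as follows. This is
only an illustration and not the code in utf8.c: the function name
sketch_has_prohibited_bom is hypothetical, encoding names are compared
with a plain strcmp (an assumption; real callers may spell names
differently), and only the byte order matching the named encoding is
tested.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: return 1 if "enc" names an
 * endianness-specific UTF-16 encoding and "data" nevertheless begins
 * with U+FEFF serialized in that byte order, 0 otherwise.
 */
int sketch_has_prohibited_bom(const char *enc, const char *data, size_t len)
{
	if (len < 2)
		return 0;
	/* U+FEFF is FE FF in big-endian order. */
	if (!strcmp(enc, "UTF-16BE"))
		return !memcmp(data, "\xFE\xFF", 2);
	/* U+FEFF is FF FE in little-endian order. */
	if (!strcmp(enc, "UTF-16LE"))
		return !memcmp(data, "\xFF\xFE", 2);
	/* Plain "UTF-16" expects a BOM, so nothing is prohibited here. */
	return 0;
}
```

A caller would pass the declared encoding together with the raw
working-tree bytes and refuse the conversion when the result is 1.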

This comment is in the code because I think it needs to be somewhere,
but I'm not sure the documentation is the right place for it. If
desired, I can add it to the documentation, although I feel the lurid
details are not interesting to most users. If the wording is confusing,
I'm very open to hearing suggestions for how to improve it.

I don't use Windows, so I don't know what MSVCRT does. If it requires a
BOM but doesn't accept big-endian encoding, then perhaps we should
report that as a bug to Microsoft so it can be fixed in a future
version. That would probably make a lot more programs work right out of
the box and dramatically improve the user experience.

As a note, I'm currently on vacation through the 2nd, so my responses
may be slightly delayed.

brian m. carlson (2):
  Documentation: document UTF-16-related behavior
  utf8: add comment explaining why BOMs are rejected

 Documentation/gitattributes.txt | 5 +++++
 utf8.c                          | 7 +++++++
 2 files changed, 12 insertions(+)

* [PATCH 1/2] Documentation: document UTF-16-related behavior
  2018-12-27  2:17 [PATCH 0/2] Improve documentation on UTF-16 brian m. carlson
@ 2018-12-27  2:17 ` brian m. carlson
  2018-12-27  2:17 ` [PATCH 2/2] utf8: add comment explaining why BOMs are rejected brian m. carlson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: brian m. carlson @ 2018-12-27  2:17 UTC (permalink / raw)
  To: git; +Cc: Lars Schneider, Torsten Bögershausen

There are a number of broken Windows programs which want to process
files in a UTF-16 variant that is always little endian and always
contains a BOM. Git cannot produce or accept such an encoding for the
working-tree-encoding because no such encoding has been defined with
IANA or implemented in iconv(3).

Document this behavior since it is a frequent source of confusion for
users. Additionally, document that specifying "UTF-16" may produce bytes
of either endianness, but will be sure to provide a BOM to distinguish.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitattributes.txt | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..2b2c93afd1 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -330,6 +330,11 @@ That operation will fail and cause an error.
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
+- It is not possible to specify a variant of UTF-16 with a BOM and a
+  specified endianness, because no such variants have been standardized.
+  Using "UTF-16" will produce a BOM with an unspecified endianness, and
+  using "UTF-16LE" or "UTF-16BE" will prohibit a BOM from being used.
+
 Use the `working-tree-encoding` attribute only if you cannot store a file
 in UTF-8 encoding and if you want Git to be able to process the content
 as text.

* [PATCH 2/2] utf8: add comment explaining why BOMs are rejected
  2018-12-27  2:17 [PATCH 0/2] Improve documentation on UTF-16 brian m. carlson
  2018-12-27  2:17 ` [PATCH 1/2] Documentation: document UTF-16-related behavior brian m. carlson
@ 2018-12-27  2:17 ` brian m. carlson
  2018-12-27 10:06 ` [PATCH 0/2] Improve documentation on UTF-16 Johannes Sixt
  2018-12-28  8:46 ` Ævar Arnfjörð Bjarmason
  3 siblings, 0 replies; 12+ messages in thread
From: brian m. carlson @ 2018-12-27  2:17 UTC (permalink / raw)
  To: git; +Cc: Lars Schneider, Torsten Bögershausen

A source of confusion for many Git users is why UTF-16LE and UTF-16BE do
not allow a BOM, instead treating it as a ZWNBSP, according to the
Unicode FAQ[0]. Explain in a comment why we cannot allow that to occur
due to our use of UTF-8 internally.

[0] https://unicode.org/faq/utf_bom.html#bom9

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 utf8.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/utf8.c b/utf8.c
index eb78587504..22af2c485a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -571,6 +571,13 @@ static const char utf16_le_bom[] = {'\xFF', '\xFE'};
 static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
 static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
 
+/*
+ * We check here for a forbidden BOM. When using UTF-16BE or UTF-16LE, a BOM is
+ * not allowed by RFC 2781, and any U+FEFF would be treated as a ZWNBSP, not a
+ * BOM. However, because we encode into UTF-8 internally, we cannot allow that
+ * character to occur as a ZWNBSP, since when encoded into UTF-8 it would be
+ * interpreted as a BOM.
+ */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-27  2:17 [PATCH 0/2] Improve documentation on UTF-16 brian m. carlson
  2018-12-27  2:17 ` [PATCH 1/2] Documentation: document UTF-16-related behavior brian m. carlson
  2018-12-27  2:17 ` [PATCH 2/2] utf8: add comment explaining why BOMs are rejected brian m. carlson
@ 2018-12-27 10:06 ` Johannes Sixt
  2018-12-27 16:43   ` brian m. carlson
  2018-12-28  8:46 ` Ævar Arnfjörð Bjarmason
  3 siblings, 1 reply; 12+ messages in thread
From: Johannes Sixt @ 2018-12-27 10:06 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Lars Schneider, Torsten Bögershausen

Am 27.12.18 um 03:17 schrieb brian m. carlson:
> We've recently fielded several reports from unhappy Windows users about
> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be
> suitable for certain Windows programs.
> 
> In an effort to communicate the reasons for our behavior more
> effectively, explain in the documentation that the UTF-16 variant that
> people have been asking for hasn't been standardized, and therefore
> hasn't been implemented in iconv(3). Mention what each of the variants
> do, so that people can make a decision which one meets their needs the
> best.
> 
> In addition, add a comment in the code about why we must, for
> correctness reasons, reject a UTF-16LE or UTF-16BE sequence that begins
> with U+FEFF, namely that such a codepoint semantically represents a
> ZWNBSP, not a BOM, but that that codepoint at the beginning of a UTF-8
> sequence (as encoded in the object store) would be misinterpreted as a
> BOM instead.
> 
> This comment is in the code because I think it needs to be somewhere,
> but I'm not sure the documentation is the right place for it. If
> desired, I can add it to the documentation, although I feel the lurid
> details are not interesting to most users. If the wording is confusing,
> I'm very open to hearing suggestions for how to improve it.
> 
> I don't use Windows, so I don't know what MSVCRT does. If it requires a
> BOM but doesn't accept big-endian encoding, then perhaps we should
> report that as a bug to Microsoft so it can be fixed in a future
> version. That would probably make a lot more programs work right out of
> the box and dramatically improve the user experience.

It worries me that theoretical correctness is valued more highly than
existing practice. I do not care much what some RFC says programs
should do if the majority of software does something different and
that behavior has proven useful in practice.

My understanding is that there is no such thing as a "byte order 
marker". It just so happens that when the first character in some UTF-16 
text file begins with a ZWNBSP, then it is possible to derive the 
endianness of the file automatically. Other than that, that very first
code point U+FEFF *is part of the data* and must not be removed when the 
data is reencoded. If Git does something different, it is bogus, IMO.

-- Hannes

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-27 10:06 ` [PATCH 0/2] Improve documentation on UTF-16 Johannes Sixt
@ 2018-12-27 16:43   ` brian m. carlson
  2018-12-27 19:55     ` Johannes Sixt
  0 siblings, 1 reply; 12+ messages in thread
From: brian m. carlson @ 2018-12-27 16:43 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Lars Schneider, Torsten Bögershausen

On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
> It worries me that theoretical correctness is regarded higher than existing
> practice. I do not care a lot what some RFC tells what programs should do if
> the majority of the software does something different and that behavior has
> been proven useful in practice.

The majority of OSes produce the behavior I document here, and they are
the majority of systems on the Internet. Windows is the outlier here,
although a significant one. It is a common user of UTF-16 and its
variants, but so are Java and JavaScript, and they're present on a lot
of devices. Swallowing the U+FEFF would break compatibility with those
systems.

The issue that Windows users are seeing is that libiconv always produces
big-endian data for UTF-16, and they always want little-endian. glibc
produces native-endian data, which is what Windows users want. Git for
Windows could patch libiconv to do that (and that is the simple,
five-minute solution to this problem), but we'd still want to warn
people that they're relying on unspecified behavior, hence this series.

I would even be willing to patch Git for Windows's libiconv if somebody
could point me to the repo (although I obviously cannot test it, not
being a Windows user). I feel strongly, though, that fixing this is
outside of the scope of Git proper, and it's not a thing we should be
handling here.

> My understanding is that there is no such thing as a "byte order marker". It
> just so happens that when the first character in some UTF-16 text file
> begins with a ZWNBSP, then it is possible to derive the endianness of the
> file automatically. Other than that, that very first code point U+FEFF *is
> part of the data* and must not be removed when the data is reencoded. If Git
> does something different, it is bogus, IMO.

You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
the text, as would a second one be if we had two at the beginning of a
UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
U+FEFF, which has the wrong semantics.

To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
at the beginning of a UTF-16LE or UTF-16BE sequence and ensure we encode
an extra U+FEFF at the beginning of the UTF-8 data (one for BOM and one
for the text) and then strip it off when we decode. That's kind of ugly,
and since iconv doesn't do that itself, we'd have to.
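
For illustration only, the workaround described above might look
roughly like this on the UTF-8 side. Nothing here exists in Git or
iconv: the names sketch_guard_zwnbsp and sketch_unguard_offset are
hypothetical, and the sketch assumes the input has already been
converted from UTF-16LE or UTF-16BE to UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* U+FEFF encoded as UTF-8. */
static const unsigned char sketch_utf8_feff[3] = {0xEF, 0xBB, 0xBF};

/* Return 1 if the UTF-8 buffer begins with U+FEFF. */
int sketch_starts_with_feff(const unsigned char *utf8, size_t len)
{
	return len >= 3 && !memcmp(utf8, sketch_utf8_feff, 3);
}

/*
 * Toward the object store: if the converted text begins with U+FEFF
 * (a ZWNBSP in the UTF-16LE/BE source), prepend one more U+FEFF to act
 * as the BOM. "out" must have room for len + 3 bytes. Returns the
 * number of bytes written.
 */
size_t sketch_guard_zwnbsp(const unsigned char *utf8, size_t len,
			   unsigned char *out)
{
	size_t o = 0;

	if (sketch_starts_with_feff(utf8, len)) {
		memcpy(out, sketch_utf8_feff, 3);
		o = 3;
	}
	memcpy(out + o, utf8, len);
	return o + len;
}

/*
 * Toward the working tree: return how many leading bytes to skip
 * before converting back to UTF-16LE/BE (3 for a leading BOM, else 0).
 */
size_t sketch_unguard_offset(const unsigned char *utf8, size_t len)
{
	return sketch_starts_with_feff(utf8, len) ? 3 : 0;
}
```

The extra bookkeeping in both directions is the ugliness referred to
above.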
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-27 16:43   ` brian m. carlson
@ 2018-12-27 19:55     ` Johannes Sixt
  2018-12-27 23:45       ` brian m. carlson
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Sixt @ 2018-12-27 19:55 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Lars Schneider, Torsten Bögershausen

Am 27.12.18 um 17:43 schrieb brian m. carlson:
> On Thu, Dec 27, 2018 at 11:06:17AM +0100, Johannes Sixt wrote:
>> It worries me that theoretical correctness is regarded higher than existing
>> practice. I do not care a lot what some RFC tells what programs should do if
>> the majority of the software does something different and that behavior has
>> been proven useful in practice.
> 
> The majority of OSes produce the behavior I document here, and they are
> the majority of systems on the Internet. Windows is the outlier here,
> although a significant one. It is a common user of UTF-16 and its
> variants, but so are Java and JavaScript, and they're present on a lot
> of devices. Swallowing the U+FEFF would break compatibility with those
> systems.
> 
> The issue that Windows users are seeing is that libiconv always produces
> big-endian data for UTF-16, and they always want little-endian. glibc
> produces native-endian data, which is what Windows users want. Git for
> Windows could patch libiconv to do that (and that is the simple,
> five-minute solution to this problem), but we'd still want to warn
> people that they're relying on unspecified behavior, hence this series.
> 
> I would even be willing to patch Git for Windows's libiconv if somebody
> could point me to the repo (although I obviously cannot test it, not
> being a Windows user). I feel strongly, though, that fixing this is
> outside of the scope of Git proper, and it's not a thing we should be
> handling here.

Please excuse me for leaving most of what you said without comment; I
am not deep in the matter and don't have a firm understanding of all
the issues. I'll just trust that what you said is sound.

Just one thing: Please do the count by *users* (or existing files, or
number of characters exchanged, or something similar); do not just count
OSs. I mean, Windows is *not* the outlier if it handles 90% of the
UTF-16 data in the world. (I'm just making up numbers here, but I think
you get the point.)

>> My understanding is that there is no such thing as a "byte order marker". It
>> just so happens that when the first character in some UTF-16 text file
>> begins with a ZWNBSP, then it is possible to derive the endianness of the
>> file automatically. Other than that, that very first code point U+FEFF *is
>> part of the data* and must not be removed when the data is reencoded. If Git
>> does something different, it is bogus, IMO.
> 
> You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
> the text, as would a second one be if we had two at the beginning of a
> UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
> U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
> U+FEFF, which has the wrong semantics.
> 
> To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
> at the beginning of a UTF-16LE or UTF-16BE sequence and ensure we encode
> an extra U+FEFF at the beginning of the UTF-8 data (one for BOM and one
> for the text) and then strip it off when we decode. That's kind of ugly,
> and since iconv doesn't do that itself, we'd have to.

But why do you add another U+FEFF on the way to UTF-8? There is one in
the incoming UTF-16 data, and only *that* one must be converted. If
there is no U+FEFF in the UTF-16 data, there should not be one in UTF-8,
either. Puzzled...

-- Hannes

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-27 19:55     ` Johannes Sixt
@ 2018-12-27 23:45       ` brian m. carlson
  2018-12-28  8:59         ` Johannes Sixt
  0 siblings, 1 reply; 12+ messages in thread
From: brian m. carlson @ 2018-12-27 23:45 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Lars Schneider, Torsten Bögershausen

On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
> Am 27.12.18 um 17:43 schrieb brian m. carlson:
> > You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
> > the text, as would a second one be if we had two at the beginning of a
> > UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
> > U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
> > U+FEFF, which has the wrong semantics.
> > 
> > To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
> > at the beginning of a UTF-16LE or UTF-16BE sequence and ensure we encode
> > an extra U+FEFF at the beginning of the UTF-8 data (one for BOM and one
> > for the text) and then strip it off when we decode. That's kind of ugly,
> > and since iconv doesn't do that itself, we'd have to.
> 
> But why do you add another U+FEFF on the way to UTF-8? There is one in the
> incoming UTF-16 data, and only *that* one must be converted. If there is no
> U+FEFF in the UTF-16 data, there should not be one in UTF-8, either.
> Puzzled...

So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
must not be a BOM. So if we do this:

  $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
  00000000: ff fe ff fe 0a 00                                ......

That U+FEFF we have in the input is part of the text as a ZWNBSP; it is
not a BOM. We end up with two U+FEFF values. The first is the BOM that's
required as part of UTF-16. The second is semantically part of the text
and has the semantics of a zero-width non-breaking space.

In UTF-8, if the sequence starts with U+FEFF, it has the semantics of a
BOM just like in UTF-16 (except that it's optional): it's not part of
the text, and should be stripped off. So when we receive a UTF-16LE or
UTF-16BE sequence and it contains a U+FEFF (which is part of the text),
we need to insert a BOM in front of the sequence that's part of the text
to keep the semantics.

Essentially, we have this situation:

Text (in memory):  U+FEFF U+000A
Semantics of text: ZWNBSP NL
UTF-16BE:          FE FF  00 0A
Semantics:         ZWNBSP NL
UTF-16:            FE FF FE FF  00 0A
Semantics:         BOM   ZWNBSP NL
UTF-8:             EF BB BF EF BB BF 0A
Semantics:         BOM      ZWNBSP   NL

If you don't have a U+FEFF, then things can be simpler:

Text (in memory):  U+0041 U+0042 U+0043
Semantics of text: A      B      C
UTF-16BE:          00 41 00 42 00 43
Semantics:         A     B     C
UTF-16:            FE FF 00 41 00 42 00 43
Semantics:         BOM   A     B     C
UTF-8:             41 42 43
Semantics:         A  B  C
UTF-8 (optional):  EF BB BF 41 42 43
Semantics:         BOM      A  B  C

(I have picked big-endian UTF-16 here, but little-endian is fine, too;
this is just easier for me to type.)
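
The byte sequences in the tables above can be reproduced with a small
C sketch. The function names are hypothetical, only code points in the
Basic Multilingual Plane are handled (no surrogate pairs), and no
validation is done; it is meant purely to show where the extra U+FEFF
comes from.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Serialize BMP code points as big-endian UTF-16. With with_bom set,
 * a BOM (FE FF) is written first, as plain "UTF-16" requires.
 * Returns the number of bytes written to "out".
 */
size_t sketch_encode_utf16be(const uint32_t *cp, size_t n, int with_bom,
			     unsigned char *out)
{
	size_t o = 0;

	if (with_bom) {
		out[o++] = 0xFE;
		out[o++] = 0xFF;
	}
	for (size_t i = 0; i < n; i++) {
		out[o++] = (unsigned char)(cp[i] >> 8);
		out[o++] = (unsigned char)(cp[i] & 0xFF);
	}
	return o;
}

/*
 * Serialize BMP code points as UTF-8. With with_bom set, a BOM
 * (EF BB BF) is written first. Returns the number of bytes written.
 */
size_t sketch_encode_utf8(const uint32_t *cp, size_t n, int with_bom,
			  unsigned char *out)
{
	size_t o = 0;

	if (with_bom) {
		out[o++] = 0xEF;
		out[o++] = 0xBB;
		out[o++] = 0xBF;
	}
	for (size_t i = 0; i < n; i++) {
		uint32_t c = cp[i];

		if (c < 0x80) {
			out[o++] = (unsigned char)c;
		} else if (c < 0x800) {
			out[o++] = (unsigned char)(0xC0 | (c >> 6));
			out[o++] = (unsigned char)(0x80 | (c & 0x3F));
		} else {
			out[o++] = (unsigned char)(0xE0 | (c >> 12));
			out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			out[o++] = (unsigned char)(0x80 | (c & 0x3F));
		}
	}
	return o;
}
```

Called with the code points U+FEFF U+000A and with_bom set, the UTF-8
function writes EF BB BF EF BB BF 0A: the first three bytes are the
BOM and the next three are the ZWNBSP that belongs to the text.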

This is all a huge edge case involving correctly serializing code
points. By rejecting U+FEFF in UTF-16BE and UTF-16LE, we don't have to
deal with any of it.

As mentioned, I think patching Git for Windows's iconv is the smallest,
most achievable solution to this, because it means we don't have to
handle any of this edge case ourselves. Windows and WSL users can both
write "UTF-16" and get a BOM and little-endian behavior, while we can
delegate all the rest of the encoding stuff to libiconv.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-27  2:17 [PATCH 0/2] Improve documentation on UTF-16 brian m. carlson
                   ` (2 preceding siblings ...)
  2018-12-27 10:06 ` [PATCH 0/2] Improve documentation on UTF-16 Johannes Sixt
@ 2018-12-28  8:46 ` Ævar Arnfjörð Bjarmason
  2018-12-28 20:35   ` Philip Oakley
  2018-12-29 23:17   ` brian m. carlson
  3 siblings, 2 replies; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-28  8:46 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Lars Schneider, Torsten Bögershausen


On Thu, Dec 27 2018, brian m. carlson wrote:

> We've recently fielded several reports from unhappy Windows users about
> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be
> suitable for certain Windows programs.

Just for context, is "we" here $DAYJOB or a reference to some previous
ML thread(s) on this list, or something else?

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-27 23:45       ` brian m. carlson
@ 2018-12-28  8:59         ` Johannes Sixt
  2018-12-28 20:31           ` Philip Oakley
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Sixt @ 2018-12-28  8:59 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Lars Schneider, Torsten Bögershausen

Am 28.12.18 um 00:45 schrieb brian m. carlson:
> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>> But why do you add another U+FEFF on the way to UTF-8? There is one in the
>> incoming UTF-16 data, and only *that* one must be converted. If there is no
>> U+FEFF in the UTF-16 data, there should not be one in UTF-8, either.
>> Puzzled...
> 
> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
> must not be a BOM. So if we do this:
> 
>    $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>    00000000: ff fe ff fe 0a 00                                ......

What sort of braindamage is this? Fix iconv.

But as I said, I'm not an expert. I just vented my worries that 
widespread existing practice would be ignored under the excuse "you are 
the outlier".

-- Hannes

* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-28  8:59         ` Johannes Sixt
@ 2018-12-28 20:31           ` Philip Oakley
  0 siblings, 0 replies; 12+ messages in thread
From: Philip Oakley @ 2018-12-28 20:31 UTC (permalink / raw)
  To: Johannes Sixt, brian m. carlson
  Cc: git, Lars Schneider, Torsten Bögershausen

On 28/12/2018 08:59, Johannes Sixt wrote:
> Am 28.12.18 um 00:45 schrieb brian m. carlson:
>> On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
>>> But why do you add another U+FEFF on the way to UTF-8? There is one in the
>>> incoming UTF-16 data, and only *that* one must be converted. If there is no
>>> U+FEFF in the UTF-16 data, there should not be one in UTF-8, either.
>>> Puzzled...
>>
>> So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
>> must not be a BOM. So if we do this:
>>
>>    $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
>>    00000000: ff fe ff fe 0a 00 ......
>
> What sort of braindamage is this? Fix iconv.
>
> But as I said, I'm not an expert. I just vented my worries that 
> widespread existing practice would be ignored under the excuse "you 
> are the outlier".
>
> -- Hannes

For ref, I dug out a Microsoft document [1] on its view of BOMs, which
can be compared to the ref [0] Brian gave.

[1] https://docs.microsoft.com/en-us/windows/desktop/intl/using-byte-order-marks

[0] https://unicode.org/faq/utf_bom.html#bom9

Maybe the documentation patch ([PATCH 1/2] Documentation: document
UTF-16-related behavior) should include the phrase ", because we encode
into UTF-8 internally,", and a link to ref [0], and maybe [1].


Whether the various Windows programs actually follow the Microsoft
convention is another matter altogether.

-- 

Philip



* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-28  8:46 ` Ævar Arnfjörð Bjarmason
@ 2018-12-28 20:35   ` Philip Oakley
  2018-12-29 23:17   ` brian m. carlson
  1 sibling, 0 replies; 12+ messages in thread
From: Philip Oakley @ 2018-12-28 20:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, brian m. carlson
  Cc: git, Lars Schneider, Torsten Bögershausen

On 28/12/2018 08:46, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Dec 27 2018, brian m. carlson wrote:
>
>> We've recently fielded several reports from unhappy Windows users about
>> our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be
>> suitable for certain Windows programs.
> Just for context, is "we" here $DAYJOB or a reference to some previous
> ML thread(s) on this list, or something else?


I think 
https://public-inbox.org/git/CADN+U_PUfnYWb-wW6drRANv-ZaYBEk3gWHc7oJtxohA5Vc3NEg@mail.gmail.com/ 
was the most recent on the Git list.

-- 

Philip


* Re: [PATCH 0/2] Improve documentation on UTF-16
  2018-12-28  8:46 ` Ævar Arnfjörð Bjarmason
  2018-12-28 20:35   ` Philip Oakley
@ 2018-12-29 23:17   ` brian m. carlson
  1 sibling, 0 replies; 12+ messages in thread
From: brian m. carlson @ 2018-12-29 23:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Lars Schneider, Torsten Bögershausen

On Fri, Dec 28, 2018 at 09:46:18AM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Dec 27 2018, brian m. carlson wrote:
> 
> > We've recently fielded several reports from unhappy Windows users about
> > our handling of UTF-16, UTF-16LE, and UTF-16BE, none of which seem to be
> > suitable for certain Windows programs.
> 
> Just for context, is "we" here $DAYJOB or a reference to some previous
> ML thread(s) on this list, or something else?

"We" in this case is the Git list. I think the list has seen at least
three threads in recent months.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
