On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
> On 27.12.18 at 17:43, brian m. carlson wrote:
> > You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part
> > of the text, as would a second one be if we had two at the beginning
> > of a UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and
> > places a U+FEFF at the beginning of it, when we encode to UTF-8, we
> > emit only one U+FEFF, which has the wrong semantics.
> >
> > To be correct here and accept a U+FEFF, we'd need to check for a
> > U+FEFF at the beginning of a UTF-16LE or UTF-16BE sequence and ensure
> > we encode an extra U+FEFF at the beginning of the UTF-8 data (one for
> > the BOM and one for the text) and then strip it off when we decode.
> > That's kind of ugly, and since iconv doesn't do that itself, we'd
> > have to.
>
> But why do you add another U+FEFF on the way to UTF-8? There is one in
> the incoming UTF-16 data, and only *that* one must be converted. If
> there is no U+FEFF in the UTF-16 data, there should not be one in
> UTF-8, either. Puzzled...

So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
must not be a BOM. So if we do this:

  $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
  00000000: ff fe ff fe 0a 00                                ......

the U+FEFF we have in the input is part of the text as a ZWNBSP; it is
not a BOM. We end up with two U+FEFF values: the first is the BOM that's
required as part of UTF-16, and the second is semantically part of the
text, with the semantics of a zero-width non-breaking space.

In UTF-8, if the sequence starts with U+FEFF, it has the semantics of a
BOM just like in UTF-16 (except that it's optional): it's not part of
the text, and should be stripped off. So when we receive a UTF-16LE or
UTF-16BE sequence and it contains a U+FEFF (which is part of the text),
we need to insert a BOM in front of the sequence so that the U+FEFF
that's part of the text keeps its semantics. Essentially, we have this
situation:

  Text (in memory):  U+FEFF U+000A
  Semantics of text: ZWNBSP NL

  UTF-16BE:          FE FF 00 0A
  Semantics:         ZWNBSP NL

  UTF-16:            FE FF FE FF 00 0A
  Semantics:         BOM   ZWNBSP NL

  UTF-8:             EF BB BF EF BB BF 0A
  Semantics:         BOM      ZWNBSP   NL

If you don't have a U+FEFF, then things can be simpler:

  Text (in memory):  U+0041 U+0042 U+0043
  Semantics of text: A      B      C

  UTF-16BE:          00 41 00 42 00 43
  Semantics:         A     B     C

  UTF-16:            FE FF 00 41 00 42 00 43
  Semantics:         BOM   A     B     C

  UTF-8:             41 42 43
  Semantics:         A  B  C

  UTF-8 (optional):  EF BB BF 41 42 43
  Semantics:         BOM      A  B  C

(I have picked big-endian UTF-16 here, but little-endian is fine, too;
this is just easier for me to type.)

This is all a huge edge case involving correctly serializing code
points. By rejecting a U+FEFF at the start of UTF-16BE and UTF-16LE
input, we don't have to deal with any of it.

As mentioned, I think patching Git for Windows's iconv is the smallest,
most achievable solution to this, because it means we don't have to
handle any of this edge case ourselves. Windows and WSL users can both
write "UTF-16" and get a BOM and little-endian behavior, while we can
delegate all the rest of the encoding stuff to libiconv.
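(For concreteness, "writing UTF-16" here means setting the
working-tree-encoding attribute in .gitattributes, with a line like the
one below; the *.ps1 pattern is only an illustrative example:)

  *.ps1 text working-tree-encoding=UTF-16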
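And to make the wrong-semantics case from the top of this message
concrete: converting that same ZWNBSP input straight to UTF-8 should
produce the bytes below (this is what I'd expect from glibc's iconv;
the payload is just the UTF-8 encoding of U+FEFF followed by the
newline, though xxd's column layout may differ on other systems):

  $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-8 | xxd -g1
  00000000: ef bb bf 0a                                      ....

Those leading bytes EF BB BF are indistinguishable from a UTF-8 BOM, so
a decoder that strips an optional UTF-8 BOM will silently drop a
character that was part of the text.

-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204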