On 2021-06-02 at 10:50:53, Ævar Arnfjörð Bjarmason wrote:
> I debugged this a bit more, it's probably *also* an issue in our use of
> libiconv, but it goes wrong just with our test setup with
> iconv(1). I.e. on my boring linux box:
>     
>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>     $VAR1 = [
>               '0xff',
>               '0xfe',
>               '0x78',
>               '0x0',
>               '0xa',
>               '0x0'
>             ];
> 

This is a little-endian encoding of UTF-16 with a BOM.  The BOM is
required here since the default, if no BOM is provided, is big endian.
However, as I alluded to in 79444c92943, while the standard permits the
BOM to be omitted, doing so is generally improvident because it leads to
breakage when interoperating with Windows machines, where many programs
assume little endian.
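
As a rough illustration of the byte layouts involved, here is a minimal
Python sketch.  Python is used only because its explicit-endian codecs
behave the same on every platform; the variable names are made up for
this example and nothing here comes from Git's code.

```python
import codecs

text = "x\n"

# The explicit-endian codecs never emit a BOM.
le = text.encode("utf-16-le")   # 78 00 0a 00
be = text.encode("utf-16-be")   # 00 78 00 0a

# Prepending the BOM by hand gives the two self-describing forms.
le_with_bom = codecs.BOM_UTF16_LE + le   # ff fe 78 00 0a 00
be_with_bom = codecs.BOM_UTF16_BE + be   # fe ff 00 78 00 0a

for name, data in [
    ("LE, no BOM", le),
    ("BE, no BOM", be),
    ("LE with BOM", le_with_bom),
    ("BE with BOM", be_with_bom),
]:
    print(f"{name:12}", " ".join(f"{b:02x}" for b in data))
```

The "LE with BOM" row matches the Linux output quoted above.  A reader
following the standard would decode the "LE, no BOM" row as big endian
and get the wrong characters, which is why the BOM matters here.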

I mean, I don't use Windows and I think those programs are broken and
their authors rightfully should have known better, but practically,
using a BOM solves the problem easily, and if we can be slightly nicer
to the poor, hapless users of those programs, why not?

> On the AIX box to get the same I need to do that as:
> 
>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
> 
> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> UTF-16, a plain UTF-16 gives you the big-endian version. To make things
> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> version. So it seems we can't get the same result at all for that one.

But what do you get on AIX if you just ask for plain UTF-16?  Is it
little endian with BOM, big endian with BOM, or big endian without BOM?
If it's big endian without BOM, did you set ICONV_OMITS_BOM when
building?
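
One way to answer that mechanically is to classify the first bytes of
the output.  The following is only a sketch under the assumption that
the encoded text is ASCII (as with "x\n" above), so that exactly one
byte of the first code unit is zero; classify_utf16 is a hypothetical
helper, not something that exists in Git.

```python
import codecs


def classify_utf16(data: bytes) -> str:
    """Guess the byte order of UTF-16 data that encodes ASCII text."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little endian with BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big endian with BOM"
    # No BOM: for ASCII text, the position of the zero byte in the
    # first code unit reveals the byte order.
    if len(data) >= 2 and data[0] == 0 and data[1] != 0:
        return "big endian without BOM"
    if len(data) >= 2 and data[0] != 0 and data[1] == 0:
        return "little endian without BOM"
    return "unknown"


print(classify_utf16(b"\xff\xfe\x78\x00\x0a\x00"))
print(classify_utf16(b"\x00\x78\x00\x0a"))
```

To use it on a real system, the bytes could come from piping
"echo x | iconv -f UTF-8 -t UTF-16" into a script that passes
sys.stdin.buffer.read() to the function.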

> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.

We probably could have been more careful here.  Part of the problem is
that I don't have access to any affected systems, so in general it's
not easy for me to write a test (or even a patch) for this case.

We did also use iconv(1) before that, but I _think_ it's possible to
remove it.  The tricky part is the use of SHIFT-JIS, which has known
round-tripping problems, but I don't think we rely on the system
iconv(3) there, and encoding any valid SHIFT-JIS sequence is probably
fine.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US