Re: UTF-BOM was: [PATCH] t2080: fix cp invocation...

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "Torsten Bögershausen" <tboegi@web.de>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "Đoàn Trần Công Danh" <congdanhqx@gmail.com>,
	"Matheus Tavares" <matheus.bernardino@usp.br>,
	gitster@pobox.com, git@vger.kernel.org,
	"brian m . carlson" <sandals@crustytoothpaste.net>
Subject: Re: UTF-BOM was: [PATCH] t2080: fix cp invocation...
Date: Wed, 2 Jun 2021 21:13:44 +0200	[thread overview]
Message-ID: <20210602191344.assal4vuajln2sbz@tb-raspi4> (raw)
In-Reply-To: <87mts875d3.fsf@evledraar.gmail.com>

On Wed, Jun 02, 2021 at 03:36:57PM +0200, Ævar Arnfjörð Bjarmason wrote:

> >> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> >> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> >> >> BOM:
> >> >>
> >> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
> >> >>     efbbbf74657874
> >> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text
> >> >>     74657874
> >> >>
> >> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> >> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> >> >> per-se related to any change of yours.
> >> >
> >> > The 0xefbbbf looks interesting, it's BOM for utf-8.
> >> >
> >> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
> >> >> the same failure.
> >> >
> >> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
> >> > affects output of converting *to* utf-16 and utf-32.
> >> >
> >> > So, I think AIX iconv implementation automatically add BOM to utf-8?
> >> >
> >> > Perhap we need to call skip_utf8_bom somewhere?
> >>
> >> I debugged this a bit more, it's probably *also* an issue in our use of
> >> libiconv, but it goes wrong just with our test setup with
> >> iconv(1). I.e. on my boring linux box:
> >>
> >>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
> >>     $VAR1 = [
> >>               '0xff',
> >>               '0xfe',
> >>               '0x78',
> >>               '0x0',
> >>               '0xa',
> >>               '0x0'
> >>             ];
> >>
> >>
> >> On the AIX box to get the same I need to do that as:
> >>
> >>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
> >
> > FWIW, my Linux with musl-libc also need to be done like this.
> >
> >> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> >> UTF-16, a plain UTF-16 gives you the big-endian version.
> >
> > Per spec, plain UTF-16 *is* big-endian. [1]
> >
> > 	In the table <BOM> indicates that the byte order is determined
> > 	by a byte order mark, if present at the beginning of the data
> > 	stream, otherwise it is big-endian.
> >
> >> To make things
> >> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> >> version. So it seems we can't get the same result at all for that one.
> >
> > Ditto for UTF-32
> >
> >> So from the outset the code added around 79444c92943 (utf8: handle
> >> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> >> careful (although this looked broken before), i.e. we should test exact
> >> known-good bytes and see if UTF-16 is really what we think it is,
> >> etc. This is likely broken on any big-endian non-GNUish iconv
> >> implementation.
> >
> > Linux with musl-libc on little endian also thinks UTF-16 without BOM is UTF-16-BE
> >
> > I still think we should strip UTF-8 BOM after reencode_string_len
> > I.e. something like this, I can't test this, though, since I don't have any AIX box.
> > And my Linux with musl-libc doesn't output BOM for utf-8
> > It doesn't write BOM for utf-16be and utf-32be, anyway.
> >
> > -----8<----
> > diff --git a/utf8.c b/utf8.c
> > index de4ce5c0e6..73631632bd 100644
> > --- a/utf8.c
> > +++ b/utf8.c
> > @@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
> >  static const char utf16_le_bom[] = {'\xFF', '\xFE'};
> >  static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
> >  static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
> > +const char utf8_bom[] = "\357\273\277";
> >
> >  struct interval {
> >  	ucs_char_t first;
> > @@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
> >  	return p - s;
> >  }
> >
> > +static int has_utf8_bom(const char *text, size_t len)
> > +{
> > +	return len >= strlen(utf8_bom) &&
> > +		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
> > +}
> > +
> >  /* auxiliary function for binary search in interval table */
> >  static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
> >  {
> > @@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
> >
> >  char *reencode_string_len(const char *in, size_t insz,
> >  			  const char *out_encoding, const char *in_encoding,
> > -			  size_t *outsz)
> > +			  size_t *outsz_p)
> >  {
> >  	iconv_t conv;
> >  	char *out;
> >  	const char *bom_str = NULL;
> >  	size_t bom_len = 0;
> > +	size_t outsz = 0;
> >
> >  	if (!in_encoding)
> >  		return NULL;
> > @@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
> >  		if (conv == (iconv_t) -1)
> >  			return NULL;
> >  	}
> > -	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
> > +	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
> >  	iconv_close(conv);
> >  	if (out && bom_str && bom_len)
> >  		memcpy(out, bom_str, bom_len);
> > +	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
> > +		outsz -= strlen(utf8_bom);
> > +		memmove(out, out + strlen(utf8_bom), outsz + 1);
> > +	}
> > +	if (outsz_p)
> > +		*outsz_p = outsz;
> >  	return out;
> >  }
> >  #endif
> > @@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
> >  	return is_hfs_dot_str(path, "mailmap");
> >  }
> >
> > -const char utf8_bom[] = "\357\273\277";
> > -
> >  int skip_utf8_bom(char **text, size_t len)
> >  {
> > -	if (len < strlen(utf8_bom) ||
> > -	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
> > +	if (!has_utf8_bom(*text, len))
> >  		return 0;
> >  	*text += strlen(utf8_bom);
> >  	return 1;
> > ---->8------
> >
> > 1: https://unicode.org/faq/utf_bom.html
>
> That's getting us there, now we don't fail on the 2nd test, but do start
> failing on the third "re-encode to UTF-16 on checkout" and other
> "checkout" tests.
>
> The "test_cmp" at the end of that 3rd tests shows that the difference in
> test.utf16.raw and test.utf16 is now that the "raw" one has the BOM, but
> not the "test.utf16" file.

What I can read from all of this, is that "the iconv" does not handle BOMS
correcttly. When going from UTF-16 or UTF-32 to UTF-8 the BOM should be removed.
But that is not the case here, as it seams.
The patch from above for utf8.c in Git will fix this - OK so far.

For t2082-parallel-checkout-attributes.sh and t0028 we may be able to prepare
the "right" versions of the expected data on e.g. Linux box and add that material
to the test case.
Aad remove the invocation of a potential broken iconv binary completely from
the test scripts.

next prev parent reply	other threads:[~2021-06-02 19:14 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-26 23:58 [PATCH] t2080: fix cp invocation to copy symlinks instead of following them Matheus Tavares
2021-05-27  7:25 ` Christian Couder
2021-05-27 12:51 ` Ævar Arnfjörð Bjarmason
2021-05-31 14:01   ` Ævar Arnfjörð Bjarmason
2021-05-31 16:09     ` Matheus Tavares
2021-05-31 20:41       ` Ævar Arnfjörð Bjarmason
2021-06-02  1:36     ` Đoàn Trần Công Danh
2021-06-02 10:50       ` Ævar Arnfjörð Bjarmason
2021-06-02 11:14         ` Bagas Sanjaya
2021-06-02 11:22         ` Đoàn Trần Công Danh
2021-06-02 13:36           ` Ævar Arnfjörð Bjarmason
2021-06-02 13:50             ` Đoàn Trần Công Danh
2021-06-03 12:34               ` Đoàn Trần Công Danh
2021-06-02 19:13             ` Torsten Bögershausen [this message]
2021-06-03  0:07         ` brian m. carlson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210602191344.assal4vuajln2sbz@tb-raspi4 \
    --to=tboegi@web.de \
    --cc=avarab@gmail.com \
    --cc=congdanhqx@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=matheus.bernardino@usp.br \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).