git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Krzysztof Żelechowski" <giecrilj@stegny.2a.pl>,
	git@vger.kernel.org,
	"Hamza Mahfooz" <someguy@effective-light.com>
Subject: Re: *Really* noisy encoding warnings post-v2.33.0
Date: Sat, 09 Oct 2021 15:47:16 +0200	[thread overview]
Message-ID: <871r4umfnm.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <YWEBmJk0aENR5Yeo@coredump.intra.peff.net>


On Fri, Oct 08 2021, Jeff King wrote:

> On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote:
>
>> If that were coupled with, say, an advise() call to explain that output
>> and matching might be inaccurate (and show that _once_), that might
>> might it more clear what's going on.
>> 
>> Now I am sympathetic to flooding the user with too many messages, and
>> maybe reducing this to a single instance of "some commit messages could
>> not be re-encoded; output and matching might be inaccurate" is the right
>> thing. But in a sense, it's also working as designed: what you asked for
>> is producing wrong output over and over, and Git is saying so.
>
> The single-output version would perhaps be something like this:
>
> diff --git a/pretty.c b/pretty.c
> index 708b618cfe..c86f41bae7 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding)
>  	return strbuf_detach(&tmp, NULL);
>  }
>  
> +static void show_encoding_warning(const char *output_encoding)
> +{
> +	static int seen_warning;
> +
> +	if (seen_warning)
> +		return;
> +
> +	seen_warning = 1;
> +	warning("one or more commits could not be re-encoded to '%s'",
> +		output_encoding);
> +	advise("When re-encoding fails, some output may be in an unexpected\n"
> +	       "encoding, and pattern matches against commit data may be\n"
> +	       "inaccurate.");
> +}
> +
>  const char *repo_logmsg_reencode(struct repository *r,
>  				 const struct commit *commit,
>  				 char **commit_encoding,
> @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r,
>  	 * case we just return the commit message verbatim.
>  	 */
>  	if (!out) {
> -		warning("unable to reencode commit to '%s'", output_encoding);
> +		show_encoding_warning(output_encoding);
>  		return msg;
>  	}
>  	return out;

I'm not categorically opposed to having this warning stay, because I can
imagine an implementation of it that isn't so overmatching, but I think
neither one of us is going to work on that, so ....

We have the exact same edge case in the grep-a-file code, and it's a
delibirate decision to stay quiet about it.

Let's assume your pattern contains non-ASCII, you've asked for
locale-aware grepping, you want \d+ to mean all sorts of Unicode
fuzzyness instead of just [0-9] etc. (not yet implemented, PCRE_UCP).

It would still be annoying to see a warning every time you grep without
providing a pathspec that blacklists say the 100 '*.png' files that are
in your tree.

And that's a case where we *could* say that the user should mark them
with .gitattributes or whatever, but making every git user go through
that would be annoying to them, so we just do our best and silently fall
back.

Similarly with this, let's say I'm on an OS that likes UTF-16 better, as
some of our users do, I have the relevant settings to re-encode git.git
or linux.git. Now run:

    git -c i18n.logOutputEncoding=utf16 log --author=foobarbaz

And it's 2 warnings in git.git, and 157 in linux.git. Anyway, your
commit above makes that 1 in both cases, which is certainly a *lot*
better.

But I think similar to the grep-a-file case it's still way to much, now
just because I've got some old badly encoded commits in my history I'll
see one warning every time a log revision walk/grep comes across those.

On the "not categorically opposed" I think that this sort of warning
/might/ be good if:

 * It weren't enabled by default, or at least as a transition had
   something like a advise() message pointing at a fsck.skipList-like
   (or other instructions, replace?) about how to quiet it.

 * We're realy dumb with how we chain data->iconv->PCRE here. I.e. we'll
   whine that we can't reencode just to match my "foobarbaz", but we
   could just keep walking past bad bytes. We should ideally say "we
   might have matched your data, but *because* of the encoding failure
   we couldn't.

   We can easily know with something like "foobarbaz" that that's not
   the case.

Anyway, I think all of that we can leave for the future, because I'd
simply assumed that this was based on some report that had to do with
someone not matching with --grep or whatever because of the details of
the encoding going wrong, e.g. a string that's later in a commit
message, but a misencoded character tripped it up.

But in this case this seems to have been because someone tried to feed
"HTML" to it, which is not an encoding, and something iconv_open() has
(I daresay) always and will always error on. It returns -1 and sets
errno=EINVAL.

So having a warning or other detection in the revision loop seems
backwards to me, surely we want something like the below instead?
I.e. die as close to bad option parsing as possible?

Note that this will now die if we have NO_ICONV=Y, even with your patch,
that seems like a feature. Now we'll silently ignore it. I.e. we'll warn
because we failed to re-encode, but we're using a stub function whose
body is:

    { if (e) *e = 0; return NULL; }

So ditto the garbage encoding name we should have died a lot earlier.

Aside from your warning test the below makes tests in t4201-shortlog.sh
fail, but those just seem broken to me. I.e. they seem to rely on git
staying quiet if i18n.commitencoding is set to garbage.

diff --git a/environment.c b/environment.c
index 43bb1b35ffe..c26b18f8e5c 100644
--- a/environment.c
+++ b/environment.c
@@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath)
 
 const char *get_log_output_encoding(void)
 {
-	return git_log_output_encoding ? git_log_output_encoding
+	const char *encoding = git_log_output_encoding ? git_log_output_encoding
 		: get_commit_output_encoding();
+#ifndef NO_ICONV
+	iconv_t conv;
+	conv = iconv_open(encoding, "UTF-8");
+	if (conv == (iconv_t) -1 && errno == EINVAL)
+		die_errno("the '%s' encoding is not known to iconv", encoding);
+#else
+	if (strcmp(encoding, "UTF-8"))
+		die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding);
+#endif
+	return encoding;
 }
 
 const char *get_commit_output_encoding(void)

  reply	other threads:[~2021-10-09 14:33 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24  9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski
2021-08-24 10:31 ` Bagas Sanjaya
2021-08-24 10:33   ` Krzysztof Żelechowski
2021-08-24 10:46     ` Bagas Sanjaya
2021-08-24 19:11       ` Junio C Hamano
2021-08-25  0:57 ` Jeff King
2021-08-25 16:31   ` Junio C Hamano
2021-08-27 18:30     ` Jeff King
2021-08-27 18:32       ` Jeff King
2021-08-27 19:47         ` Junio C Hamano
2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
2021-10-09  1:29         ` Ævar Arnfjörð Bjarmason
2021-10-09  2:36         ` Jeff King
2021-10-09  2:42           ` Jeff King
2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason [this message]
2021-10-27 11:03               ` Jeff King
2021-10-29 10:47                 ` Ævar Arnfjörð Bjarmason
2021-10-29 20:40                   ` Jeff King
2021-10-29 20:45                     ` Junio C Hamano
2021-10-29 20:52                       ` Junio C Hamano
2021-10-29 21:10                         ` Jeff King
2021-10-22 22:58             ` Ævar Arnfjörð Bjarmason
2021-10-10 13:53           ` Johannes Sixt
2021-10-10 15:43             ` Ævar Arnfjörð Bjarmason
2021-08-25 23:00   ` git log --encoding=HTML is not supported Krzysztof Żelechowski
2021-08-27 18:33     ` Jeff King
2021-08-25 23:28   ` Krzysztof Żelechowski
2021-08-25 23:47     ` Bryan Turner
2021-08-26 15:37       ` Junio C Hamano
2021-08-26 20:52         ` Krzysztof Żelechowski
2021-08-27 15:59           ` Junio C Hamano
2021-08-27 18:37             ` Jeff King
2021-08-27 21:51               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871r4umfnm.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=giecrilj@stegny.2a.pl \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).