git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Krzysztof Żelechowski" <giecrilj@stegny.2a.pl>,
	git@vger.kernel.org,
	"Hamza Mahfooz" <someguy@effective-light.com>
Subject: Re: *Really* noisy encoding warnings post-v2.33.0
Date: Fri, 29 Oct 2021 12:47:36 +0200	[thread overview]
Message-ID: <211029.86bl38w124.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <YXkx6WzoF+B1id5T@coredump.intra.peff.net>


On Wed, Oct 27 2021, Jeff King wrote:

> On Sat, Oct 09, 2021 at 03:47:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> But in this case this seems to have been because someone tried to feed
>> "HTML" to it, which is not an encoding, and something iconv_open() has
>> (I daresay) always and will always error on. It returns -1 and sets
>> errno=EINVAL.
>> 
>> So having a warning or other detection in the revision loop seems
>> backwards to me, surely we want something like the below instead?
>> I.e. die as close to bad option parsing as possible?
>
> Sorry for the slow response; this got thrown on my "to think about and
> look at later" pile.
>
> Yeah, I agree that if we sanity-checked the encoding up front, that
> would cover the case we saw in practice, and goes a long way towards
> catching any practical errors.
>
> But I think this patch is tricky:
>
>> diff --git a/environment.c b/environment.c
>> index 43bb1b35ffe..c26b18f8e5c 100644
>> --- a/environment.c
>> +++ b/environment.c
>> @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath)
>>  
>>  const char *get_log_output_encoding(void)
>>  {
>> -	return git_log_output_encoding ? git_log_output_encoding
>> +	const char *encoding = git_log_output_encoding ? git_log_output_encoding
>>  		: get_commit_output_encoding();
>> +#ifndef NO_ICONV
>> +	iconv_t conv;
>> +	conv = iconv_open(encoding, "UTF-8");
>> +	if (conv == (iconv_t) -1 && errno == EINVAL)
>> +		die_errno("the '%s' encoding is not known to iconv", encoding);
>> +#else
>> +	if (strcmp(encoding, "UTF-8"))
>> +		die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding);
>> +#endif
>> +	return encoding;
>>  }
>
> So one obvious problem here is that we call this function once per
> commit, so it's a lot of extra iconv_open() calls. But obviously we
> could use a static flag to do it once per process.

Yes, or the diff below, which seems like a better idea. I.e. stop
calling this in a loop if we know we'll need it, have setup_revisions()
populate it once.

> The other issue is that it is assuming UTF-8 on one end of the
> conversion. But we aren't necessarily doing such a conversion; it
> depends on the commit's on-disk encoding, and the requested output
> encoding. In particular:
>
>   - if both of those match, we do not need to call iconv at all (see the
>     same_encoding() check in repo_logmsg_reencode()). With the patch
>     above, the NO_ICONV case would start to die() when both are say
>     iso8859-1, even though it currently works.
>
>   - likewise, even if you have iconv support, it's possible that your
>     preferred encoding is not compatible with utf8. In which case
>     iconv_open() may complain, even though the actual conversion we'd
>     ask it to do would succeed.
>
> I.e., I don't think there's a way to just ask iconv "does this encoding
> name by itself make any sense". You can only ask it about to/from
> combos.

Yes, I'm not saying it covers the general problem, but that it covers
the specific complained-about issue of a completely nonsensical encoding
like "HTML". We should simply error on that on command startup, whether
or not we have any commits to visit.

But yes, if we want to cover specific encoding issues, e.g. not being
able to squash CJK UTF-8 into US-ASCII that needs to be per-commit,
but...

> So I think a much better version of this is to catch the _actual_
> iconv_open() call we make. And if it fails, say "woah, this combo of
> encodings isn't supported". The reason I didn't do that in the earlier
> patch is that all of this is obscured inside reencode_string_len(),
> which does both the iconv_open() and the iconv() call. We could surface
> that error information.
>
> But I'm not sure it would make sense to die() in that case. While for
> something like "git log --encoding=nonsense" every commit is going to
> fail to re-encode, it's still possible that iconv_open() failures are
> commit-specific. I.e., you could have some garbage commit in your
> history with an unsupported encoding, and you wouldn't want to die() for
> it (it's the same case you are complaining about having a warning for,
> but much worse).
>
> I suspect the best we could do along these lines is to wait until a real
> iconv_open(to, from) fails, and then as a fallback try:
>
>   iconv_open("UTF-8", from);
>   iconv_open(to, "UTF-8");
>
> to sanity-check them individually, and guess that one of them is broken
> if it can't go to/from UTF-8. But even that feels like it's making
> assumptions about both the system iconv, and the charsets people use.
>
> To be clear, I'd expect that most people just use utf-8 in the first
> place, and even if they don't that their system has some basic utf-8
> support. But we are deep into the realm of weird corner cases here, and
> the utility of this warning / error-checking doesn't seem high enough to
> merit the possible regressions we'd get by trying to make too many
> assumptions.

I still think the right move here is to just revert your patch. Yes we
can think of a bunch of tricky edge cases that need to be fixed, but
this is a long-standing problem, and the change as-is has some pretty
bad UI problems.

I think any change that solves those (probably starting with the below)
is going to be relatively major this close to release.

So per <87ily7m1mv.fsf@evledraar.gmail.com> why can't we just revert the
warning(), and then consider a good way forward that covers some/all of
these cases we've noted?

Which should probably start with the below diff, so we can clearly
distinguish startup bad encoding names from runtime ones (we'd put that
proposed "die if iconv_open() doesn't like it" somewhere in
setup_revisions()).

diff --git a/builtin/log.c b/builtin/log.c
index f75d87e8d7f..2b3a607e947 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -539,7 +539,7 @@ static void show_tagger(const char *buf, struct rev_info *rev)
 
 	pp.fmt = rev->commit_format;
 	pp.date_mode = rev->date_mode;
-	pp_user_info(&pp, "Tagger", &out, buf, get_log_output_encoding());
+	pp_user_info(&pp, "Tagger", &out, buf, rev->log_output_encoding);
 	fprintf(rev->diffopt.file, "%s", out.buf);
 	strbuf_release(&out);
 }
@@ -1208,7 +1208,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file,
 	log.file = rev->diffopt.file;
 	log.groups = SHORTLOG_GROUP_AUTHOR;
 	for (i = 0; i < nr; i++)
-		shortlog_add_commit(&log, list[i]);
+		shortlog_add_commit(rev, &log, list[i]);
 
 	shortlog_output(&log);
 
diff --git a/builtin/merge.c b/builtin/merge.c
index ea3112e0c0b..1a8fbafc149 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -431,6 +431,7 @@ static void squash_message(struct commit *commit, struct commit_list *remotehead
 	ctx.abbrev = rev.abbrev;
 	ctx.date_mode = rev.date_mode;
 	ctx.fmt = rev.commit_format;
+	ctx.output_encoding = rev.log_output_encoding;
 
 	strbuf_addstr(&out, "Squashed commit of the following:\n");
 	while ((commit = get_revision(&rev)) != NULL) {
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 36cb909ebaa..905ba7462f3 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -165,7 +165,7 @@ static void show_commit(struct commit *commit, void *data)
 		ctx.date_mode = revs->date_mode;
 		ctx.date_mode_explicit = revs->date_mode_explicit;
 		ctx.fmt = revs->commit_format;
-		ctx.output_encoding = get_log_output_encoding();
+		ctx.output_encoding = revs->log_output_encoding;
 		ctx.color = revs->diffopt.use_color;
 		pretty_print_commit(&ctx, commit, &buf);
 		if (buf.len) {
diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index e7f7af5de3f..2e5409a4fcc 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -198,7 +198,8 @@ static void insert_records_from_trailers(struct shortlog *log,
 	unuse_commit_buffer(commit, commit_buffer);
 }
 
-void shortlog_add_commit(struct shortlog *log, struct commit *commit)
+void shortlog_add_commit(struct rev_info *rev, struct shortlog *log,
+			 struct commit *commit)
 {
 	struct strbuf ident = STRBUF_INIT;
 	struct strbuf oneline = STRBUF_INIT;
@@ -210,7 +211,7 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	ctx.abbrev = log->abbrev;
 	ctx.print_email_subject = 1;
 	ctx.date_mode.type = DATE_NORMAL;
-	ctx.output_encoding = get_log_output_encoding();
+	ctx.output_encoding = rev->log_output_encoding;
 
 	if (!log->summary) {
 		if (log->user_format)
@@ -254,7 +255,7 @@ static void get_from_rev(struct rev_info *rev, struct shortlog *log)
 	if (prepare_revision_walk(rev))
 		die(_("revision walk setup failed"));
 	while ((commit = get_revision(rev)) != NULL)
-		shortlog_add_commit(log, commit);
+		shortlog_add_commit(rev, log, commit);
 }
 
 static int parse_uint(char const **arg, int comma, int defval)
diff --git a/bundle.c b/bundle.c
index a0bb687b0f4..32d74ab0bee 100644
--- a/bundle.c
+++ b/bundle.c
@@ -448,6 +448,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 
 struct bundle_prerequisites_info {
 	struct object_array *pending;
+	struct rev_info *rev;
 	int fd;
 };
 
@@ -464,7 +465,7 @@ static void write_bundle_prerequisites(struct commit *commit, void *data)
 	write_or_die(bpi->fd, buf.buf, buf.len);
 
 	ctx.fmt = CMIT_FMT_ONELINE;
-	ctx.output_encoding = get_log_output_encoding();
+	ctx.output_encoding = bpi->rev->log_output_encoding;
 	strbuf_reset(&buf);
 	pretty_print_commit(&ctx, commit, &buf);
 	strbuf_trim(&buf);
@@ -544,6 +545,7 @@ int create_bundle(struct repository *r, const char *path,
 		die("revision walk setup failed");
 	bpi.fd = bundle_fd;
 	bpi.pending = &revs_copy.pending;
+	bpi.rev = &revs;
 	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
 	object_array_remove_duplicates(&revs_copy.pending);
 
diff --git a/log-tree.c b/log-tree.c
index 644893fd8cf..d79324d5bfd 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -737,7 +737,7 @@ void show_log(struct rev_info *opt)
 
 		raw = (opt->commit_format == CMIT_FMT_USERFORMAT);
 		format_display_notes(&commit->object.oid, &notebuf,
-				     get_log_output_encoding(), raw);
+				     opt->log_output_encoding, raw);
 		ctx.notes_message = strbuf_detach(&notebuf, NULL);
 	}
 
@@ -758,7 +758,7 @@ void show_log(struct rev_info *opt)
 	ctx.mailmap = opt->mailmap;
 	ctx.color = opt->diffopt.use_color;
 	ctx.expand_tabs_in_log = opt->expand_tabs_in_log;
-	ctx.output_encoding = get_log_output_encoding();
+	ctx.output_encoding = opt->log_output_encoding;
 	ctx.rev = opt;
 	if (opt->from_ident.mail_begin && opt->from_ident.name_begin)
 		ctx.from_ident = &opt->from_ident;
diff --git a/pretty.c b/pretty.c
index fe95107ae5a..6eb64e8189d 100644
--- a/pretty.c
+++ b/pretty.c
@@ -2048,15 +2048,17 @@ void pretty_print_commit(struct pretty_print_context *pp,
 	int indent = 4;
 	const char *msg;
 	const char *reencoded;
-	const char *encoding;
+	const char *encoding = pp->output_encoding;
 	int need_8bit_cte = pp->need_8bit_cte;
 
+	if (!encoding)
+		BUG("should have .output_encoding");
+
 	if (pp->fmt == CMIT_FMT_USERFORMAT) {
 		format_commit_message(commit, user_format, sb, pp);
 		return;
 	}
 
-	encoding = get_log_output_encoding();
 	msg = reencoded = logmsg_reencode(commit, NULL, encoding);
 
 	if (pp->fmt == CMIT_FMT_ONELINE || cmit_fmt_is_mail(pp->fmt))
@@ -2121,6 +2123,14 @@ void pp_commit_easy(enum cmit_fmt fmt, const struct commit *commit,
 		    struct strbuf *sb)
 {
 	struct pretty_print_context pp = {0};
+	static const char *output_encoding;
+
+	if (!output_encoding) {
+		const char *tmp = get_log_output_encoding();
+		output_encoding = tmp;
+	}
+	
 	pp.fmt = fmt;
+	pp.output_encoding = output_encoding;
 	pretty_print_commit(&pp, commit, sb);
 }
diff --git a/remote-curl.c b/remote-curl.c
index d69156312bd..733a525bc73 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -345,8 +345,14 @@ static void free_discovery(struct discovery *d)
 static int show_http_message(struct strbuf *type, struct strbuf *charset,
 			     struct strbuf *msg)
 {
+	static const char *output_encoding;
 	const char *p, *eol;
 
+	if (!output_encoding) {
+		const char *tmp = get_log_output_encoding();
+		output_encoding = tmp;
+	}
+
 	/*
 	 * We only show text/plain parts, as other types are likely
 	 * to be ugly to look at on the user's terminal.
@@ -354,7 +360,7 @@ static int show_http_message(struct strbuf *type, struct strbuf *charset,
 	if (strcmp(type->buf, "text/plain"))
 		return -1;
 	if (charset->len)
-		strbuf_reencode(msg, charset->buf, get_log_output_encoding());
+		strbuf_reencode(msg, charset->buf, output_encoding);
 
 	strbuf_trim(msg);
 	if (!msg->len)
diff --git a/revision.c b/revision.c
index ab7c1358042..0b2ad87f28e 100644
--- a/revision.c
+++ b/revision.c
@@ -2866,7 +2866,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 				 &revs->grep_filter);
-	if (!is_encoding_utf8(get_log_output_encoding()))
+	revs->log_output_encoding = get_log_output_encoding();
+	if (!is_encoding_utf8(revs->log_output_encoding))
 		revs->grep_filter.ignore_locale = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
diff --git a/revision.h b/revision.h
index 5578bb4720a..bf9dbca727e 100644
--- a/revision.h
+++ b/revision.h
@@ -237,6 +237,7 @@ struct rev_info {
 	struct string_list *ref_message_ids;
 	int		add_signoff;
 	const char	*extra_headers;
+	const char	*log_output_encoding;
 	const char	*log_reencode;
 	const char	*subject_prefix;
 	int		patch_name_max;
diff --git a/shortlog.h b/shortlog.h
index 3f7e9aabcae..f809617f8a0 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -30,7 +30,9 @@ struct shortlog {
 
 void shortlog_init(struct shortlog *log);
 
-void shortlog_add_commit(struct shortlog *log, struct commit *commit);
+struct rev_info;
+void shortlog_add_commit(struct rev_info *rev, struct shortlog *log,
+			 struct commit *commit);
 
 void shortlog_output(struct shortlog *log);
 
diff --git a/submodule.c b/submodule.c
index c6890705241..f10e9c34ff6 100644
--- a/submodule.c
+++ b/submodule.c
@@ -501,7 +501,7 @@ static void print_submodule_diff_summary(struct repository *r, struct rev_info *
 	while ((commit = get_revision(rev))) {
 		struct pretty_print_context ctx = {0};
 		ctx.date_mode = rev->date_mode;
-		ctx.output_encoding = get_log_output_encoding();
+		ctx.output_encoding = rev->log_output_encoding;
 		strbuf_setlen(&sb, 0);
 		repo_format_commit_message(r, commit, format, &sb,
 				      &ctx);

  reply	other threads:[~2021-10-29 10:57 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24  9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski
2021-08-24 10:31 ` Bagas Sanjaya
2021-08-24 10:33   ` Krzysztof Żelechowski
2021-08-24 10:46     ` Bagas Sanjaya
2021-08-24 19:11       ` Junio C Hamano
2021-08-25  0:57 ` Jeff King
2021-08-25 16:31   ` Junio C Hamano
2021-08-27 18:30     ` Jeff King
2021-08-27 18:32       ` Jeff King
2021-08-27 19:47         ` Junio C Hamano
2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
2021-10-09  1:29         ` Ævar Arnfjörð Bjarmason
2021-10-09  2:36         ` Jeff King
2021-10-09  2:42           ` Jeff King
2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason
2021-10-27 11:03               ` Jeff King
2021-10-29 10:47                 ` Ævar Arnfjörð Bjarmason [this message]
2021-10-29 20:40                   ` Jeff King
2021-10-29 20:45                     ` Junio C Hamano
2021-10-29 20:52                       ` Junio C Hamano
2021-10-29 21:10                         ` Jeff King
2021-10-22 22:58             ` Ævar Arnfjörð Bjarmason
2021-10-10 13:53           ` Johannes Sixt
2021-10-10 15:43             ` Ævar Arnfjörð Bjarmason
2021-08-25 23:00   ` git log --encoding=HTML is not supported Krzysztof Żelechowski
2021-08-27 18:33     ` Jeff King
2021-08-25 23:28   ` Krzysztof Żelechowski
2021-08-25 23:47     ` Bryan Turner
2021-08-26 15:37       ` Junio C Hamano
2021-08-26 20:52         ` Krzysztof Żelechowski
2021-08-27 15:59           ` Junio C Hamano
2021-08-27 18:37             ` Jeff King
2021-08-27 21:51               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=211029.86bl38w124.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=giecrilj@stegny.2a.pl \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).