git log --encoding=HTML is not supported

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* git log --encoding=HTML is not supported
@ 2021-08-24  9:00 Krzysztof Żelechowski
  2021-08-24 10:31 ` Bagas Sanjaya
  2021-08-25  0:57 ` Jeff King
  0 siblings, 2 replies; 33+ messages in thread
From: Krzysztof Żelechowski @ 2021-08-24  9:00 UTC (permalink / raw)
  To: git

Co robiłeś/-aś zanim pojawił się błąd? (Kroki, aby odtworzyć problem)
{ git log --oneline --encoding=HTML stl_function.h; }

Co powinno się stać? (Oczekiwane zachowanie)
828176ba490 libstdc++: Improve doxygen comments in &lt;bits/stl_function.h>

Co stało się zamiast tego? (Rzeczywiste zachowanie)
828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h>

Jaka jest różnica między tym, co powinno się stać, a tym, co się stało?
Znak początku nazwy pliku jest interpretowany jako znak otwierający znacznik.

Inne cenne uwagi:
Błąd u klienta:
<URL: https://bugs.kde.org/show_bug.cgi?id=441255 >

Implementacja wtyczki:
 "--pretty=format:
<tr> 
<td><a href=\"rev:%h\">%h</a></td> <td>%ad</td> <td>%s</td> <td>%an</td> 
</tr>"

Podobne zgłoszenie:
<URL: https://public-inbox.org/git/CACPiFCLzsiUjx-vm-dcd=0E8HezMWkErPyS==OQ7OhaXqR6CUA@mail.gmail.com/ >

Proponowane rozwiązanie:
W odróżnieniu od omawianych powyżej trudności używaniem formatu wynikowego 
JSON, moim zdaniem w tym przypadku wystarczyłoby zakodować znaki [<] i [&] 
w treści w sposób odpowiedni dla HTML.
(To rozwiązanie nie zakłada wykrywania i odrzucania znaków nieprawidłowych.)

[Informacje o systemie]
wersja gita:
git version 2.32.0
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.13.12-1-default #1 SMP Wed Aug 18 08:01:38 UTC 2021 (999e604) 
x86_64
informacje o kompilacji: gnuc: 11.1
informacje o bibliotece libc: glibc: 2.33
$SHELL (typically, interactive shell): /bin/bash

[Włączone skrypty Gita]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-24  9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski
@ 2021-08-24 10:31 ` Bagas Sanjaya
  2021-08-24 10:33   ` Krzysztof Żelechowski
  2021-08-25  0:57 ` Jeff King
  1 sibling, 1 reply; 33+ messages in thread
From: Bagas Sanjaya @ 2021-08-24 10:31 UTC (permalink / raw)
  To: Krzysztof Żelechowski, git

On 24/08/21 16.00, Krzysztof Żelechowski wrote:
> Co robiłeś/-aś zanim pojawił się błąd? (Kroki, aby odtworzyć problem)
> { git log --oneline --encoding=HTML stl_function.h; }
> 
> Co powinno się stać? (Oczekiwane zachowanie)
> 828176ba490 libstdc++: Improve doxygen comments in &lt;bits/stl_function.h>
> 
> Co stało się zamiast tego? (Rzeczywiste zachowanie)
> 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h>
> 
> Jaka jest różnica między tym, co powinno się stać, a tym, co się stało?
> Znak początku nazwy pliku jest interpretowany jako znak otwierający znacznik.
> 
> Inne cenne uwagi:
> Błąd u klienta:
> <URL: https://bugs.kde.org/show_bug.cgi?id=441255 >
> 
> Implementacja wtyczki:
>   "--pretty=format:
> <tr>
> <td><a href=\"rev:%h\">%h</a></td> <td>%ad</td> <td>%s</td> <td>%an</td>
> </tr>"
> 
> Podobne zgłoszenie:
> <URL: https://public-inbox.org/git/CACPiFCLzsiUjx-vm-dcd=0E8HezMWkErPyS==OQ7OhaXqR6CUA@mail.gmail.com/ >
> 
> Proponowane rozwiązanie:
> W odróżnieniu od omawianych powyżej trudności używaniem formatu wynikowego
> JSON, moim zdaniem w tym przypadku wystarczyłoby zakodować znaki [<] i [&]
> w treści w sposób odpowiedni dla HTML.
> (To rozwiązanie nie zakłada wykrywania i odrzucania znaków nieprawidłowych.)
> 
> [Informacje o systemie]
> wersja gita:
> git version 2.32.0
> cpu: x86_64
> no commit associated with this build
> sizeof-long: 8
> sizeof-size_t: 8
> shell-path: /bin/sh
> uname: Linux 5.13.12-1-default #1 SMP Wed Aug 18 08:01:38 UTC 2021 (999e604)
> x86_64
> informacje o kompilacji: gnuc: 11.1
> informacje o bibliotece libc: glibc: 2.33
> $SHELL (typically, interactive shell): /bin/bash
> 
> 
> [Włączone skrypty Gita]
> 
> 
> 

Please speak English here (in other words, re-submit git-bugreport 
without l10n).

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-24 10:31 ` Bagas Sanjaya
@ 2021-08-24 10:33   ` Krzysztof Żelechowski
  2021-08-24 10:46     ` Bagas Sanjaya
  0 siblings, 1 reply; 33+ messages in thread
From: Krzysztof Żelechowski @ 2021-08-24 10:33 UTC (permalink / raw)
  To: git, Bagas Sanjaya

Dnia wtorek, 24 sierpnia 2021 12:31:14 CEST Bagas Sanjaya pisze:

> Please speak English here (in other words, re-submit git-bugreport
> without l10n).

How do I do that?

Chris



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-24 10:33   ` Krzysztof Żelechowski
@ 2021-08-24 10:46     ` Bagas Sanjaya
  2021-08-24 19:11       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Bagas Sanjaya @ 2021-08-24 10:46 UTC (permalink / raw)
  To: Krzysztof Żelechowski, git

On 24/08/21 17.33, Krzysztof Żelechowski wrote:
> Dnia wtorek, 24 sierpnia 2021 12:31:14 CEST Bagas Sanjaya pisze:
> 
>> Please speak English here (in other words, re-submit git-bugreport
>> without l10n).
> 
> How do I do that?

You need to set locale to English when executing `git bugreport`:

```
LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 /path/to/git bugreport
```

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-24 10:46     ` Bagas Sanjaya
@ 2021-08-24 19:11       ` Junio C Hamano
  0 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2021-08-24 19:11 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Bagas Sanjaya, Krzysztof Żelechowski, git

Bagas Sanjaya <bagasdotme@gmail.com> writes:

> On 24/08/21 17.33, Krzysztof Żelechowski wrote:
>> Dnia wtorek, 24 sierpnia 2021 12:31:14 CEST Bagas Sanjaya pisze:
>> 
>>> Please speak English here (in other words, re-submit git-bugreport
>>> without l10n).
>> How do I do that?
>
> You need to set locale to English when executing `git bugreport`:
>
> ```
> LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 /path/to/git bugreport
> ```

Emily, what's your take on this exchange?

I recall that many people (me included) went for "user friendlyness"
by pushing to localize the questionnaire, but here, it seems to be
backfiring at us.

I personally think that it is OK to give a localized questionnaire
and let volunteers who can speak the language help non-English
speakers, but at the same time, it may be a good idea to hint that
filling in the answers in English would give a better chance for
their problems to be looked at (in a localized message).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-24  9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski
  2021-08-24 10:31 ` Bagas Sanjaya
@ 2021-08-25  0:57 ` Jeff King
  2021-08-25 16:31   ` Junio C Hamano
                     ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Jeff King @ 2021-08-25  0:57 UTC (permalink / raw)
  To: Krzysztof Żelechowski; +Cc: git

On Tue, Aug 24, 2021 at 11:00:03AM +0200, Krzysztof Żelechowski wrote:

> Co robiłeś/-aś zanim pojawił się błąd? (Kroki, aby odtworzyć problem)
> { git log --oneline --encoding=HTML stl_function.h; }
> 
> Co powinno się stać? (Oczekiwane zachowanie)
> 828176ba490 libstdc++: Improve doxygen comments in &lt;bits/stl_function.h>
> 
> Co stało się zamiast tego? (Rzeczywiste zachowanie)
> 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h>

I can't read the non-English parts of the email, but I gather you were
expecting "--encoding=HTML" to escape syntactically significant HTML
characters. It's not that kind of "encoding", but more "which character
set are you using" (utf8 vs iso8859-1, etc).

We feed the encoding "HTML" to iconv_open(), which of course has no idea
what that is. It's unfortunate, though, that we don't even print a
warning, and instead just quietly leave the text intact. I wonder if we
should do something like:

diff --git a/pretty.c b/pretty.c
index 535eb97fa6..708b618cfe 100644
--- a/pretty.c
+++ b/pretty.c
@@ -672,7 +672,11 @@ const char *repo_logmsg_reencode(struct repository *r,
 	 * If the re-encoding failed, out might be NULL here; in that
 	 * case we just return the commit message verbatim.
 	 */
-	return out ? out : msg;
+	if (!out) {
+		warning("unable to reencode commit to '%s'", output_encoding);
+		return msg;
+	}
+	return out;
 }

 static int mailmap_name(const char **email, size_t *email_len,

As far as what you're trying to accomplish, HTML-escaping isn't
something Git supports. You'll have to run the output through an
external escaping mechanism.

-Peff

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25  0:57 ` Jeff King
@ 2021-08-25 16:31   ` Junio C Hamano
  2021-08-27 18:30     ` Jeff King
  2021-08-25 23:00   ` git log --encoding=HTML is not supported Krzysztof Żelechowski
  2021-08-25 23:28   ` Krzysztof Żelechowski
  2 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2021-08-25 16:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Krzysztof Żelechowski, git

Jeff King <peff@peff.net> writes:

> We feed the encoding "HTML" to iconv_open(), which of course has no idea
> what that is. It's unfortunate, though, that we don't even print a
> warning, and instead just quietly leave the text intact. I wonder if we
> should do something like:
>
> diff --git a/pretty.c b/pretty.c
> index 535eb97fa6..708b618cfe 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -672,7 +672,11 @@ const char *repo_logmsg_reencode(struct repository *r,
>  	 * If the re-encoding failed, out might be NULL here; in that
>  	 * case we just return the commit message verbatim.
>  	 */
> -	return out ? out : msg;
> +	if (!out) {
> +		warning("unable to reencode commit to '%s'", output_encoding);
> +		return msg;
> +	}
> +	return out;
>  }
>  
>  static int mailmap_name(const char **email, size_t *email_len,

This addition sounds quite sensible to me.

"git log --encoding=bogus" would issue this warning for each and
every commit and that may be a bit irritating, but being irritating
may be a good characteristic for a warning message that is given to
an easily correctable condition.

I originally thought that the warning would be lost to the pager,
but apparently I forgot what I did eons ago at 61b80509 (sending
errors to stdout under $PAGER, 2008-02-16) ;-).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25  0:57 ` Jeff King
  2021-08-25 16:31   ` Junio C Hamano
@ 2021-08-25 23:00   ` Krzysztof Żelechowski
  2021-08-27 18:33     ` Jeff King
  2021-08-25 23:28   ` Krzysztof Żelechowski
  2 siblings, 1 reply; 33+ messages in thread
From: Krzysztof Żelechowski @ 2021-08-25 23:00 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze:
> As far as what you're trying to accomplish, HTML-escaping isn't
> something Git supports. You'll have to run the output through an
> external escaping mechanism.

Have you looked at the format?  It is a HTML fragment with placeholders to be 
filled by git log.  I cannot run the output through an external escaping 
mechanism because it will kill the markup that is already there.

Chris





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25  0:57 ` Jeff King
  2021-08-25 16:31   ` Junio C Hamano
  2021-08-25 23:00   ` git log --encoding=HTML is not supported Krzysztof Żelechowski
@ 2021-08-25 23:28   ` Krzysztof Żelechowski
  2021-08-25 23:47     ` Bryan Turner
  2 siblings, 1 reply; 33+ messages in thread
From: Krzysztof Żelechowski @ 2021-08-25 23:28 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze:
> diff --git a/pretty.c b/pretty.c

Please fix the manual for git log.  It should say what encoding is recognised 
(namely if supported by iconv(1), except that POSIX character maps of 
iconv(1p) are not supported), and that an unrecognised encoding is ignored.

I would also like to see the HTML encoding supported independently of iconv, 
which seems like a pretty easy thing to do.  Dream on, I guess?

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25 23:28   ` Krzysztof Żelechowski
@ 2021-08-25 23:47     ` Bryan Turner
  2021-08-26 15:37       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Bryan Turner @ 2021-08-25 23:47 UTC (permalink / raw)
  To: Krzysztof Żelechowski; +Cc: Jeff King, Git Users

On Wed, Aug 25, 2021 at 4:29 PM Krzysztof Żelechowski
<giecrilj@stegny.2a.pl> wrote:
>
> Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze:
> > diff --git a/pretty.c b/pretty.c
>
> Please fix the manual for git log.  It should say what encoding is recognised
> (namely if supported by iconv(1), except that POSIX character maps of
> iconv(1p) are not supported), and that an unrecognised encoding is ignored.
>
> I would also like to see the HTML encoding supported independently of iconv,
> which seems like a pretty easy thing to do.  Dream on, I guess?

I suspect the answer is less "Dream on" and more "Patches welcome."

>
> Chris
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25 23:47     ` Bryan Turner
@ 2021-08-26 15:37       ` Junio C Hamano
  2021-08-26 20:52         ` Krzysztof Żelechowski
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2021-08-26 15:37 UTC (permalink / raw)
  To: Bryan Turner; +Cc: Krzysztof Żelechowski, Jeff King, Git Users

Bryan Turner <bturner@atlassian.com> writes:

> On Wed, Aug 25, 2021 at 4:29 PM Krzysztof Żelechowski
> <giecrilj@stegny.2a.pl> wrote:
>>
>> Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze:
>> > diff --git a/pretty.c b/pretty.c
>>
>> Please fix the manual for git log.  It should say what encoding is recognised
>> (namely if supported by iconv(1), except that POSIX character maps of
>> iconv(1p) are not supported), and that an unrecognised encoding is ignored.
>>
>> I would also like to see the HTML encoding supported independently of iconv,
>> which seems like a pretty easy thing to do.  Dream on, I guess?
>
> I suspect the answer is less "Dream on" and more "Patches welcome."

Patches are welcomed but not before a proposed design is freshed
out.  I am sure people do welcome the design discussion.

Pieces taken from the contents stored in Git (like "the title of the
commit", "the name of the author of the commit") may need quoting
and/or escaping when they are incorporated into a string to become
parts of "output", and the way the quoting/escaping must be done
would depend on the "host" language/format.  HTML has its own
requirements for how these pieces coming from Git contents are
quoted, but it will not be the only "host" language that needs
quoting.

The requirement for the feature we are "Dreaming on" may be much
closer to the "host language" options (e.g. --tcl, --perl ...) the
"git for-each-ref" command has.  These options tells us to format
each piece of information (e.g. "%(subject)") taken from Git as a
natural 'string' constant in the host language, so that

	git for-each-ref --shell \
	    --format='do_something %(authorname) %(authoremail)'

would write a shell script that calls "do_something" command with
two arguments for each ref enumerated by the command, without having
to worry about whitespaces and quote characters that may appear in
the interpolated pieces.  It is immediately obvious that within the
context of the for-each-ref command, the follwoing would equally be
useful (note: this is already "dreaming on" and does not exist yet):

	echo "<ul>"
	git for-each-ref --html \
		--format='<li>%(authoremail)</li>'
	echo "</ul>"

As we have been seeing efforts to port features around the --format
option between the for-each-ref family of commands and the log
family of commands, I would also imagine that it would be natural
future direction to extend it to the latter and eventually allow

	git log --html \
		--format='<tr><td>%h</td><td>%s</td>...</tr>'

to format each commit into a single row in HTML table, and things
like that.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-26 15:37       ` Junio C Hamano
@ 2021-08-26 20:52         ` Krzysztof Żelechowski
  2021-08-27 15:59           ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Krzysztof Żelechowski @ 2021-08-26 20:52 UTC (permalink / raw)
  To: Bryan Turner, Junio C Hamano; +Cc: Jeff King, Git Users

Dnia czwartek, 26 sierpnia 2021 17:37:40 CEST Junio C Hamano pisze:
>         git log --html \
>                 --format='<tr><td>%h</td><td>%s</td>...</tr>'

I would like to be able to say:
 { git config i18n.logOutputEscape HTML; }

What do you think?

Chris





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-26 20:52         ` Krzysztof Żelechowski
@ 2021-08-27 15:59           ` Junio C Hamano
  2021-08-27 18:37             ` Jeff King
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2021-08-27 15:59 UTC (permalink / raw)
  To: Krzysztof Żelechowski; +Cc: Bryan Turner, Jeff King, Git Users

Krzysztof Żelechowski <giecrilj@stegny.2a.pl> writes:

> Dnia czwartek, 26 sierpnia 2021 17:37:40 CEST Junio C Hamano pisze:
>>         git log --html \
>>                 --format='<tr><td>%h</td><td>%s</td>...</tr>'
>
> I would like to be able to say:
>  { git config i18n.logOutputEscape HTML; }
>
> What do you think?

It depends on what it does.

If the configuration means that "git log" output (with any supported
options, like "-p") will be given with '<' written to '^lt;' etc. so
that it becomes safe to dump it in HTML, it fails to interest me at
all.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25 16:31   ` Junio C Hamano
@ 2021-08-27 18:30     ` Jeff King
  2021-08-27 18:32       ` Jeff King
  2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 33+ messages in thread
From: Jeff King @ 2021-08-27 18:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Krzysztof Żelechowski, git

On Wed, Aug 25, 2021 at 09:31:59AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > We feed the encoding "HTML" to iconv_open(), which of course has no idea
> > what that is. It's unfortunate, though, that we don't even print a
> > warning, and instead just quietly leave the text intact. I wonder if we
> > should do something like:
> [...]
> This addition sounds quite sensible to me.
> 
> "git log --encoding=bogus" would issue this warning for each and
> every commit and that may be a bit irritating, but being irritating
> may be a good characteristic for a warning message that is given to
> an easily correctable condition.
> 
> I originally thought that the warning would be lost to the pager,
> but apparently I forgot what I did eons ago at 61b80509 (sending
> errors to stdout under $PAGER, 2008-02-16) ;-).

Here it is polished into a real commit.

-- >8 --
Subject: [PATCH] logmsg_reencode(): warn when iconv() fails

If the user asks for a pretty-printed commit to be converted (either
explicitly with --encoding=foo, or implicitly because the commit is
non-utf8 and we want to convert it), we pass it through iconv(). If that
fails, we fall back to showing the input verbatim, but don't tell the
user that the output may be bogus.

Let's add a warning to do so, along with a mention in the documentation
for --encoding. Two things to note about the implementation:

  - we could produce the warning closer to the call to iconv() in
    reencode_string_len(), which would let us relay the value of errno.
    But this is not actually very helpful. reencode_string_len() does
    not know we are operating on a commit, and indeed does not know that
    the caller won't produce an error of its own. And the errno values
    from iconv() are seldom helpful (iconv_open() only ever produces
    EINVAL; perhaps EILSEQ from iconv() might be illuminating, but it
    can also return EINVAL for incomplete sequences).

  - if the reason for the failure is that the output charset is not
    supported, then the user will see this warning for every commit we
    try to display. That might be ugly and overwhelming, but on the
    other hand it is making it clear that every one of them has not been
    converted (and the likely outcome anyway is to re-try the command
    with a supported output encoding).

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/pretty-options.txt | 4 +++-
 pretty.c                         | 6 +++++-
 t/t4210-log-i18n.sh              | 7 +++++++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 27ddaf84a1..42b227bc40 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -40,7 +40,9 @@ people using 80-column terminals.
 	defaults to UTF-8. Note that if an object claims to be encoded
 	in `X` and we are outputting in `X`, we will output the object
 	verbatim; this means that invalid sequences in the original
-	commit may be copied to the output.
+	commit may be copied to the output. Likewise, if iconv(3) fails
+	to convert the commit, we will output the original object
+	verbatim, along with a warning.
 
 --expand-tabs=<n>::
 --expand-tabs::
diff --git a/pretty.c b/pretty.c
index 9631529c10..73b5ead509 100644
--- a/pretty.c
+++ b/pretty.c
@@ -671,7 +671,11 @@ const char *repo_logmsg_reencode(struct repository *r,
 	 * If the re-encoding failed, out might be NULL here; in that
 	 * case we just return the commit message verbatim.
 	 */
-	return out ? out : msg;
+	if (!out) {
+		warning("unable to reencode commit to '%s'", output_encoding);
+		return msg;
+	}
+	return out;
 }
 
 static int mailmap_name(const char **email, size_t *email_len,
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index d2dfcf164e..0141f36e33 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -131,4 +131,11 @@ do
 	fi
 done
 
+test_expect_success 'log shows warning when conversion fails' '
+	enc=this-encoding-does-not-exist &&
+	git log -1 --encoding=$enc 2>err &&
+	echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect &&
+	test_cmp expect err
+'
+
 test_done
-- 
2.33.0.396.g72f622fe47


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-27 18:30     ` Jeff King
@ 2021-08-27 18:32       ` Jeff King
  2021-08-27 19:47         ` Junio C Hamano
  2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 33+ messages in thread
From: Jeff King @ 2021-08-27 18:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Krzysztof Żelechowski, git

On Fri, Aug 27, 2021 at 02:30:16PM -0400, Jeff King wrote:

> Here it is polished into a real commit.
> 
> Subject: [PATCH] logmsg_reencode(): warn when iconv() fails

And here's a minimal documentation I'd suggest on top. We can discuss
going further in discussing subtleties of iconv() if we want, but IMHO
it would work to stop here.

-- >8 --
Subject: [PATCH] docs: use "character encoding" to refer to commit-object
 encoding

The word "encoding" can mean a lot of things (e.g., base64 or
quoted-printable encoding in emails, HTML entities, URL encoding, and so
on). The documentation for i18n.commitEncoding and i18n.logOutputEncoding
uses the phrase "character encoding" to make this more clear.

Let's use that phrase in other places to make it clear what kind of
encoding we are talking about. This patch covers the gui.encoding
option, as well as the --encoding option for git-log, etc (in this
latter case, I word-smithed the sentence a little at the same time).
That, coupled with the mention of iconv in the --encoding description,
should make this more clear.

The other spot I looked at is the working-tree-encoding section of
gitattributes(5). But it gives specific examples of encodings that I
think make the meaning pretty clear already.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/config/gui.txt     | 2 +-
 Documentation/pretty-options.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/gui.txt b/Documentation/config/gui.txt
index d30831a130..0c087fd8c9 100644
--- a/Documentation/config/gui.txt
+++ b/Documentation/config/gui.txt
@@ -11,7 +11,7 @@ gui.displayUntracked::
 	in the file list. The default is "true".

 gui.encoding::
-	Specifies the default encoding to use for displaying of
+	Specifies the default character encoding to use for displaying of
 	file contents in linkgit:git-gui[1] and linkgit:gitk[1].
 	It can be overridden by setting the 'encoding' attribute
 	for relevant files (see linkgit:gitattributes[5]).
diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 42b227bc40..b3af850608 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -33,7 +33,7 @@ people using 80-column terminals.
 	used together.

 --encoding=<encoding>::
-	The commit objects record the encoding used for the log message
+	Commit objects record the character encoding used for the log message
 	in their encoding header; this option can be used to tell the
 	command to re-code the commit log message in the encoding
 	preferred by the user.  For non plumbing commands this
-- 
2.33.0.396.g72f622fe47

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-25 23:00   ` git log --encoding=HTML is not supported Krzysztof Żelechowski
@ 2021-08-27 18:33     ` Jeff King
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2021-08-27 18:33 UTC (permalink / raw)
  To: Krzysztof Żelechowski; +Cc: git

On Thu, Aug 26, 2021 at 01:00:44AM +0200, Krzysztof Żelechowski wrote:

> Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze:
> > As far as what you're trying to accomplish, HTML-escaping isn't
> > something Git supports. You'll have to run the output through an
> > external escaping mechanism.
> 
> Have you looked at the format?  It is a HTML fragment with placeholders to be 
> filled by git log.  I cannot run the output through an external escaping 
> mechanism because it will kill the markup that is already there.

Right, what I mean is that you'd have to pull the output out of Git, and
then format (and escape) it separately.

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-27 15:59           ` Junio C Hamano
@ 2021-08-27 18:37             ` Jeff King
  2021-08-27 21:51               ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2021-08-27 18:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Krzysztof Żelechowski, Bryan Turner, Git Users

On Fri, Aug 27, 2021 at 08:59:40AM -0700, Junio C Hamano wrote:

> Krzysztof Żelechowski <giecrilj@stegny.2a.pl> writes:
> 
> > Dnia czwartek, 26 sierpnia 2021 17:37:40 CEST Junio C Hamano pisze:
> >>         git log --html \
> >>                 --format='<tr><td>%h</td><td>%s</td>...</tr>'
> >
> > I would like to be able to say:
> >  { git config i18n.logOutputEscape HTML; }
> >
> > What do you think?
> 
> It depends on what it does.
> 
> If the configuration means that "git log" output (with any supported
> options, like "-p") will be given with '<' written to '^lt;' etc. so
> that it becomes safe to dump it in HTML, it fails to interest me at
> all.

Yeah, I think things get pretty weird when you start thinking about
dumping whole filenames and diff contents.

I wouldn't be opposed to an option for the pretty formatter to have
encodings. Something like:

  git log --format='%(authorname:quote=html)'

I'd probably put off implementing that until we actually unify the
for-each-ref and pretty formats, though (we do not even have
%(authorname) at this point!). The latter already has a quoting
mechanism for shell/perl/python/tcl (though it is not per-atom, and I
wouldn't be opposed to a --format-quote option that quoted all pretty.c
placeholders).

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-27 18:32       ` Jeff King
@ 2021-08-27 19:47         ` Junio C Hamano
  0 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2021-08-27 19:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Krzysztof Żelechowski, git

Jeff King <peff@peff.net> writes:

> On Fri, Aug 27, 2021 at 02:30:16PM -0400, Jeff King wrote:
>
>> Here it is polished into a real commit.
>> 
>> Subject: [PATCH] logmsg_reencode(): warn when iconv() fails
>
> And here's a minimal documentation I'd suggest on top. We can discuss
> going further in discussing subtleties of iconv() if we want, but IMHO
> it would work to stop here.

Thanks, both patches look sensible.

Will queue.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: git log --encoding=HTML is not supported
  2021-08-27 18:37             ` Jeff King
@ 2021-08-27 21:51               ` Junio C Hamano
  0 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2021-08-27 21:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Krzysztof Żelechowski, Bryan Turner, Git Users

Jeff King <peff@peff.net> writes:

> I wouldn't be opposed to an option for the pretty formatter to have
> encodings. Something like:
>
>   git log --format='%(authorname:quote=html)'
>
> I'd probably put off implementing that until we actually unify the
> for-each-ref and pretty formats, though (we do not even have
> %(authorname) at this point!). The latter already has a quoting
> mechanism for shell/perl/python/tcl (though it is not per-atom, and I
> wouldn't be opposed to a --format-quote option that quoted all pretty.c
> placeholders).

Yeah, per-atom would be nice, as we can specify which piece needs
what kind of quoting, e.g.

    git log --format='
	if test %(authoremail:quote=shell) != "gitster@pobox.com"
	then
	    echo %(authorname:quote=html+shell)
	fi
    ' | sh

can be used to write a script to produce an "echo" command with a
shell literal string as its argument, where that literal string
writes author's name in a way that can be inserted in an HTML
document, but omitting the commits by me.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* *Really* noisy encoding warnings post-v2.33.0
  2021-08-27 18:30     ` Jeff King
  2021-08-27 18:32       ` Jeff King
@ 2021-10-09  0:58       ` Ævar Arnfjörð Bjarmason
  2021-10-09  1:29         ` Ævar Arnfjörð Bjarmason
  2021-10-09  2:36         ` Jeff King
  1 sibling, 2 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-09  0:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz


On Fri, Aug 27 2021, Jeff King wrote:

$subject because I think we should really consider backing this out
before it gets to a real release.

I ran into this while testing the grep coloring patch[1] (but it's
unrelated). Before this commit e.g.:

    LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l
    28333

So ~3k lines for my last 100 commits, but then:

    $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning
    299

At first I thought it was spewing warnings for every failed re-encoded
line in some cases, because I get hundreds at a time sometimes, but it's
because stderr and stdout I/O buffering is different (a common
case). Adding a "fflush(stderr)" "fixes" that.

But anyway, I think we've got a lot of users who say *do* want to
reencode something from say UTF-8 to latin1, but then might have the
occasional non-latin1 representable data. The old behavior of silently
falling back is going to be much better for those users, or maybe show
one warning at the end or something, if you feel it really needs to be
kept.

> On Wed, Aug 25, 2021 at 09:31:59AM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> > We feed the encoding "HTML" to iconv_open(), which of course has no idea
>> > what that is. It's unfortunate, though, that we don't even print a
>> > warning, and instead just quietly leave the text intact. I wonder if we
>> > should do something like:
>> [...]
>> This addition sounds quite sensible to me.
>> 
>> "git log --encoding=bogus" would issue this warning for each and
>> every commit and that may be a bit irritating, but being irritating
>> may be a good characteristic for a warning message that is given to
>> an easily correctable condition.
>> 
>> I originally thought that the warning would be lost to the pager,
>> but apparently I forgot what I did eons ago at 61b80509 (sending
>> errors to stdout under $PAGER, 2008-02-16) ;-).
>
> Here it is polished into a real commit.
>
> -- >8 --
> Subject: [PATCH] logmsg_reencode(): warn when iconv() fails
>
> If the user asks for a pretty-printed commit to be converted (either
> explicitly with --encoding=foo, or implicitly because the commit is
> non-utf8 and we want to convert it), we pass it through iconv(). If that
> fails, we fall back to showing the input verbatim, but don't tell the
> user that the output may be bogus.
>
> Let's add a warning to do so, along with a mention in the documentation
> for --encoding. Two things to note about the implementation:
>
>   - we could produce the warning closer to the call to iconv() in
>     reencode_string_len(), which would let us relay the value of errno.
>     But this is not actually very helpful. reencode_string_len() does
>     not know we are operating on a commit, and indeed does not know that
>     the caller won't produce an error of its own. And the errno values
>     from iconv() are seldom helpful (iconv_open() only ever produces
>     EINVAL; perhaps EILSEQ from iconv() might be illuminating, but it
>     can also return EINVAL for incomplete sequences).
>
>   - if the reason for the failure is that the output charset is not
>     supported, then the user will see this warning for every commit we
>     try to display. That might be ugly and overwhelming, but on the
>     other hand it is making it clear that every one of them has not been
>     converted (and the likely outcome anyway is to re-try the command
>     with a supported output encoding).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/pretty-options.txt | 4 +++-
>  pretty.c                         | 6 +++++-
>  t/t4210-log-i18n.sh              | 7 +++++++
>  3 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
> index 27ddaf84a1..42b227bc40 100644
> --- a/Documentation/pretty-options.txt
> +++ b/Documentation/pretty-options.txt
> @@ -40,7 +40,9 @@ people using 80-column terminals.
>  	defaults to UTF-8. Note that if an object claims to be encoded
>  	in `X` and we are outputting in `X`, we will output the object
>  	verbatim; this means that invalid sequences in the original
> -	commit may be copied to the output.
> +	commit may be copied to the output. Likewise, if iconv(3) fails
> +	to convert the commit, we will output the original object
> +	verbatim, along with a warning.
>  
>  --expand-tabs=<n>::
>  --expand-tabs::
> diff --git a/pretty.c b/pretty.c
> index 9631529c10..73b5ead509 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -671,7 +671,11 @@ const char *repo_logmsg_reencode(struct repository *r,
>  	 * If the re-encoding failed, out might be NULL here; in that
>  	 * case we just return the commit message verbatim.
>  	 */
> -	return out ? out : msg;
> +	if (!out) {
> +		warning("unable to reencode commit to '%s'", output_encoding);
> +		return msg;
> +	}
> +	return out;
>  }
>  
>  static int mailmap_name(const char **email, size_t *email_len,
> diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
> index d2dfcf164e..0141f36e33 100755
> --- a/t/t4210-log-i18n.sh
> +++ b/t/t4210-log-i18n.sh
> @@ -131,4 +131,11 @@ do
>  	fi
>  done
>  
> +test_expect_success 'log shows warning when conversion fails' '
> +	enc=this-encoding-does-not-exist &&
> +	git log -1 --encoding=$enc 2>err &&
> +	echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect &&
> +	test_cmp expect err
> +'
> +
>  test_done


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
@ 2021-10-09  1:29         ` Ævar Arnfjörð Bjarmason
  2021-10-09  2:36         ` Jeff King
  1 sibling, 0 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-09  1:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz


On Sat, Oct 09 2021, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Aug 27 2021, Jeff King wrote:
>
> $subject because I think we should really consider backing this out
> before it gets to a real release.
>
> I ran into this while testing the grep coloring patch[1] (but it's
> unrelated). Before this commit e.g.:
>
>     LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l
>     28333
>
> So ~3k lines for my last 100 commits, but then:
>
>     $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning
>     299
>
> At first I thought it was spewing warnings for every failed re-encoded
> line in some cases, because I get hundreds at a time sometimes, but it's
> because stderr and stdout I/O buffering is different (a common
> case). Adding a "fflush(stderr)" "fixes" that.

It's partially that, but also more pathologically:

    $ git -P -c i18n.commitEncoding=ascii log --author=doesnotexist -1 | wc -l
    0
    $ git -P -c i18n.commitEncoding=ascii log --author=doesnotexist -1 2>&1 |wc -l
    6688

I.e. even if we don't end up emitting anything we'll warn, of course we
might not match *because* we failed, e.g. if you had a non-ascii --grep
string, but in this case it's rather noisy.

> But anyway, I think we've got a lot of users who say *do* want to
> reencode something from say UTF-8 to latin1, but then might have the
> occasional non-latin1 representable data. The old behavior of silently
> falling back is going to be much better for those users, or maybe show
> one warning at the end or something, if you feel it really needs to be
> kept.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
  2021-10-09  1:29         ` Ævar Arnfjörð Bjarmason
@ 2021-10-09  2:36         ` Jeff King
  2021-10-09  2:42           ` Jeff King
  2021-10-10 13:53           ` Johannes Sixt
  1 sibling, 2 replies; 33+ messages in thread
From: Jeff King @ 2021-10-09  2:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz

On Sat, Oct 09, 2021 at 02:58:10AM +0200, Ævar Arnfjörð Bjarmason wrote:

> I ran into this while testing the grep coloring patch[1] (but it's
> unrelated). Before this commit e.g.:
> 
>     LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l
>     28333
> 
> So ~3k lines for my last 100 commits, but then:
> 
>     $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning
>     299
> 
> At first I thought it was spewing warnings for every failed re-encoded
> line in some cases, because I get hundreds at a time sometimes, but it's
> because stderr and stdout I/O buffering is different (a common
> case). Adding a "fflush(stderr)" "fixes" that.

I don't think the buffering is the issue. By default stderr flushes on
lines, and we flush commits after showing them. If you take away "-P"
(or look at the combined 2>&1 output in order), you'll see that they are
grouped.

Now one thing you might notice is that there may be multiple warnings
between output commits. But that's because we really are re-encoding
each of those intermediate commits to do your --author grep. And if that
re-encoding fails, we may well be producing the wrong output, because
the matching won't be correct (in your case, presumably the correct
output should be _nothing_, because Æ is not an ascii character).

I do think the current warning is particularly bad there, because it
doesn't even mention the commit oid. So something like:

diff --git a/pretty.c b/pretty.c
index 708b618cfe..ddf501632d 100644
--- a/pretty.c
+++ b/pretty.c
@@ -673,7 +673,8 @@ const char *repo_logmsg_reencode(struct repository *r,
 	 * case we just return the commit message verbatim.
 	 */
 	if (!out) {
-		warning("unable to reencode commit to '%s'", output_encoding);
+		warning("unable to reencode commit %s to '%s'",
+			oid_to_hex(&commit->object.oid), output_encoding);
 		return msg;
 	}
 	return out;

means you get output like:

  $ git -c i18n.commitEncoding=ascii log --format='%h %s' --author=Ævar -100
  warning: unable to reencode commit c90cfc225baaf64af311f7e2953267e4de636205 to 'ascii'
  warning: unable to reencode commit 1d1d731d30cbcd5f3a6a5cbac1fe218e4d4db72b to 'ascii'
  warning: unable to reencode commit 66237bcf60df357f188551e1ea4db90f94c519ae to 'ascii'
  warning: unable to reencode commit 100c2da2d3a330366588143d720f09a88926972a to 'ascii'
  warning: unable to reencode commit 59580685bee17de3efff614df7f508133d1e4a7a to 'ascii'
  59580685be config.h: remove unused git_config_get_untracked_cache() declaration
  warning: unable to reencode commit 067e73c8aee9aeb05eac939205274cd2ad8b7cae to 'ascii'
  067e73c8ae log-tree.h: remove unused function declarations
  [...etc...]

If that were coupled with, say, an advise() call to explain that output
and matching might be inaccurate (and show that _once_), that might
might it more clear what's going on.

Now I am sympathetic to flooding the user with too many messages, and
maybe reducing this to a single instance of "some commit messages could
not be re-encoded; output and matching might be inaccurate" is the right
thing. But in a sense, it's also working as designed: what you asked for
is producing wrong output over and over, and Git is saying so.

I'm not even sure what you're trying to do with that command. It could
never output a single correct commit, because you've asked to match only
commits that will be shown in the wrong encoding.

> But anyway, I think we've got a lot of users who say *do* want to
> reencode something from say UTF-8 to latin1, but then might have the
> occasional non-latin1 representable data. The old behavior of silently
> falling back is going to be much better for those users, or maybe show
> one warning at the end or something, if you feel it really needs to be
> kept.

If there are real-world cases where the quantity of errors is really
getting in the way, I'm open to the idea of having a single error
message. And personally, I don't really have any experience working with
broken encodings (all my commits are in utf8, and that's what I use as
output). It just seems weird to me that 'git log --encoding=foo' would
quietly ignore the option entirely (i.e., the old behavior, which did
lead to a confused user and a post to the list).

-Peff

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09  2:36         ` Jeff King
@ 2021-10-09  2:42           ` Jeff King
  2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason
  2021-10-22 22:58             ` Ævar Arnfjörð Bjarmason
  2021-10-10 13:53           ` Johannes Sixt
  1 sibling, 2 replies; 33+ messages in thread
From: Jeff King @ 2021-10-09  2:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz

On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote:

> If that were coupled with, say, an advise() call to explain that output
> and matching might be inaccurate (and show that _once_), that might
> might it more clear what's going on.
> 
> Now I am sympathetic to flooding the user with too many messages, and
> maybe reducing this to a single instance of "some commit messages could
> not be re-encoded; output and matching might be inaccurate" is the right
> thing. But in a sense, it's also working as designed: what you asked for
> is producing wrong output over and over, and Git is saying so.

The single-output version would perhaps be something like this:

diff --git a/pretty.c b/pretty.c
index 708b618cfe..c86f41bae7 100644
--- a/pretty.c
+++ b/pretty.c
@@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding)
 	return strbuf_detach(&tmp, NULL);
 }
 
+static void show_encoding_warning(const char *output_encoding)
+{
+	static int seen_warning;
+
+	if (seen_warning)
+		return;
+
+	seen_warning = 1;
+	warning("one or more commits could not be re-encoded to '%s'",
+		output_encoding);
+	advise("When re-encoding fails, some output may be in an unexpected\n"
+	       "encoding, and pattern matches against commit data may be\n"
+	       "inaccurate.");
+}
+
 const char *repo_logmsg_reencode(struct repository *r,
 				 const struct commit *commit,
 				 char **commit_encoding,
@@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r,
 	 * case we just return the commit message verbatim.
 	 */
 	if (!out) {
-		warning("unable to reencode commit to '%s'", output_encoding);
+		show_encoding_warning(output_encoding);
 		return msg;
 	}
 	return out;

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09  2:42           ` Jeff King
@ 2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason
  2021-10-27 11:03               ` Jeff King
  2021-10-22 22:58             ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-09 13:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz

On Fri, Oct 08 2021, Jeff King wrote:

> On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote:
>
>> If that were coupled with, say, an advise() call to explain that output
>> and matching might be inaccurate (and show that _once_), that might
>> might it more clear what's going on.
>> 
>> Now I am sympathetic to flooding the user with too many messages, and
>> maybe reducing this to a single instance of "some commit messages could
>> not be re-encoded; output and matching might be inaccurate" is the right
>> thing. But in a sense, it's also working as designed: what you asked for
>> is producing wrong output over and over, and Git is saying so.
>
> The single-output version would perhaps be something like this:
>
> diff --git a/pretty.c b/pretty.c
> index 708b618cfe..c86f41bae7 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding)
>  	return strbuf_detach(&tmp, NULL);
>  }
>  
> +static void show_encoding_warning(const char *output_encoding)
> +{
> +	static int seen_warning;
> +
> +	if (seen_warning)
> +		return;
> +
> +	seen_warning = 1;
> +	warning("one or more commits could not be re-encoded to '%s'",
> +		output_encoding);
> +	advise("When re-encoding fails, some output may be in an unexpected\n"
> +	       "encoding, and pattern matches against commit data may be\n"
> +	       "inaccurate.");
> +}
> +
>  const char *repo_logmsg_reencode(struct repository *r,
>  				 const struct commit *commit,
>  				 char **commit_encoding,
> @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r,
>  	 * case we just return the commit message verbatim.
>  	 */
>  	if (!out) {
> -		warning("unable to reencode commit to '%s'", output_encoding);
> +		show_encoding_warning(output_encoding);
>  		return msg;
>  	}
>  	return out;

I'm not categorically opposed to having this warning stay, because I can
imagine an implementation of it that isn't so overmatching, but I think
neither one of us is going to work on that, so ....

We have the exact same edge case in the grep-a-file code, and it's a
delibirate decision to stay quiet about it.

Let's assume your pattern contains non-ASCII, you've asked for
locale-aware grepping, you want \d+ to mean all sorts of Unicode
fuzzyness instead of just [0-9] etc. (not yet implemented, PCRE_UCP).

It would still be annoying to see a warning every time you grep without
providing a pathspec that blacklists say the 100 '*.png' files that are
in your tree.

And that's a case where we *could* say that the user should mark them
with .gitattributes or whatever, but making every git user go through
that would be annoying to them, so we just do our best and silently fall
back.

Similarly with this, let's say I'm on an OS that likes UTF-16 better, as
some of our users do, I have the relevant settings to re-encode git.git
or linux.git. Now run:

    git -c i18n.logOutputEncoding=utf16 log --author=foobarbaz

And it's 2 warnings in git.git, and 157 in linux.git. Anyway, your
commit above makes that 1 in both cases, which is certainly a *lot*
better.

But I think similar to the grep-a-file case it's still way to much, now
just because I've got some old badly encoded commits in my history I'll
see one warning every time a log revision walk/grep comes across those.

On the "not categorically opposed" I think that this sort of warning
/might/ be good if:

 * It weren't enabled by default, or at least as a transition had
   something like a advise() message pointing at a fsck.skipList-like
   (or other instructions, replace?) about how to quiet it.

 * We're realy dumb with how we chain data->iconv->PCRE here. I.e. we'll
   whine that we can't reencode just to match my "foobarbaz", but we
   could just keep walking past bad bytes. We should ideally say "we
   might have matched your data, but *because* of the encoding failure
   we couldn't.

   We can easily know with something like "foobarbaz" that that's not
   the case.

Anyway, I think all of that we can leave for the future, because I'd
simply assumed that this was based on some report that had to do with
someone not matching with --grep or whatever because of the details of
the encoding going wrong, e.g. a string that's later in a commit
message, but a misencoded character tripped it up.

But in this case this seems to have been because someone tried to feed
"HTML" to it, which is not an encoding, and something iconv_open() has
(I daresay) always and will always error on. It returns -1 and sets
errno=EINVAL.

So having a warning or other detection in the revision loop seems
backwards to me, surely we want something like the below instead?
I.e. die as close to bad option parsing as possible?

Note that this will now die if we have NO_ICONV=Y, even with your patch,
that seems like a feature. Now we'll silently ignore it. I.e. we'll warn
because we failed to re-encode, but we're using a stub function whose
body is:

    { if (e) *e = 0; return NULL; }

So ditto the garbage encoding name we should have died a lot earlier.

Aside from your warning test the below makes tests in t4201-shortlog.sh
fail, but those just seem broken to me. I.e. they seem to rely on git
staying quiet if i18n.commitencoding is set to garbage.

diff --git a/environment.c b/environment.c
index 43bb1b35ffe..c26b18f8e5c 100644
--- a/environment.c
+++ b/environment.c
@@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath)

 const char *get_log_output_encoding(void)
 {
-	return git_log_output_encoding ? git_log_output_encoding
+	const char *encoding = git_log_output_encoding ? git_log_output_encoding
 		: get_commit_output_encoding();
+#ifndef NO_ICONV
+	iconv_t conv;
+	conv = iconv_open(encoding, "UTF-8");
+	if (conv == (iconv_t) -1 && errno == EINVAL)
+		die_errno("the '%s' encoding is not known to iconv", encoding);
+#else
+	if (strcmp(encoding, "UTF-8"))
+		die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding);
+#endif
+	return encoding;
 }

 const char *get_commit_output_encoding(void)

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09  2:36         ` Jeff King
  2021-10-09  2:42           ` Jeff King
@ 2021-10-10 13:53           ` Johannes Sixt
  2021-10-10 15:43             ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 33+ messages in thread
From: Johannes Sixt @ 2021-10-10 13:53 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz,
	Ævar Arnfjörð Bjarmason

Am 09.10.21 um 04:36 schrieb Jeff King:
> On Sat, Oct 09, 2021 at 02:58:10AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
>> I ran into this while testing the grep coloring patch[1] (but it's
>> unrelated). Before this commit e.g.:
>>
>>     LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l
>>     28333
>>
>> So ~3k lines for my last 100 commits, but then:
>>
>>     $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning
>>     299
>>
>> At first I thought it was spewing warnings for every failed re-encoded
>> line in some cases, because I get hundreds at a time sometimes, but it's
>> because stderr and stdout I/O buffering is different (a common
>> case). Adding a "fflush(stderr)" "fixes" that.
> 
> I don't think the buffering is the issue. By default stderr flushes on
> lines, and we flush commits after showing them. If you take away "-P"
> (or look at the combined 2>&1 output in order), you'll see that they are
> grouped.
> 
> Now one thing you might notice is that there may be multiple warnings
> between output commits. But that's because we really are re-encoding
> each of those intermediate commits to do your --author grep. And if that
> re-encoding fails, we may well be producing the wrong output, because
> the matching won't be correct (in your case, presumably the correct
> output should be _nothing_, because Æ is not an ascii character).

I don't understand why i18n.commitEncoding plays a role here. Isn't it
an instruction "when you make a commit, mark the commit message having
this encoding". But grep does not make a commit.

If this were i18n.logOuputEncoding it would make much more sense.

Have I misunderstood the meaning of the two options?

-- Hannes

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-10 13:53           ` Johannes Sixt
@ 2021-10-10 15:43             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-10 15:43 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Jeff King, Junio C Hamano, Krzysztof Żelechowski, git,
	Hamza Mahfooz


On Sun, Oct 10 2021, Johannes Sixt wrote:

> Am 09.10.21 um 04:36 schrieb Jeff King:
>> On Sat, Oct 09, 2021 at 02:58:10AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> 
>>> I ran into this while testing the grep coloring patch[1] (but it's
>>> unrelated). Before this commit e.g.:
>>>
>>>     LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l
>>>     28333
>>>
>>> So ~3k lines for my last 100 commits, but then:
>>>
>>>     $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning
>>>     299
>>>
>>> At first I thought it was spewing warnings for every failed re-encoded
>>> line in some cases, because I get hundreds at a time sometimes, but it's
>>> because stderr and stdout I/O buffering is different (a common
>>> case). Adding a "fflush(stderr)" "fixes" that.
>> 
>> I don't think the buffering is the issue. By default stderr flushes on
>> lines, and we flush commits after showing them. If you take away "-P"
>> (or look at the combined 2>&1 output in order), you'll see that they are
>> grouped.
>> 
>> Now one thing you might notice is that there may be multiple warnings
>> between output commits. But that's because we really are re-encoding
>> each of those intermediate commits to do your --author grep. And if that
>> re-encoding fails, we may well be producing the wrong output, because
>> the matching won't be correct (in your case, presumably the correct
>> output should be _nothing_, because Æ is not an ascii character).
>
> I don't understand why i18n.commitEncoding plays a role here. Isn't it
> an instruction "when you make a commit, mark the commit message having
> this encoding". But grep does not make a commit.
>
> If this were i18n.logOuputEncoding it would make much more sense.
>
> Have I misunderstood the meaning of the two options?

It doesn't, see my later <871r4umfnm.fsf@evledraar.gmail.com> for when I
got it right.

For the amount of warnings etc. it's the same, whether we call iconv
because it's e.g. ascii->utf-8 and that triggers iconv() issues, or
(with i18n.logOuputEncoding) utf-8->ascii.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09  2:42           ` Jeff King
  2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason
@ 2021-10-22 22:58             ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-22 22:58 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz


On Fri, Oct 08 2021, Jeff King wrote:

> On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote:
>
>> If that were coupled with, say, an advise() call to explain that output
>> and matching might be inaccurate (and show that _once_), that might
>> might it more clear what's going on.
>> 
>> Now I am sympathetic to flooding the user with too many messages, and
>> maybe reducing this to a single instance of "some commit messages could
>> not be re-encoded; output and matching might be inaccurate" is the right
>> thing. But in a sense, it's also working as designed: what you asked for
>> is producing wrong output over and over, and Git is saying so.
>
> The single-output version would perhaps be something like this:
>
> diff --git a/pretty.c b/pretty.c
> index 708b618cfe..c86f41bae7 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding)
>  	return strbuf_detach(&tmp, NULL);
>  }
>  
> +static void show_encoding_warning(const char *output_encoding)
> +{
> +	static int seen_warning;
> +
> +	if (seen_warning)
> +		return;
> +
> +	seen_warning = 1;
> +	warning("one or more commits could not be re-encoded to '%s'",
> +		output_encoding);
> +	advise("When re-encoding fails, some output may be in an unexpected\n"
> +	       "encoding, and pattern matches against commit data may be\n"
> +	       "inaccurate.");
> +}
> +
>  const char *repo_logmsg_reencode(struct repository *r,
>  				 const struct commit *commit,
>  				 char **commit_encoding,
> @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r,
>  	 * case we just return the commit message verbatim.
>  	 */
>  	if (!out) {
> -		warning("unable to reencode commit to '%s'", output_encoding);
> +		show_encoding_warning(output_encoding);
>  		return msg;
>  	}
>  	return out;

*Poke* about this. We're getting pretty close to release. I think the
WIP hunk I posted in
https://lore.kernel.org/git/871r4umfnm.fsf@evledraar.gmail.com/ presents
a good way forward with this.

I.e. aside from how wise it is to warn about this in general, I think
there's a pretty bad bug in your implementation where what should
effectively be a parse_options() or git_config()-time one-off warning is
being fired off for every single commit in "git log" in some cases.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason
@ 2021-10-27 11:03               ` Jeff King
  2021-10-29 10:47                 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2021-10-27 11:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz

On Sat, Oct 09, 2021 at 03:47:16PM +0200, Ævar Arnfjörð Bjarmason wrote:

> But in this case this seems to have been because someone tried to feed
> "HTML" to it, which is not an encoding, and something iconv_open() has
> (I daresay) always and will always error on. It returns -1 and sets
> errno=EINVAL.
> 
> So having a warning or other detection in the revision loop seems
> backwards to me, surely we want something like the below instead?
> I.e. die as close to bad option parsing as possible?

Sorry for the slow response; this got thrown on my "to think about and
look at later" pile.

Yeah, I agree that if we sanity-checked the encoding up front, that
would cover the case we saw in practice, and goes a long way towards
catching any practical errors.

But I think this patch is tricky:

> diff --git a/environment.c b/environment.c
> index 43bb1b35ffe..c26b18f8e5c 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath)
>  
>  const char *get_log_output_encoding(void)
>  {
> -	return git_log_output_encoding ? git_log_output_encoding
> +	const char *encoding = git_log_output_encoding ? git_log_output_encoding
>  		: get_commit_output_encoding();
> +#ifndef NO_ICONV
> +	iconv_t conv;
> +	conv = iconv_open(encoding, "UTF-8");
> +	if (conv == (iconv_t) -1 && errno == EINVAL)
> +		die_errno("the '%s' encoding is not known to iconv", encoding);
> +#else
> +	if (strcmp(encoding, "UTF-8"))
> +		die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding);
> +#endif
> +	return encoding;
>  }

So one obvious problem here is that we call this function once per
commit, so it's a lot of extra iconv_open() calls. But obviously we
could use a static flag to do it once per process.

The other issue is that it is assuming UTF-8 on one end of the
conversion. But we aren't necessarily doing such a conversion; it
depends on the commit's on-disk encoding, and the requested output
encoding. In particular:

  - if both of those match, we do not need to call iconv at all (see the
    same_encoding() check in repo_logmsg_reencode()). With the patch
    above, the NO_ICONV case would start to die() when both are say
    iso8859-1, even though it currently works.

  - likewise, even if you have iconv support, it's possible that your
    preferred encoding is not compatible with utf8. In which case
    iconv_open() may complain, even though the actual conversion we'd
    ask it to do would succeed.

I.e., I don't think there's a way to just ask iconv "does this encoding
name by itself make any sense". You can only ask it about to/from
combos.

So I think a much better version of this is to catch the _actual_
iconv_open() call we make. And if it fails, say "woah, this combo of
encodings isn't supported". The reason I didn't do that in the earlier
patch is that all of this is obscured inside reencode_string_len(),
which does both the iconv_open() and the iconv() call. We could surface
that error information.

But I'm not sure it would make sense to die() in that case. While for
something like "git log --encoding=nonsense" every commit is going to
fail to re-encode, it's still possible that iconv_open() failures are
commit-specific. I.e., you could have some garbage commit in your
history with an unsupported encoding, and you wouldn't want to die() for
it (it's the same case you are complaining about having a warning for,
but much worse).

I suspect the best we could do along these lines is to wait until a real
iconv_open(to, from) fails, and then as a fallback try:

  iconv_open("UTF-8", from);
  iconv_open(to, "UTF-8");

to sanity-check them individually, and guess that one of them is broken
if it can't go to/from UTF-8. But even that feels like it's making
assumptions about both the system iconv, and the charsets people use.

To be clear, I'd expect that most people just use utf-8 in the first
place, and even if they don't that their system has some basic utf-8
support. But we are deep into the realm of weird corner cases here, and
the utility of this warning / error-checking doesn't seem high enough to
merit the possible regressions we'd get by trying to make too many
assumptions.

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-27 11:03               ` Jeff King
@ 2021-10-29 10:47                 ` Ævar Arnfjörð Bjarmason
  2021-10-29 20:40                   ` Jeff King
  0 siblings, 1 reply; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-29 10:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz


On Wed, Oct 27 2021, Jeff King wrote:

> On Sat, Oct 09, 2021 at 03:47:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> But in this case this seems to have been because someone tried to feed
>> "HTML" to it, which is not an encoding, and something iconv_open() has
>> (I daresay) always and will always error on. It returns -1 and sets
>> errno=EINVAL.
>> 
>> So having a warning or other detection in the revision loop seems
>> backwards to me, surely we want something like the below instead?
>> I.e. die as close to bad option parsing as possible?
>
> Sorry for the slow response; this got thrown on my "to think about and
> look at later" pile.
>
> Yeah, I agree that if we sanity-checked the encoding up front, that
> would cover the case we saw in practice, and goes a long way towards
> catching any practical errors.
>
> But I think this patch is tricky:
>
>> diff --git a/environment.c b/environment.c
>> index 43bb1b35ffe..c26b18f8e5c 100644
>> --- a/environment.c
>> +++ b/environment.c
>> @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath)
>>  
>>  const char *get_log_output_encoding(void)
>>  {
>> -	return git_log_output_encoding ? git_log_output_encoding
>> +	const char *encoding = git_log_output_encoding ? git_log_output_encoding
>>  		: get_commit_output_encoding();
>> +#ifndef NO_ICONV
>> +	iconv_t conv;
>> +	conv = iconv_open(encoding, "UTF-8");
>> +	if (conv == (iconv_t) -1 && errno == EINVAL)
>> +		die_errno("the '%s' encoding is not known to iconv", encoding);
>> +#else
>> +	if (strcmp(encoding, "UTF-8"))
>> +		die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding);
>> +#endif
>> +	return encoding;
>>  }
>
> So one obvious problem here is that we call this function once per
> commit, so it's a lot of extra iconv_open() calls. But obviously we
> could use a static flag to do it once per process.

Yes, or the diff below, which seems like a better idea. I.e. stop
calling this in a loop if we know we'll need it, have setup_revisions()
populate it once.

> The other issue is that it is assuming UTF-8 on one end of the
> conversion. But we aren't necessarily doing such a conversion; it
> depends on the commit's on-disk encoding, and the requested output
> encoding. In particular:
>
>   - if both of those match, we do not need to call iconv at all (see the
>     same_encoding() check in repo_logmsg_reencode()). With the patch
>     above, the NO_ICONV case would start to die() when both are say
>     iso8859-1, even though it currently works.
>
>   - likewise, even if you have iconv support, it's possible that your
>     preferred encoding is not compatible with utf8. In which case
>     iconv_open() may complain, even though the actual conversion we'd
>     ask it to do would succeed.
>
> I.e., I don't think there's a way to just ask iconv "does this encoding
> name by itself make any sense". You can only ask it about to/from
> combos.

Yes, I'm not saying it covers the general problem, but that it covers
the specific complained-about issue of a completely nonsensical encoding
like "HTML". We should simply error on that on command startup, whether
or not we have any commits to visit.

But yes, if we want to cover specific encoding issues, e.g. not being
able to squash CJK UTF-8 into US-ASCII that needs to be per-commit,
but...

> So I think a much better version of this is to catch the _actual_
> iconv_open() call we make. And if it fails, say "woah, this combo of
> encodings isn't supported". The reason I didn't do that in the earlier
> patch is that all of this is obscured inside reencode_string_len(),
> which does both the iconv_open() and the iconv() call. We could surface
> that error information.
>
> But I'm not sure it would make sense to die() in that case. While for
> something like "git log --encoding=nonsense" every commit is going to
> fail to re-encode, it's still possible that iconv_open() failures are
> commit-specific. I.e., you could have some garbage commit in your
> history with an unsupported encoding, and you wouldn't want to die() for
> it (it's the same case you are complaining about having a warning for,
> but much worse).
>
> I suspect the best we could do along these lines is to wait until a real
> iconv_open(to, from) fails, and then as a fallback try:
>
>   iconv_open("UTF-8", from);
>   iconv_open(to, "UTF-8");
>
> to sanity-check them individually, and guess that one of them is broken
> if it can't go to/from UTF-8. But even that feels like it's making
> assumptions about both the system iconv, and the charsets people use.
>
> To be clear, I'd expect that most people just use utf-8 in the first
> place, and even if they don't that their system has some basic utf-8
> support. But we are deep into the realm of weird corner cases here, and
> the utility of this warning / error-checking doesn't seem high enough to
> merit the possible regressions we'd get by trying to make too many
> assumptions.

I still think the right move here is to just revert your patch. Yes we
can think of a bunch of tricky edge cases that need to be fixed, but
this is a long-standing problem, and the change as-is has some pretty
bad UI problems.

I think any change that solves those (probably starting with the below)
is going to be relatively major this close to release.

So per <87ily7m1mv.fsf@evledraar.gmail.com> why can't we just revert the
warning(), and then consider a good way forward that covers some/all of
these cases we've noted?

Which should probably start with the below diff, so we can clearly
distinguish startup bad encoding names from runtime ones (we'd put that
proposed "die if iconv_open() doesn't like it" somewhere in
setup_revisions()).

diff --git a/builtin/log.c b/builtin/log.c
index f75d87e8d7f..2b3a607e947 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -539,7 +539,7 @@ static void show_tagger(const char *buf, struct rev_info *rev)
 
 	pp.fmt = rev->commit_format;
 	pp.date_mode = rev->date_mode;
-	pp_user_info(&pp, "Tagger", &out, buf, get_log_output_encoding());
+	pp_user_info(&pp, "Tagger", &out, buf, rev->log_output_encoding);
 	fprintf(rev->diffopt.file, "%s", out.buf);
 	strbuf_release(&out);
 }
@@ -1208,7 +1208,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file,
 	log.file = rev->diffopt.file;
 	log.groups = SHORTLOG_GROUP_AUTHOR;
 	for (i = 0; i < nr; i++)
-		shortlog_add_commit(&log, list[i]);
+		shortlog_add_commit(rev, &log, list[i]);
 
 	shortlog_output(&log);
 
diff --git a/builtin/merge.c b/builtin/merge.c
index ea3112e0c0b..1a8fbafc149 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -431,6 +431,7 @@ static void squash_message(struct commit *commit, struct commit_list *remotehead
 	ctx.abbrev = rev.abbrev;
 	ctx.date_mode = rev.date_mode;
 	ctx.fmt = rev.commit_format;
+	ctx.output_encoding = rev.log_output_encoding;
 
 	strbuf_addstr(&out, "Squashed commit of the following:\n");
 	while ((commit = get_revision(&rev)) != NULL) {
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 36cb909ebaa..905ba7462f3 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -165,7 +165,7 @@ static void show_commit(struct commit *commit, void *data)
 		ctx.date_mode = revs->date_mode;
 		ctx.date_mode_explicit = revs->date_mode_explicit;
 		ctx.fmt = revs->commit_format;
-		ctx.output_encoding = get_log_output_encoding();
+		ctx.output_encoding = revs->log_output_encoding;
 		ctx.color = revs->diffopt.use_color;
 		pretty_print_commit(&ctx, commit, &buf);
 		if (buf.len) {
diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index e7f7af5de3f..2e5409a4fcc 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -198,7 +198,8 @@ static void insert_records_from_trailers(struct shortlog *log,
 	unuse_commit_buffer(commit, commit_buffer);
 }
 
-void shortlog_add_commit(struct shortlog *log, struct commit *commit)
+void shortlog_add_commit(struct rev_info *rev, struct shortlog *log,
+			 struct commit *commit)
 {
 	struct strbuf ident = STRBUF_INIT;
 	struct strbuf oneline = STRBUF_INIT;
@@ -210,7 +211,7 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	ctx.abbrev = log->abbrev;
 	ctx.print_email_subject = 1;
 	ctx.date_mode.type = DATE_NORMAL;
-	ctx.output_encoding = get_log_output_encoding();
+	ctx.output_encoding = rev->log_output_encoding;
 
 	if (!log->summary) {
 		if (log->user_format)
@@ -254,7 +255,7 @@ static void get_from_rev(struct rev_info *rev, struct shortlog *log)
 	if (prepare_revision_walk(rev))
 		die(_("revision walk setup failed"));
 	while ((commit = get_revision(rev)) != NULL)
-		shortlog_add_commit(log, commit);
+		shortlog_add_commit(rev, log, commit);
 }
 
 static int parse_uint(char const **arg, int comma, int defval)
diff --git a/bundle.c b/bundle.c
index a0bb687b0f4..32d74ab0bee 100644
--- a/bundle.c
+++ b/bundle.c
@@ -448,6 +448,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs)
 
 struct bundle_prerequisites_info {
 	struct object_array *pending;
+	struct rev_info *rev;
 	int fd;
 };
 
@@ -464,7 +465,7 @@ static void write_bundle_prerequisites(struct commit *commit, void *data)
 	write_or_die(bpi->fd, buf.buf, buf.len);
 
 	ctx.fmt = CMIT_FMT_ONELINE;
-	ctx.output_encoding = get_log_output_encoding();
+	ctx.output_encoding = bpi->rev->log_output_encoding;
 	strbuf_reset(&buf);
 	pretty_print_commit(&ctx, commit, &buf);
 	strbuf_trim(&buf);
@@ -544,6 +545,7 @@ int create_bundle(struct repository *r, const char *path,
 		die("revision walk setup failed");
 	bpi.fd = bundle_fd;
 	bpi.pending = &revs_copy.pending;
+	bpi.rev = &revs;
 	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
 	object_array_remove_duplicates(&revs_copy.pending);
 
diff --git a/log-tree.c b/log-tree.c
index 644893fd8cf..d79324d5bfd 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -737,7 +737,7 @@ void show_log(struct rev_info *opt)
 
 		raw = (opt->commit_format == CMIT_FMT_USERFORMAT);
 		format_display_notes(&commit->object.oid, &notebuf,
-				     get_log_output_encoding(), raw);
+				     opt->log_output_encoding, raw);
 		ctx.notes_message = strbuf_detach(&notebuf, NULL);
 	}
 
@@ -758,7 +758,7 @@ void show_log(struct rev_info *opt)
 	ctx.mailmap = opt->mailmap;
 	ctx.color = opt->diffopt.use_color;
 	ctx.expand_tabs_in_log = opt->expand_tabs_in_log;
-	ctx.output_encoding = get_log_output_encoding();
+	ctx.output_encoding = opt->log_output_encoding;
 	ctx.rev = opt;
 	if (opt->from_ident.mail_begin && opt->from_ident.name_begin)
 		ctx.from_ident = &opt->from_ident;
diff --git a/pretty.c b/pretty.c
index fe95107ae5a..6eb64e8189d 100644
--- a/pretty.c
+++ b/pretty.c
@@ -2048,15 +2048,17 @@ void pretty_print_commit(struct pretty_print_context *pp,
 	int indent = 4;
 	const char *msg;
 	const char *reencoded;
-	const char *encoding;
+	const char *encoding = pp->output_encoding;
 	int need_8bit_cte = pp->need_8bit_cte;
 
+	if (!encoding)
+		BUG("should have .output_encoding");
+
 	if (pp->fmt == CMIT_FMT_USERFORMAT) {
 		format_commit_message(commit, user_format, sb, pp);
 		return;
 	}
 
-	encoding = get_log_output_encoding();
 	msg = reencoded = logmsg_reencode(commit, NULL, encoding);
 
 	if (pp->fmt == CMIT_FMT_ONELINE || cmit_fmt_is_mail(pp->fmt))
@@ -2121,6 +2123,14 @@ void pp_commit_easy(enum cmit_fmt fmt, const struct commit *commit,
 		    struct strbuf *sb)
 {
 	struct pretty_print_context pp = {0};
+	static const char *output_encoding;
+
+	if (!output_encoding) {
+		const char *tmp = get_log_output_encoding();
+		output_encoding = tmp;
+	}
+	
 	pp.fmt = fmt;
+	pp.output_encoding = output_encoding;
 	pretty_print_commit(&pp, commit, sb);
 }
diff --git a/remote-curl.c b/remote-curl.c
index d69156312bd..733a525bc73 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -345,8 +345,14 @@ static void free_discovery(struct discovery *d)
 static int show_http_message(struct strbuf *type, struct strbuf *charset,
 			     struct strbuf *msg)
 {
+	static const char *output_encoding;
 	const char *p, *eol;
 
+	if (!output_encoding) {
+		const char *tmp = get_log_output_encoding();
+		output_encoding = tmp;
+	}
+
 	/*
 	 * We only show text/plain parts, as other types are likely
 	 * to be ugly to look at on the user's terminal.
@@ -354,7 +360,7 @@ static int show_http_message(struct strbuf *type, struct strbuf *charset,
 	if (strcmp(type->buf, "text/plain"))
 		return -1;
 	if (charset->len)
-		strbuf_reencode(msg, charset->buf, get_log_output_encoding());
+		strbuf_reencode(msg, charset->buf, output_encoding);
 
 	strbuf_trim(msg);
 	if (!msg->len)
diff --git a/revision.c b/revision.c
index ab7c1358042..0b2ad87f28e 100644
--- a/revision.c
+++ b/revision.c
@@ -2866,7 +2866,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 				 &revs->grep_filter);
-	if (!is_encoding_utf8(get_log_output_encoding()))
+	revs->log_output_encoding = get_log_output_encoding();
+	if (!is_encoding_utf8(revs->log_output_encoding))
 		revs->grep_filter.ignore_locale = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
diff --git a/revision.h b/revision.h
index 5578bb4720a..bf9dbca727e 100644
--- a/revision.h
+++ b/revision.h
@@ -237,6 +237,7 @@ struct rev_info {
 	struct string_list *ref_message_ids;
 	int		add_signoff;
 	const char	*extra_headers;
+	const char	*log_output_encoding;
 	const char	*log_reencode;
 	const char	*subject_prefix;
 	int		patch_name_max;
diff --git a/shortlog.h b/shortlog.h
index 3f7e9aabcae..f809617f8a0 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -30,7 +30,9 @@ struct shortlog {
 
 void shortlog_init(struct shortlog *log);
 
-void shortlog_add_commit(struct shortlog *log, struct commit *commit);
+struct rev_info;
+void shortlog_add_commit(struct rev_info *rev, struct shortlog *log,
+			 struct commit *commit);
 
 void shortlog_output(struct shortlog *log);
 
diff --git a/submodule.c b/submodule.c
index c6890705241..f10e9c34ff6 100644
--- a/submodule.c
+++ b/submodule.c
@@ -501,7 +501,7 @@ static void print_submodule_diff_summary(struct repository *r, struct rev_info *
 	while ((commit = get_revision(rev))) {
 		struct pretty_print_context ctx = {0};
 		ctx.date_mode = rev->date_mode;
-		ctx.output_encoding = get_log_output_encoding();
+		ctx.output_encoding = rev->log_output_encoding;
 		strbuf_setlen(&sb, 0);
 		repo_format_commit_message(r, commit, format, &sb,
 				      &ctx);

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-29 10:47                 ` Ævar Arnfjörð Bjarmason
@ 2021-10-29 20:40                   ` Jeff King
  2021-10-29 20:45                     ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff King @ 2021-10-29 20:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz

On Fri, Oct 29, 2021 at 12:47:36PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > The other issue is that it is assuming UTF-8 on one end of the
> > conversion. But we aren't necessarily doing such a conversion; it
> > depends on the commit's on-disk encoding, and the requested output
> > encoding. In particular:
> >
> >   - if both of those match, we do not need to call iconv at all (see the
> >     same_encoding() check in repo_logmsg_reencode()). With the patch
> >     above, the NO_ICONV case would start to die() when both are say
> >     iso8859-1, even though it currently works.
> >
> >   - likewise, even if you have iconv support, it's possible that your
> >     preferred encoding is not compatible with utf8. In which case
> >     iconv_open() may complain, even though the actual conversion we'd
> >     ask it to do would succeed.
> >
> > I.e., I don't think there's a way to just ask iconv "does this encoding
> > name by itself make any sense". You can only ask it about to/from
> > combos.
> 
> Yes, I'm not saying it covers the general problem, but that it covers
> the specific complained-about issue of a completely nonsensical encoding
> like "HTML". We should simply error on that on command startup, whether
> or not we have any commits to visit.

I definitely agree with you on the direction, and I don't mind if we
don't cover every case. What I was trying to point out above though is
that the patch you showed actually _regresses_ some cases, and it's hard
to robustly avoid that.

> So per <87ily7m1mv.fsf@evledraar.gmail.com> why can't we just revert the
> warning(), and then consider a good way forward that covers some/all of
> these cases we've noted?

Right, I agreed with that in the other thread. You may need to convince
Junio. ;)

TBH I am not even sure it is worth spending a lot of brain cells on the
"and then consider..." part. Over all these years, we've had one report,
and it simply misunderstand what "--encoding" was for. I thought it was
something we could fix up easily by checking a return value, but IMHO
doing it right is quite tricky because of iconv()'s limited interface,
and the risk of regression outweighs the potential benefit.

-Peff

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-29 20:40                   ` Jeff King
@ 2021-10-29 20:45                     ` Junio C Hamano
  2021-10-29 20:52                       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2021-10-29 20:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason,
	Krzysztof Żelechowski, git, Hamza Mahfooz

Jeff King <peff@peff.net> writes:

> TBH I am not even sure it is worth spending a lot of brain cells on the
> "and then consider..." part. Over all these years, we've had one report,
> and it simply misunderstand what "--encoding" was for. I thought it was
> something we could fix up easily by checking a return value, but IMHO
> doing it right is quite tricky because of iconv()'s limited interface,
> and the risk of regression outweighs the potential benefit.

I tend to agree with the above.  Let's not over-engineer things.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-29 20:45                     ` Junio C Hamano
@ 2021-10-29 20:52                       ` Junio C Hamano
  2021-10-29 21:10                         ` Jeff King
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2021-10-29 20:52 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason,
	Krzysztof Żelechowski, git, Hamza Mahfooz

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> TBH I am not even sure it is worth spending a lot of brain cells on the
>> "and then consider..." part. Over all these years, we've had one report,
>> and it simply misunderstand what "--encoding" was for. I thought it was
>> something we could fix up easily by checking a return value, but IMHO
>> doing it right is quite tricky because of iconv()'s limited interface,
>> and the risk of regression outweighs the potential benefit.
>
> I tend to agree with the above.  Let's not over-engineer things.

----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
From: Junio C Hamano <gitster@pobox.com>
Date: Fri, 29 Oct 2021 13:48:58 -0700
Subject: [PATCH] Revert "logmsg_reencode(): warn when iconv() fails"

This reverts commit fd680bc5 (logmsg_reencode(): warn when iconv()
fails, 2021-08-27).  Throwing a warning for each and every commit
that gets reencoded, without allowing a way to squelch, would make
it unpleasant for folks who have to deal with an ancient part of the
history in an old project that used wrong encoding in the commits.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/pretty-options.txt | 4 +---
 pretty.c                         | 6 +-----
 t/t4210-log-i18n.sh              | 7 -------
 3 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index b3af850608..54d8bb3db0 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -40,9 +40,7 @@ people using 80-column terminals.
 	defaults to UTF-8. Note that if an object claims to be encoded
 	in `X` and we are outputting in `X`, we will output the object
 	verbatim; this means that invalid sequences in the original
-	commit may be copied to the output. Likewise, if iconv(3) fails
-	to convert the commit, we will output the original object
-	verbatim, along with a warning.
+	commit may be copied to the output.
 
 --expand-tabs=<n>::
 --expand-tabs::
diff --git a/pretty.c b/pretty.c
index 73b5ead509..9631529c10 100644
--- a/pretty.c
+++ b/pretty.c
@@ -671,11 +671,7 @@ const char *repo_logmsg_reencode(struct repository *r,
 	 * If the re-encoding failed, out might be NULL here; in that
 	 * case we just return the commit message verbatim.
 	 */
-	if (!out) {
-		warning("unable to reencode commit to '%s'", output_encoding);
-		return msg;
-	}
-	return out;
+	return out ? out : msg;
 }
 
 static int mailmap_name(const char **email, size_t *email_len,
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 0141f36e33..d2dfcf164e 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -131,11 +131,4 @@ do
 	fi
 done
 
-test_expect_success 'log shows warning when conversion fails' '
-	enc=this-encoding-does-not-exist &&
-	git log -1 --encoding=$enc 2>err &&
-	echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect &&
-	test_cmp expect err
-'
-
 test_done
-- 
2.33.1-1007-g607b33ccc6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: *Really* noisy encoding warnings post-v2.33.0
  2021-10-29 20:52                       ` Junio C Hamano
@ 2021-10-29 21:10                         ` Jeff King
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2021-10-29 21:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason,
	Krzysztof Żelechowski, git, Hamza Mahfooz

On Fri, Oct 29, 2021 at 01:52:02PM -0700, Junio C Hamano wrote:

> > I tend to agree with the above.  Let's not over-engineer things.
> 
> ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
> From: Junio C Hamano <gitster@pobox.com>
> Date: Fri, 29 Oct 2021 13:48:58 -0700
> Subject: [PATCH] Revert "logmsg_reencode(): warn when iconv() fails"
> 
> This reverts commit fd680bc5 (logmsg_reencode(): warn when iconv()
> fails, 2021-08-27).  Throwing a warning for each and every commit
> that gets reencoded, without allowing a way to squelch, would make
> it unpleasant for folks who have to deal with an ancient part of the
> history in an old project that used wrong encoding in the commits.

Thanks for tying this up.

I do think fd680bc5's documentation was good, though. So I'd suggest
squashing this in, or applying it on top.

-- >8 --
Subject: [PATCH] log: document --encoding behavior on iconv() failure

We already note that we may produce invalid output when we skip calling
iconv() altogether. But we may also do so if iconv() fails, and we have
no good alternative. Let's document this to avoid surprising users.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/pretty-options.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 54d8bb3db0..dc685be363 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -40,7 +40,9 @@ people using 80-column terminals.
 	defaults to UTF-8. Note that if an object claims to be encoded
 	in `X` and we are outputting in `X`, we will output the object
 	verbatim; this means that invalid sequences in the original
-	commit may be copied to the output.
+	commit may be copied to the output. Likewise, if iconv(3) fails
+	to convert the commit, we will quietly output the original
+	object verbatim.
 
 --expand-tabs=<n>::
 --expand-tabs::
-- 
2.33.1.1446.g3ed047b518

^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-10-29 21:10 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-24  9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski
2021-08-24 10:31 ` Bagas Sanjaya
2021-08-24 10:33   ` Krzysztof Żelechowski
2021-08-24 10:46     ` Bagas Sanjaya
2021-08-24 19:11       ` Junio C Hamano
2021-08-25  0:57 ` Jeff King
2021-08-25 16:31   ` Junio C Hamano
2021-08-27 18:30     ` Jeff King
2021-08-27 18:32       ` Jeff King
2021-08-27 19:47         ` Junio C Hamano
2021-10-09  0:58       ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason
2021-10-09  1:29         ` Ævar Arnfjörð Bjarmason
2021-10-09  2:36         ` Jeff King
2021-10-09  2:42           ` Jeff King
2021-10-09 13:47             ` Ævar Arnfjörð Bjarmason
2021-10-27 11:03               ` Jeff King
2021-10-29 10:47                 ` Ævar Arnfjörð Bjarmason
2021-10-29 20:40                   ` Jeff King
2021-10-29 20:45                     ` Junio C Hamano
2021-10-29 20:52                       ` Junio C Hamano
2021-10-29 21:10                         ` Jeff King
2021-10-22 22:58             ` Ævar Arnfjörð Bjarmason
2021-10-10 13:53           ` Johannes Sixt
2021-10-10 15:43             ` Ævar Arnfjörð Bjarmason
2021-08-25 23:00   ` git log --encoding=HTML is not supported Krzysztof Żelechowski
2021-08-27 18:33     ` Jeff King
2021-08-25 23:28   ` Krzysztof Żelechowski
2021-08-25 23:47     ` Bryan Turner
2021-08-26 15:37       ` Junio C Hamano
2021-08-26 20:52         ` Krzysztof Żelechowski
2021-08-27 15:59           ` Junio C Hamano
2021-08-27 18:37             ` Jeff King
2021-08-27 21:51               ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).