* git log --encoding=HTML is not supported @ 2021-08-24 9:00 Krzysztof Żelechowski 2021-08-24 10:31 ` Bagas Sanjaya 2021-08-25 0:57 ` Jeff King 0 siblings, 2 replies; 33+ messages in thread From: Krzysztof Żelechowski @ 2021-08-24 9:00 UTC (permalink / raw) To: git Co robiłeś/-aś zanim pojawił się błąd? (Kroki, aby odtworzyć problem) { git log --oneline --encoding=HTML stl_function.h; } Co powinno się stać? (Oczekiwane zachowanie) 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h> Co stało się zamiast tego? (Rzeczywiste zachowanie) 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h> Jaka jest różnica między tym, co powinno się stać, a tym, co się stało? Znak początku nazwy pliku jest interpretowany jako znak otwierający znacznik. Inne cenne uwagi: Błąd u klienta: <URL: https://bugs.kde.org/show_bug.cgi?id=441255 > Implementacja wtyczki: "--pretty=format: <tr> <td><a href=\"rev:%h\">%h</a></td> <td>%ad</td> <td>%s</td> <td>%an</td> </tr>" Podobne zgłoszenie: <URL: https://public-inbox.org/git/CACPiFCLzsiUjx-vm-dcd=0E8HezMWkErPyS==OQ7OhaXqR6CUA@mail.gmail.com/ > Proponowane rozwiązanie: W odróżnieniu od omawianych powyżej trudności używaniem formatu wynikowego JSON, moim zdaniem w tym przypadku wystarczyłoby zakodować znaki [<] i [&] w treści w sposób odpowiedni dla HTML. (To rozwiązanie nie zakłada wykrywania i odrzucania znaków nieprawidłowych.) [Informacje o systemie] wersja gita: git version 2.32.0 cpu: x86_64 no commit associated with this build sizeof-long: 8 sizeof-size_t: 8 shell-path: /bin/sh uname: Linux 5.13.12-1-default #1 SMP Wed Aug 18 08:01:38 UTC 2021 (999e604) x86_64 informacje o kompilacji: gnuc: 11.1 informacje o bibliotece libc: glibc: 2.33 $SHELL (typically, interactive shell): /bin/bash [Włączone skrypty Gita] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-24 9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski @ 2021-08-24 10:31 ` Bagas Sanjaya 2021-08-24 10:33 ` Krzysztof Żelechowski 2021-08-25 0:57 ` Jeff King 1 sibling, 1 reply; 33+ messages in thread From: Bagas Sanjaya @ 2021-08-24 10:31 UTC (permalink / raw) To: Krzysztof Żelechowski, git On 24/08/21 16.00, Krzysztof Żelechowski wrote: > Co robiłeś/-aś zanim pojawił się błąd? (Kroki, aby odtworzyć problem) > { git log --oneline --encoding=HTML stl_function.h; } > > Co powinno się stać? (Oczekiwane zachowanie) > 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h> > > Co stało się zamiast tego? (Rzeczywiste zachowanie) > 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h> > > Jaka jest różnica między tym, co powinno się stać, a tym, co się stało? > Znak początku nazwy pliku jest interpretowany jako znak otwierający znacznik. > > Inne cenne uwagi: > Błąd u klienta: > <URL: https://bugs.kde.org/show_bug.cgi?id=441255 > > > Implementacja wtyczki: > "--pretty=format: > <tr> > <td><a href=\"rev:%h\">%h</a></td> <td>%ad</td> <td>%s</td> <td>%an</td> > </tr>" > > Podobne zgłoszenie: > <URL: https://public-inbox.org/git/CACPiFCLzsiUjx-vm-dcd=0E8HezMWkErPyS==OQ7OhaXqR6CUA@mail.gmail.com/ > > > Proponowane rozwiązanie: > W odróżnieniu od omawianych powyżej trudności używaniem formatu wynikowego > JSON, moim zdaniem w tym przypadku wystarczyłoby zakodować znaki [<] i [&] > w treści w sposób odpowiedni dla HTML. > (To rozwiązanie nie zakłada wykrywania i odrzucania znaków nieprawidłowych.) > > [Informacje o systemie] > wersja gita: > git version 2.32.0 > cpu: x86_64 > no commit associated with this build > sizeof-long: 8 > sizeof-size_t: 8 > shell-path: /bin/sh > uname: Linux 5.13.12-1-default #1 SMP Wed Aug 18 08:01:38 UTC 2021 (999e604) > x86_64 > informacje o kompilacji: gnuc: 11.1 > informacje o bibliotece libc: glibc: 2.33 > $SHELL (typically, interactive shell): /bin/bash > > > [Włączone skrypty Gita] > > > Please speak English here (in other words, re-submit git-bugreport without l10n). -- An old man doll... just what I always wanted! - Clara ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-24 10:31 ` Bagas Sanjaya @ 2021-08-24 10:33 ` Krzysztof Żelechowski 2021-08-24 10:46 ` Bagas Sanjaya 0 siblings, 1 reply; 33+ messages in thread From: Krzysztof Żelechowski @ 2021-08-24 10:33 UTC (permalink / raw) To: git, Bagas Sanjaya Dnia wtorek, 24 sierpnia 2021 12:31:14 CEST Bagas Sanjaya pisze: > Please speak English here (in other words, re-submit git-bugreport > without l10n). How do I do that? Chris ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-24 10:33 ` Krzysztof Żelechowski @ 2021-08-24 10:46 ` Bagas Sanjaya 2021-08-24 19:11 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Bagas Sanjaya @ 2021-08-24 10:46 UTC (permalink / raw) To: Krzysztof Żelechowski, git On 24/08/21 17.33, Krzysztof Żelechowski wrote: > Dnia wtorek, 24 sierpnia 2021 12:31:14 CEST Bagas Sanjaya pisze: > >> Please speak English here (in other words, re-submit git-bugreport >> without l10n). > > How do I do that? You need to set locale to English when executing `git bugreport`: ``` LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 /path/to/git bugreport ``` -- An old man doll... just what I always wanted! - Clara ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-24 10:46 ` Bagas Sanjaya @ 2021-08-24 19:11 ` Junio C Hamano 0 siblings, 0 replies; 33+ messages in thread From: Junio C Hamano @ 2021-08-24 19:11 UTC (permalink / raw) To: Emily Shaffer; +Cc: Bagas Sanjaya, Krzysztof Żelechowski, git Bagas Sanjaya <bagasdotme@gmail.com> writes: > On 24/08/21 17.33, Krzysztof Żelechowski wrote: >> Dnia wtorek, 24 sierpnia 2021 12:31:14 CEST Bagas Sanjaya pisze: >> >>> Please speak English here (in other words, re-submit git-bugreport >>> without l10n). >> How do I do that? > > You need to set locale to English when executing `git bugreport`: > > ``` > LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 /path/to/git bugreport > ``` Emily, what's your take on this exchange? I recall that many people (me included) went for "user friendlyness" by pushing to localize the questionnaire, but here, it seems to be backfiring at us. I personally think that it is OK to give a localized questionnaire and let volunteers who can speak the language help non-English speakers, but at the same time, it may be a good idea to hint that filling in the answers in English would give a better chance for their problems to be looked at (in a localized message). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-24 9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski 2021-08-24 10:31 ` Bagas Sanjaya @ 2021-08-25 0:57 ` Jeff King 2021-08-25 16:31 ` Junio C Hamano ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread From: Jeff King @ 2021-08-25 0:57 UTC (permalink / raw) To: Krzysztof Żelechowski; +Cc: git On Tue, Aug 24, 2021 at 11:00:03AM +0200, Krzysztof Żelechowski wrote: > Co robiłeś/-aś zanim pojawił się błąd? (Kroki, aby odtworzyć problem) > { git log --oneline --encoding=HTML stl_function.h; } > > Co powinno się stać? (Oczekiwane zachowanie) > 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h> > > Co stało się zamiast tego? (Rzeczywiste zachowanie) > 828176ba490 libstdc++: Improve doxygen comments in <bits/stl_function.h> I can't read the non-English parts of the email, but I gather you were expecting "--encoding=HTML" to escape syntactically significant HTML characters. It's not that kind of "encoding", but more "which character set are you using" (utf8 vs iso8859-1, etc). We feed the encoding "HTML" to iconv_open(), which of course has no idea what that is. It's unfortunate, though, that we don't even print a warning, and instead just quietly leave the text intact. I wonder if we should do something like: diff --git a/pretty.c b/pretty.c index 535eb97fa6..708b618cfe 100644 --- a/pretty.c +++ b/pretty.c @@ -672,7 +672,11 @@ const char *repo_logmsg_reencode(struct repository *r, * If the re-encoding failed, out might be NULL here; in that * case we just return the commit message verbatim. */ - return out ? out : msg; + if (!out) { + warning("unable to reencode commit to '%s'", output_encoding); + return msg; + } + return out; } static int mailmap_name(const char **email, size_t *email_len, As far as what you're trying to accomplish, HTML-escaping isn't something Git supports. You'll have to run the output through an external escaping mechanism. -Peff ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 0:57 ` Jeff King @ 2021-08-25 16:31 ` Junio C Hamano 2021-08-27 18:30 ` Jeff King 2021-08-25 23:00 ` git log --encoding=HTML is not supported Krzysztof Żelechowski 2021-08-25 23:28 ` Krzysztof Żelechowski 2 siblings, 1 reply; 33+ messages in thread From: Junio C Hamano @ 2021-08-25 16:31 UTC (permalink / raw) To: Jeff King; +Cc: Krzysztof Żelechowski, git Jeff King <peff@peff.net> writes: > We feed the encoding "HTML" to iconv_open(), which of course has no idea > what that is. It's unfortunate, though, that we don't even print a > warning, and instead just quietly leave the text intact. I wonder if we > should do something like: > > diff --git a/pretty.c b/pretty.c > index 535eb97fa6..708b618cfe 100644 > --- a/pretty.c > +++ b/pretty.c > @@ -672,7 +672,11 @@ const char *repo_logmsg_reencode(struct repository *r, > * If the re-encoding failed, out might be NULL here; in that > * case we just return the commit message verbatim. > */ > - return out ? out : msg; > + if (!out) { > + warning("unable to reencode commit to '%s'", output_encoding); > + return msg; > + } > + return out; > } > > static int mailmap_name(const char **email, size_t *email_len, This addition sounds quite sensible to me. "git log --encoding=bogus" would issue this warning for each and every commit and that may be a bit irritating, but being irritating may be a good characteristic for a warning message that is given to an easily correctable condition. I originally thought that the warning would be lost to the pager, but apparently I forgot what I did eons ago at 61b80509 (sending errors to stdout under $PAGER, 2008-02-16) ;-). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 16:31 ` Junio C Hamano @ 2021-08-27 18:30 ` Jeff King 2021-08-27 18:32 ` Jeff King 2021-10-09 0:58 ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 33+ messages in thread From: Jeff King @ 2021-08-27 18:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Krzysztof Żelechowski, git On Wed, Aug 25, 2021 at 09:31:59AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > We feed the encoding "HTML" to iconv_open(), which of course has no idea > > what that is. It's unfortunate, though, that we don't even print a > > warning, and instead just quietly leave the text intact. I wonder if we > > should do something like: > [...] > This addition sounds quite sensible to me. > > "git log --encoding=bogus" would issue this warning for each and > every commit and that may be a bit irritating, but being irritating > may be a good characteristic for a warning message that is given to > an easily correctable condition. > > I originally thought that the warning would be lost to the pager, > but apparently I forgot what I did eons ago at 61b80509 (sending > errors to stdout under $PAGER, 2008-02-16) ;-). Here it is polished into a real commit. -- >8 -- Subject: [PATCH] logmsg_reencode(): warn when iconv() fails If the user asks for a pretty-printed commit to be converted (either explicitly with --encoding=foo, or implicitly because the commit is non-utf8 and we want to convert it), we pass it through iconv(). If that fails, we fall back to showing the input verbatim, but don't tell the user that the output may be bogus. Let's add a warning to do so, along with a mention in the documentation for --encoding. Two things to note about the implementation: - we could produce the warning closer to the call to iconv() in reencode_string_len(), which would let us relay the value of errno. But this is not actually very helpful. reencode_string_len() does not know we are operating on a commit, and indeed does not know that the caller won't produce an error of its own. And the errno values from iconv() are seldom helpful (iconv_open() only ever produces EINVAL; perhaps EILSEQ from iconv() might be illuminating, but it can also return EINVAL for incomplete sequences). - if the reason for the failure is that the output charset is not supported, then the user will see this warning for every commit we try to display. That might be ugly and overwhelming, but on the other hand it is making it clear that every one of them has not been converted (and the likely outcome anyway is to re-try the command with a supported output encoding). Signed-off-by: Jeff King <peff@peff.net> --- Documentation/pretty-options.txt | 4 +++- pretty.c | 6 +++++- t/t4210-log-i18n.sh | 7 +++++++ 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt index 27ddaf84a1..42b227bc40 100644 --- a/Documentation/pretty-options.txt +++ b/Documentation/pretty-options.txt @@ -40,7 +40,9 @@ people using 80-column terminals. defaults to UTF-8. Note that if an object claims to be encoded in `X` and we are outputting in `X`, we will output the object verbatim; this means that invalid sequences in the original - commit may be copied to the output. + commit may be copied to the output. Likewise, if iconv(3) fails + to convert the commit, we will output the original object + verbatim, along with a warning. --expand-tabs=<n>:: --expand-tabs:: diff --git a/pretty.c b/pretty.c index 9631529c10..73b5ead509 100644 --- a/pretty.c +++ b/pretty.c @@ -671,7 +671,11 @@ const char *repo_logmsg_reencode(struct repository *r, * If the re-encoding failed, out might be NULL here; in that * case we just return the commit message verbatim. */ - return out ? out : msg; + if (!out) { + warning("unable to reencode commit to '%s'", output_encoding); + return msg; + } + return out; } static int mailmap_name(const char **email, size_t *email_len, diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index d2dfcf164e..0141f36e33 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -131,4 +131,11 @@ do fi done +test_expect_success 'log shows warning when conversion fails' ' + enc=this-encoding-does-not-exist && + git log -1 --encoding=$enc 2>err && + echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect && + test_cmp expect err +' + test_done -- 2.33.0.396.g72f622fe47 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-27 18:30 ` Jeff King @ 2021-08-27 18:32 ` Jeff King 2021-08-27 19:47 ` Junio C Hamano 2021-10-09 0:58 ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 33+ messages in thread From: Jeff King @ 2021-08-27 18:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: Krzysztof Żelechowski, git On Fri, Aug 27, 2021 at 02:30:16PM -0400, Jeff King wrote: > Here it is polished into a real commit. > > Subject: [PATCH] logmsg_reencode(): warn when iconv() fails And here's a minimal documentation I'd suggest on top. We can discuss going further in discussing subtleties of iconv() if we want, but IMHO it would work to stop here. -- >8 -- Subject: [PATCH] docs: use "character encoding" to refer to commit-object encoding The word "encoding" can mean a lot of things (e.g., base64 or quoted-printable encoding in emails, HTML entities, URL encoding, and so on). The documentation for i18n.commitEncoding and i18n.logOutputEncoding uses the phrase "character encoding" to make this more clear. Let's use that phrase in other places to make it clear what kind of encoding we are talking about. This patch covers the gui.encoding option, as well as the --encoding option for git-log, etc (in this latter case, I word-smithed the sentence a little at the same time). That, coupled with the mention of iconv in the --encoding description, should make this more clear. The other spot I looked at is the working-tree-encoding section of gitattributes(5). But it gives specific examples of encodings that I think make the meaning pretty clear already. Signed-off-by: Jeff King <peff@peff.net> --- Documentation/config/gui.txt | 2 +- Documentation/pretty-options.txt | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/config/gui.txt b/Documentation/config/gui.txt index d30831a130..0c087fd8c9 100644 --- a/Documentation/config/gui.txt +++ b/Documentation/config/gui.txt @@ -11,7 +11,7 @@ gui.displayUntracked:: in the file list. The default is "true". gui.encoding:: - Specifies the default encoding to use for displaying of + Specifies the default character encoding to use for displaying of file contents in linkgit:git-gui[1] and linkgit:gitk[1]. It can be overridden by setting the 'encoding' attribute for relevant files (see linkgit:gitattributes[5]). diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt index 42b227bc40..b3af850608 100644 --- a/Documentation/pretty-options.txt +++ b/Documentation/pretty-options.txt @@ -33,7 +33,7 @@ people using 80-column terminals. used together. --encoding=<encoding>:: - The commit objects record the encoding used for the log message + Commit objects record the character encoding used for the log message in their encoding header; this option can be used to tell the command to re-code the commit log message in the encoding preferred by the user. For non plumbing commands this -- 2.33.0.396.g72f622fe47 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-27 18:32 ` Jeff King @ 2021-08-27 19:47 ` Junio C Hamano 0 siblings, 0 replies; 33+ messages in thread From: Junio C Hamano @ 2021-08-27 19:47 UTC (permalink / raw) To: Jeff King; +Cc: Krzysztof Żelechowski, git Jeff King <peff@peff.net> writes: > On Fri, Aug 27, 2021 at 02:30:16PM -0400, Jeff King wrote: > >> Here it is polished into a real commit. >> >> Subject: [PATCH] logmsg_reencode(): warn when iconv() fails > > And here's a minimal documentation I'd suggest on top. We can discuss > going further in discussing subtleties of iconv() if we want, but IMHO > it would work to stop here. Thanks, both patches look sensible. Will queue. ^ permalink raw reply [flat|nested] 33+ messages in thread
* *Really* noisy encoding warnings post-v2.33.0 2021-08-27 18:30 ` Jeff King 2021-08-27 18:32 ` Jeff King @ 2021-10-09 0:58 ` Ævar Arnfjörð Bjarmason 2021-10-09 1:29 ` Ævar Arnfjörð Bjarmason 2021-10-09 2:36 ` Jeff King 1 sibling, 2 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-09 0:58 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Fri, Aug 27 2021, Jeff King wrote: $subject because I think we should really consider backing this out before it gets to a real release. I ran into this while testing the grep coloring patch[1] (but it's unrelated). Before this commit e.g.: LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l 28333 So ~3k lines for my last 100 commits, but then: $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning 299 At first I thought it was spewing warnings for every failed re-encoded line in some cases, because I get hundreds at a time sometimes, but it's because stderr and stdout I/O buffering is different (a common case). Adding a "fflush(stderr)" "fixes" that. But anyway, I think we've got a lot of users who say *do* want to reencode something from say UTF-8 to latin1, but then might have the occasional non-latin1 representable data. The old behavior of silently falling back is going to be much better for those users, or maybe show one warning at the end or something, if you feel it really needs to be kept. > On Wed, Aug 25, 2021 at 09:31:59AM -0700, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >> > We feed the encoding "HTML" to iconv_open(), which of course has no idea >> > what that is. It's unfortunate, though, that we don't even print a >> > warning, and instead just quietly leave the text intact. I wonder if we >> > should do something like: >> [...] >> This addition sounds quite sensible to me. >> >> "git log --encoding=bogus" would issue this warning for each and >> every commit and that may be a bit irritating, but being irritating >> may be a good characteristic for a warning message that is given to >> an easily correctable condition. >> >> I originally thought that the warning would be lost to the pager, >> but apparently I forgot what I did eons ago at 61b80509 (sending >> errors to stdout under $PAGER, 2008-02-16) ;-). > > Here it is polished into a real commit. > > -- >8 -- > Subject: [PATCH] logmsg_reencode(): warn when iconv() fails > > If the user asks for a pretty-printed commit to be converted (either > explicitly with --encoding=foo, or implicitly because the commit is > non-utf8 and we want to convert it), we pass it through iconv(). If that > fails, we fall back to showing the input verbatim, but don't tell the > user that the output may be bogus. > > Let's add a warning to do so, along with a mention in the documentation > for --encoding. Two things to note about the implementation: > > - we could produce the warning closer to the call to iconv() in > reencode_string_len(), which would let us relay the value of errno. > But this is not actually very helpful. reencode_string_len() does > not know we are operating on a commit, and indeed does not know that > the caller won't produce an error of its own. And the errno values > from iconv() are seldom helpful (iconv_open() only ever produces > EINVAL; perhaps EILSEQ from iconv() might be illuminating, but it > can also return EINVAL for incomplete sequences). > > - if the reason for the failure is that the output charset is not > supported, then the user will see this warning for every commit we > try to display. That might be ugly and overwhelming, but on the > other hand it is making it clear that every one of them has not been > converted (and the likely outcome anyway is to re-try the command > with a supported output encoding). > > Signed-off-by: Jeff King <peff@peff.net> > --- > Documentation/pretty-options.txt | 4 +++- > pretty.c | 6 +++++- > t/t4210-log-i18n.sh | 7 +++++++ > 3 files changed, 15 insertions(+), 2 deletions(-) > > diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt > index 27ddaf84a1..42b227bc40 100644 > --- a/Documentation/pretty-options.txt > +++ b/Documentation/pretty-options.txt > @@ -40,7 +40,9 @@ people using 80-column terminals. > defaults to UTF-8. Note that if an object claims to be encoded > in `X` and we are outputting in `X`, we will output the object > verbatim; this means that invalid sequences in the original > - commit may be copied to the output. > + commit may be copied to the output. Likewise, if iconv(3) fails > + to convert the commit, we will output the original object > + verbatim, along with a warning. > > --expand-tabs=<n>:: > --expand-tabs:: > diff --git a/pretty.c b/pretty.c > index 9631529c10..73b5ead509 100644 > --- a/pretty.c > +++ b/pretty.c > @@ -671,7 +671,11 @@ const char *repo_logmsg_reencode(struct repository *r, > * If the re-encoding failed, out might be NULL here; in that > * case we just return the commit message verbatim. > */ > - return out ? out : msg; > + if (!out) { > + warning("unable to reencode commit to '%s'", output_encoding); > + return msg; > + } > + return out; > } > > static int mailmap_name(const char **email, size_t *email_len, > diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh > index d2dfcf164e..0141f36e33 100755 > --- a/t/t4210-log-i18n.sh > +++ b/t/t4210-log-i18n.sh > @@ -131,4 +131,11 @@ do > fi > done > > +test_expect_success 'log shows warning when conversion fails' ' > + enc=this-encoding-does-not-exist && > + git log -1 --encoding=$enc 2>err && > + echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect && > + test_cmp expect err > +' > + > test_done ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 0:58 ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason @ 2021-10-09 1:29 ` Ævar Arnfjörð Bjarmason 2021-10-09 2:36 ` Jeff King 1 sibling, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-09 1:29 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Sat, Oct 09 2021, Ævar Arnfjörð Bjarmason wrote: > On Fri, Aug 27 2021, Jeff King wrote: > > $subject because I think we should really consider backing this out > before it gets to a real release. > > I ran into this while testing the grep coloring patch[1] (but it's > unrelated). Before this commit e.g.: > > LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l > 28333 > > So ~3k lines for my last 100 commits, but then: > > $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning > 299 > > At first I thought it was spewing warnings for every failed re-encoded > line in some cases, because I get hundreds at a time sometimes, but it's > because stderr and stdout I/O buffering is different (a common > case). Adding a "fflush(stderr)" "fixes" that. It's partially that, but also more pathologically: $ git -P -c i18n.commitEncoding=ascii log --author=doesnotexist -1 | wc -l 0 $ git -P -c i18n.commitEncoding=ascii log --author=doesnotexist -1 2>&1 |wc -l 6688 I.e. even if we don't end up emitting anything we'll warn, of course we might not match *because* we failed, e.g. if you had a non-ascii --grep string, but in this case it's rather noisy. > But anyway, I think we've got a lot of users who say *do* want to > reencode something from say UTF-8 to latin1, but then might have the > occasional non-latin1 representable data. The old behavior of silently > falling back is going to be much better for those users, or maybe show > one warning at the end or something, if you feel it really needs to be > kept. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 0:58 ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason 2021-10-09 1:29 ` Ævar Arnfjörð Bjarmason @ 2021-10-09 2:36 ` Jeff King 2021-10-09 2:42 ` Jeff King 2021-10-10 13:53 ` Johannes Sixt 1 sibling, 2 replies; 33+ messages in thread From: Jeff King @ 2021-10-09 2:36 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Sat, Oct 09, 2021 at 02:58:10AM +0200, Ævar Arnfjörð Bjarmason wrote: > I ran into this while testing the grep coloring patch[1] (but it's > unrelated). Before this commit e.g.: > > LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l > 28333 > > So ~3k lines for my last 100 commits, but then: > > $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning > 299 > > At first I thought it was spewing warnings for every failed re-encoded > line in some cases, because I get hundreds at a time sometimes, but it's > because stderr and stdout I/O buffering is different (a common > case). Adding a "fflush(stderr)" "fixes" that. I don't think the buffering is the issue. By default stderr flushes on lines, and we flush commits after showing them. If you take away "-P" (or look at the combined 2>&1 output in order), you'll see that they are grouped. Now one thing you might notice is that there may be multiple warnings between output commits. But that's because we really are re-encoding each of those intermediate commits to do your --author grep. And if that re-encoding fails, we may well be producing the wrong output, because the matching won't be correct (in your case, presumably the correct output should be _nothing_, because Æ is not an ascii character). I do think the current warning is particularly bad there, because it doesn't even mention the commit oid. So something like: diff --git a/pretty.c b/pretty.c index 708b618cfe..ddf501632d 100644 --- a/pretty.c +++ b/pretty.c @@ -673,7 +673,8 @@ const char *repo_logmsg_reencode(struct repository *r, * case we just return the commit message verbatim. */ if (!out) { - warning("unable to reencode commit to '%s'", output_encoding); + warning("unable to reencode commit %s to '%s'", + oid_to_hex(&commit->object.oid), output_encoding); return msg; } return out; means you get output like: $ git -c i18n.commitEncoding=ascii log --format='%h %s' --author=Ævar -100 warning: unable to reencode commit c90cfc225baaf64af311f7e2953267e4de636205 to 'ascii' warning: unable to reencode commit 1d1d731d30cbcd5f3a6a5cbac1fe218e4d4db72b to 'ascii' warning: unable to reencode commit 66237bcf60df357f188551e1ea4db90f94c519ae to 'ascii' warning: unable to reencode commit 100c2da2d3a330366588143d720f09a88926972a to 'ascii' warning: unable to reencode commit 59580685bee17de3efff614df7f508133d1e4a7a to 'ascii' 59580685be config.h: remove unused git_config_get_untracked_cache() declaration warning: unable to reencode commit 067e73c8aee9aeb05eac939205274cd2ad8b7cae to 'ascii' 067e73c8ae log-tree.h: remove unused function declarations [...etc...] If that were coupled with, say, an advise() call to explain that output and matching might be inaccurate (and show that _once_), that might might it more clear what's going on. Now I am sympathetic to flooding the user with too many messages, and maybe reducing this to a single instance of "some commit messages could not be re-encoded; output and matching might be inaccurate" is the right thing. But in a sense, it's also working as designed: what you asked for is producing wrong output over and over, and Git is saying so. I'm not even sure what you're trying to do with that command. It could never output a single correct commit, because you've asked to match only commits that will be shown in the wrong encoding. > But anyway, I think we've got a lot of users who say *do* want to > reencode something from say UTF-8 to latin1, but then might have the > occasional non-latin1 representable data. The old behavior of silently > falling back is going to be much better for those users, or maybe show > one warning at the end or something, if you feel it really needs to be > kept. If there are real-world cases where the quantity of errors is really getting in the way, I'm open to the idea of having a single error message. And personally, I don't really have any experience working with broken encodings (all my commits are in utf8, and that's what I use as output). It just seems weird to me that 'git log --encoding=foo' would quietly ignore the option entirely (i.e., the old behavior, which did lead to a confused user and a post to the list). -Peff ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 2:36 ` Jeff King @ 2021-10-09 2:42 ` Jeff King 2021-10-09 13:47 ` Ævar Arnfjörð Bjarmason 2021-10-22 22:58 ` Ævar Arnfjörð Bjarmason 2021-10-10 13:53 ` Johannes Sixt 1 sibling, 2 replies; 33+ messages in thread From: Jeff King @ 2021-10-09 2:42 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote: > If that were coupled with, say, an advise() call to explain that output > and matching might be inaccurate (and show that _once_), that might > might it more clear what's going on. > > Now I am sympathetic to flooding the user with too many messages, and > maybe reducing this to a single instance of "some commit messages could > not be re-encoded; output and matching might be inaccurate" is the right > thing. But in a sense, it's also working as designed: what you asked for > is producing wrong output over and over, and Git is saying so. The single-output version would perhaps be something like this: diff --git a/pretty.c b/pretty.c index 708b618cfe..c86f41bae7 100644 --- a/pretty.c +++ b/pretty.c @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding) return strbuf_detach(&tmp, NULL); } +static void show_encoding_warning(const char *output_encoding) +{ + static int seen_warning; + + if (seen_warning) + return; + + seen_warning = 1; + warning("one or more commits could not be re-encoded to '%s'", + output_encoding); + advise("When re-encoding fails, some output may be in an unexpected\n" + "encoding, and pattern matches against commit data may be\n" + "inaccurate."); +} + const char *repo_logmsg_reencode(struct repository *r, const struct commit *commit, char **commit_encoding, @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r, * case we just return the commit message verbatim. */ if (!out) { - warning("unable to reencode commit to '%s'", output_encoding); + show_encoding_warning(output_encoding); return msg; } return out; ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 2:42 ` Jeff King @ 2021-10-09 13:47 ` Ævar Arnfjörð Bjarmason 2021-10-27 11:03 ` Jeff King 2021-10-22 22:58 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-09 13:47 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Fri, Oct 08 2021, Jeff King wrote: > On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote: > >> If that were coupled with, say, an advise() call to explain that output >> and matching might be inaccurate (and show that _once_), that might >> might it more clear what's going on. >> >> Now I am sympathetic to flooding the user with too many messages, and >> maybe reducing this to a single instance of "some commit messages could >> not be re-encoded; output and matching might be inaccurate" is the right >> thing. But in a sense, it's also working as designed: what you asked for >> is producing wrong output over and over, and Git is saying so. > > The single-output version would perhaps be something like this: > > diff --git a/pretty.c b/pretty.c > index 708b618cfe..c86f41bae7 100644 > --- a/pretty.c > +++ b/pretty.c > @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding) > return strbuf_detach(&tmp, NULL); > } > > +static void show_encoding_warning(const char *output_encoding) > +{ > + static int seen_warning; > + > + if (seen_warning) > + return; > + > + seen_warning = 1; > + warning("one or more commits could not be re-encoded to '%s'", > + output_encoding); > + advise("When re-encoding fails, some output may be in an unexpected\n" > + "encoding, and pattern matches against commit data may be\n" > + "inaccurate."); > +} > + > const char *repo_logmsg_reencode(struct repository *r, > const struct commit *commit, > char **commit_encoding, > @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r, > * case we just return the commit message verbatim. > */ > if (!out) { > - warning("unable to reencode commit to '%s'", output_encoding); > + show_encoding_warning(output_encoding); > return msg; > } > return out; I'm not categorically opposed to having this warning stay, because I can imagine an implementation of it that isn't so overmatching, but I think neither one of us is going to work on that, so .... We have the exact same edge case in the grep-a-file code, and it's a delibirate decision to stay quiet about it. Let's assume your pattern contains non-ASCII, you've asked for locale-aware grepping, you want \d+ to mean all sorts of Unicode fuzzyness instead of just [0-9] etc. (not yet implemented, PCRE_UCP). It would still be annoying to see a warning every time you grep without providing a pathspec that blacklists say the 100 '*.png' files that are in your tree. And that's a case where we *could* say that the user should mark them with .gitattributes or whatever, but making every git user go through that would be annoying to them, so we just do our best and silently fall back. Similarly with this, let's say I'm on an OS that likes UTF-16 better, as some of our users do, I have the relevant settings to re-encode git.git or linux.git. Now run: git -c i18n.logOutputEncoding=utf16 log --author=foobarbaz And it's 2 warnings in git.git, and 157 in linux.git. Anyway, your commit above makes that 1 in both cases, which is certainly a *lot* better. But I think similar to the grep-a-file case it's still way to much, now just because I've got some old badly encoded commits in my history I'll see one warning every time a log revision walk/grep comes across those. On the "not categorically opposed" I think that this sort of warning /might/ be good if: * It weren't enabled by default, or at least as a transition had something like a advise() message pointing at a fsck.skipList-like (or other instructions, replace?) about how to quiet it. * We're realy dumb with how we chain data->iconv->PCRE here. I.e. we'll whine that we can't reencode just to match my "foobarbaz", but we could just keep walking past bad bytes. We should ideally say "we might have matched your data, but *because* of the encoding failure we couldn't. We can easily know with something like "foobarbaz" that that's not the case. Anyway, I think all of that we can leave for the future, because I'd simply assumed that this was based on some report that had to do with someone not matching with --grep or whatever because of the details of the encoding going wrong, e.g. a string that's later in a commit message, but a misencoded character tripped it up. But in this case this seems to have been because someone tried to feed "HTML" to it, which is not an encoding, and something iconv_open() has (I daresay) always and will always error on. It returns -1 and sets errno=EINVAL. So having a warning or other detection in the revision loop seems backwards to me, surely we want something like the below instead? I.e. die as close to bad option parsing as possible? Note that this will now die if we have NO_ICONV=Y, even with your patch, that seems like a feature. Now we'll silently ignore it. I.e. we'll warn because we failed to re-encode, but we're using a stub function whose body is: { if (e) *e = 0; return NULL; } So ditto the garbage encoding name we should have died a lot earlier. Aside from your warning test the below makes tests in t4201-shortlog.sh fail, but those just seem broken to me. I.e. they seem to rely on git staying quiet if i18n.commitencoding is set to garbage. diff --git a/environment.c b/environment.c index 43bb1b35ffe..c26b18f8e5c 100644 --- a/environment.c +++ b/environment.c @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath) const char *get_log_output_encoding(void) { - return git_log_output_encoding ? git_log_output_encoding + const char *encoding = git_log_output_encoding ? git_log_output_encoding : get_commit_output_encoding(); +#ifndef NO_ICONV + iconv_t conv; + conv = iconv_open(encoding, "UTF-8"); + if (conv == (iconv_t) -1 && errno == EINVAL) + die_errno("the '%s' encoding is not known to iconv", encoding); +#else + if (strcmp(encoding, "UTF-8")) + die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding); +#endif + return encoding; } const char *get_commit_output_encoding(void) ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 13:47 ` Ævar Arnfjörð Bjarmason @ 2021-10-27 11:03 ` Jeff King 2021-10-29 10:47 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 33+ messages in thread From: Jeff King @ 2021-10-27 11:03 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Sat, Oct 09, 2021 at 03:47:16PM +0200, Ævar Arnfjörð Bjarmason wrote: > But in this case this seems to have been because someone tried to feed > "HTML" to it, which is not an encoding, and something iconv_open() has > (I daresay) always and will always error on. It returns -1 and sets > errno=EINVAL. > > So having a warning or other detection in the revision loop seems > backwards to me, surely we want something like the below instead? > I.e. die as close to bad option parsing as possible? Sorry for the slow response; this got thrown on my "to think about and look at later" pile. Yeah, I agree that if we sanity-checked the encoding up front, that would cover the case we saw in practice, and goes a long way towards catching any practical errors. But I think this patch is tricky: > diff --git a/environment.c b/environment.c > index 43bb1b35ffe..c26b18f8e5c 100644 > --- a/environment.c > +++ b/environment.c > @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath) > > const char *get_log_output_encoding(void) > { > - return git_log_output_encoding ? git_log_output_encoding > + const char *encoding = git_log_output_encoding ? git_log_output_encoding > : get_commit_output_encoding(); > +#ifndef NO_ICONV > + iconv_t conv; > + conv = iconv_open(encoding, "UTF-8"); > + if (conv == (iconv_t) -1 && errno == EINVAL) > + die_errno("the '%s' encoding is not known to iconv", encoding); > +#else > + if (strcmp(encoding, "UTF-8")) > + die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding); > +#endif > + return encoding; > } So one obvious problem here is that we call this function once per commit, so it's a lot of extra iconv_open() calls. But obviously we could use a static flag to do it once per process. The other issue is that it is assuming UTF-8 on one end of the conversion. But we aren't necessarily doing such a conversion; it depends on the commit's on-disk encoding, and the requested output encoding. In particular: - if both of those match, we do not need to call iconv at all (see the same_encoding() check in repo_logmsg_reencode()). With the patch above, the NO_ICONV case would start to die() when both are say iso8859-1, even though it currently works. - likewise, even if you have iconv support, it's possible that your preferred encoding is not compatible with utf8. In which case iconv_open() may complain, even though the actual conversion we'd ask it to do would succeed. I.e., I don't think there's a way to just ask iconv "does this encoding name by itself make any sense". You can only ask it about to/from combos. So I think a much better version of this is to catch the _actual_ iconv_open() call we make. And if it fails, say "woah, this combo of encodings isn't supported". The reason I didn't do that in the earlier patch is that all of this is obscured inside reencode_string_len(), which does both the iconv_open() and the iconv() call. We could surface that error information. But I'm not sure it would make sense to die() in that case. While for something like "git log --encoding=nonsense" every commit is going to fail to re-encode, it's still possible that iconv_open() failures are commit-specific. I.e., you could have some garbage commit in your history with an unsupported encoding, and you wouldn't want to die() for it (it's the same case you are complaining about having a warning for, but much worse). I suspect the best we could do along these lines is to wait until a real iconv_open(to, from) fails, and then as a fallback try: iconv_open("UTF-8", from); iconv_open(to, "UTF-8"); to sanity-check them individually, and guess that one of them is broken if it can't go to/from UTF-8. But even that feels like it's making assumptions about both the system iconv, and the charsets people use. To be clear, I'd expect that most people just use utf-8 in the first place, and even if they don't that their system has some basic utf-8 support. But we are deep into the realm of weird corner cases here, and the utility of this warning / error-checking doesn't seem high enough to merit the possible regressions we'd get by trying to make too many assumptions. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-27 11:03 ` Jeff King @ 2021-10-29 10:47 ` Ævar Arnfjörð Bjarmason 2021-10-29 20:40 ` Jeff King 0 siblings, 1 reply; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-29 10:47 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Wed, Oct 27 2021, Jeff King wrote: > On Sat, Oct 09, 2021 at 03:47:16PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> But in this case this seems to have been because someone tried to feed >> "HTML" to it, which is not an encoding, and something iconv_open() has >> (I daresay) always and will always error on. It returns -1 and sets >> errno=EINVAL. >> >> So having a warning or other detection in the revision loop seems >> backwards to me, surely we want something like the below instead? >> I.e. die as close to bad option parsing as possible? > > Sorry for the slow response; this got thrown on my "to think about and > look at later" pile. > > Yeah, I agree that if we sanity-checked the encoding up front, that > would cover the case we saw in practice, and goes a long way towards > catching any practical errors. > > But I think this patch is tricky: > >> diff --git a/environment.c b/environment.c >> index 43bb1b35ffe..c26b18f8e5c 100644 >> --- a/environment.c >> +++ b/environment.c >> @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath) >> >> const char *get_log_output_encoding(void) >> { >> - return git_log_output_encoding ? git_log_output_encoding >> + const char *encoding = git_log_output_encoding ? git_log_output_encoding >> : get_commit_output_encoding(); >> +#ifndef NO_ICONV >> + iconv_t conv; >> + conv = iconv_open(encoding, "UTF-8"); >> + if (conv == (iconv_t) -1 && errno == EINVAL) >> + die_errno("the '%s' encoding is not known to iconv", encoding); >> +#else >> + if (strcmp(encoding, "UTF-8")) >> + die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding); >> +#endif >> + return encoding; >> } > > So one obvious problem here is that we call this function once per > commit, so it's a lot of extra iconv_open() calls. But obviously we > could use a static flag to do it once per process. Yes, or the diff below, which seems like a better idea. I.e. stop calling this in a loop if we know we'll need it, have setup_revisions() populate it once. > The other issue is that it is assuming UTF-8 on one end of the > conversion. But we aren't necessarily doing such a conversion; it > depends on the commit's on-disk encoding, and the requested output > encoding. In particular: > > - if both of those match, we do not need to call iconv at all (see the > same_encoding() check in repo_logmsg_reencode()). With the patch > above, the NO_ICONV case would start to die() when both are say > iso8859-1, even though it currently works. > > - likewise, even if you have iconv support, it's possible that your > preferred encoding is not compatible with utf8. In which case > iconv_open() may complain, even though the actual conversion we'd > ask it to do would succeed. > > I.e., I don't think there's a way to just ask iconv "does this encoding > name by itself make any sense". You can only ask it about to/from > combos. Yes, I'm not saying it covers the general problem, but that it covers the specific complained-about issue of a completely nonsensical encoding like "HTML". We should simply error on that on command startup, whether or not we have any commits to visit. But yes, if we want to cover specific encoding issues, e.g. not being able to squash CJK UTF-8 into US-ASCII that needs to be per-commit, but... > So I think a much better version of this is to catch the _actual_ > iconv_open() call we make. And if it fails, say "woah, this combo of > encodings isn't supported". The reason I didn't do that in the earlier > patch is that all of this is obscured inside reencode_string_len(), > which does both the iconv_open() and the iconv() call. We could surface > that error information. > > But I'm not sure it would make sense to die() in that case. While for > something like "git log --encoding=nonsense" every commit is going to > fail to re-encode, it's still possible that iconv_open() failures are > commit-specific. I.e., you could have some garbage commit in your > history with an unsupported encoding, and you wouldn't want to die() for > it (it's the same case you are complaining about having a warning for, > but much worse). > > I suspect the best we could do along these lines is to wait until a real > iconv_open(to, from) fails, and then as a fallback try: > > iconv_open("UTF-8", from); > iconv_open(to, "UTF-8"); > > to sanity-check them individually, and guess that one of them is broken > if it can't go to/from UTF-8. But even that feels like it's making > assumptions about both the system iconv, and the charsets people use. > > To be clear, I'd expect that most people just use utf-8 in the first > place, and even if they don't that their system has some basic utf-8 > support. But we are deep into the realm of weird corner cases here, and > the utility of this warning / error-checking doesn't seem high enough to > merit the possible regressions we'd get by trying to make too many > assumptions. I still think the right move here is to just revert your patch. Yes we can think of a bunch of tricky edge cases that need to be fixed, but this is a long-standing problem, and the change as-is has some pretty bad UI problems. I think any change that solves those (probably starting with the below) is going to be relatively major this close to release. So per <87ily7m1mv.fsf@evledraar.gmail.com> why can't we just revert the warning(), and then consider a good way forward that covers some/all of these cases we've noted? Which should probably start with the below diff, so we can clearly distinguish startup bad encoding names from runtime ones (we'd put that proposed "die if iconv_open() doesn't like it" somewhere in setup_revisions()). diff --git a/builtin/log.c b/builtin/log.c index f75d87e8d7f..2b3a607e947 100644 --- a/builtin/log.c +++ b/builtin/log.c @@ -539,7 +539,7 @@ static void show_tagger(const char *buf, struct rev_info *rev) pp.fmt = rev->commit_format; pp.date_mode = rev->date_mode; - pp_user_info(&pp, "Tagger", &out, buf, get_log_output_encoding()); + pp_user_info(&pp, "Tagger", &out, buf, rev->log_output_encoding); fprintf(rev->diffopt.file, "%s", out.buf); strbuf_release(&out); } @@ -1208,7 +1208,7 @@ static void make_cover_letter(struct rev_info *rev, int use_separate_file, log.file = rev->diffopt.file; log.groups = SHORTLOG_GROUP_AUTHOR; for (i = 0; i < nr; i++) - shortlog_add_commit(&log, list[i]); + shortlog_add_commit(rev, &log, list[i]); shortlog_output(&log); diff --git a/builtin/merge.c b/builtin/merge.c index ea3112e0c0b..1a8fbafc149 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -431,6 +431,7 @@ static void squash_message(struct commit *commit, struct commit_list *remotehead ctx.abbrev = rev.abbrev; ctx.date_mode = rev.date_mode; ctx.fmt = rev.commit_format; + ctx.output_encoding = rev.log_output_encoding; strbuf_addstr(&out, "Squashed commit of the following:\n"); while ((commit = get_revision(&rev)) != NULL) { diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 36cb909ebaa..905ba7462f3 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -165,7 +165,7 @@ static void show_commit(struct commit *commit, void *data) ctx.date_mode = revs->date_mode; ctx.date_mode_explicit = revs->date_mode_explicit; ctx.fmt = revs->commit_format; - ctx.output_encoding = get_log_output_encoding(); + ctx.output_encoding = revs->log_output_encoding; ctx.color = revs->diffopt.use_color; pretty_print_commit(&ctx, commit, &buf); if (buf.len) { diff --git a/builtin/shortlog.c b/builtin/shortlog.c index e7f7af5de3f..2e5409a4fcc 100644 --- a/builtin/shortlog.c +++ b/builtin/shortlog.c @@ -198,7 +198,8 @@ static void insert_records_from_trailers(struct shortlog *log, unuse_commit_buffer(commit, commit_buffer); } -void shortlog_add_commit(struct shortlog *log, struct commit *commit) +void shortlog_add_commit(struct rev_info *rev, struct shortlog *log, + struct commit *commit) { struct strbuf ident = STRBUF_INIT; struct strbuf oneline = STRBUF_INIT; @@ -210,7 +211,7 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit) ctx.abbrev = log->abbrev; ctx.print_email_subject = 1; ctx.date_mode.type = DATE_NORMAL; - ctx.output_encoding = get_log_output_encoding(); + ctx.output_encoding = rev->log_output_encoding; if (!log->summary) { if (log->user_format) @@ -254,7 +255,7 @@ static void get_from_rev(struct rev_info *rev, struct shortlog *log) if (prepare_revision_walk(rev)) die(_("revision walk setup failed")); while ((commit = get_revision(rev)) != NULL) - shortlog_add_commit(log, commit); + shortlog_add_commit(rev, log, commit); } static int parse_uint(char const **arg, int comma, int defval) diff --git a/bundle.c b/bundle.c index a0bb687b0f4..32d74ab0bee 100644 --- a/bundle.c +++ b/bundle.c @@ -448,6 +448,7 @@ static int write_bundle_refs(int bundle_fd, struct rev_info *revs) struct bundle_prerequisites_info { struct object_array *pending; + struct rev_info *rev; int fd; }; @@ -464,7 +465,7 @@ static void write_bundle_prerequisites(struct commit *commit, void *data) write_or_die(bpi->fd, buf.buf, buf.len); ctx.fmt = CMIT_FMT_ONELINE; - ctx.output_encoding = get_log_output_encoding(); + ctx.output_encoding = bpi->rev->log_output_encoding; strbuf_reset(&buf); pretty_print_commit(&ctx, commit, &buf); strbuf_trim(&buf); @@ -544,6 +545,7 @@ int create_bundle(struct repository *r, const char *path, die("revision walk setup failed"); bpi.fd = bundle_fd; bpi.pending = &revs_copy.pending; + bpi.rev = &revs; traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi); object_array_remove_duplicates(&revs_copy.pending); diff --git a/log-tree.c b/log-tree.c index 644893fd8cf..d79324d5bfd 100644 --- a/log-tree.c +++ b/log-tree.c @@ -737,7 +737,7 @@ void show_log(struct rev_info *opt) raw = (opt->commit_format == CMIT_FMT_USERFORMAT); format_display_notes(&commit->object.oid, ¬ebuf, - get_log_output_encoding(), raw); + opt->log_output_encoding, raw); ctx.notes_message = strbuf_detach(¬ebuf, NULL); } @@ -758,7 +758,7 @@ void show_log(struct rev_info *opt) ctx.mailmap = opt->mailmap; ctx.color = opt->diffopt.use_color; ctx.expand_tabs_in_log = opt->expand_tabs_in_log; - ctx.output_encoding = get_log_output_encoding(); + ctx.output_encoding = opt->log_output_encoding; ctx.rev = opt; if (opt->from_ident.mail_begin && opt->from_ident.name_begin) ctx.from_ident = &opt->from_ident; diff --git a/pretty.c b/pretty.c index fe95107ae5a..6eb64e8189d 100644 --- a/pretty.c +++ b/pretty.c @@ -2048,15 +2048,17 @@ void pretty_print_commit(struct pretty_print_context *pp, int indent = 4; const char *msg; const char *reencoded; - const char *encoding; + const char *encoding = pp->output_encoding; int need_8bit_cte = pp->need_8bit_cte; + if (!encoding) + BUG("should have .output_encoding"); + if (pp->fmt == CMIT_FMT_USERFORMAT) { format_commit_message(commit, user_format, sb, pp); return; } - encoding = get_log_output_encoding(); msg = reencoded = logmsg_reencode(commit, NULL, encoding); if (pp->fmt == CMIT_FMT_ONELINE || cmit_fmt_is_mail(pp->fmt)) @@ -2121,6 +2123,14 @@ void pp_commit_easy(enum cmit_fmt fmt, const struct commit *commit, struct strbuf *sb) { struct pretty_print_context pp = {0}; + static const char *output_encoding; + + if (!output_encoding) { + const char *tmp = get_log_output_encoding(); + output_encoding = tmp; + } + pp.fmt = fmt; + pp.output_encoding = output_encoding; pretty_print_commit(&pp, commit, sb); } diff --git a/remote-curl.c b/remote-curl.c index d69156312bd..733a525bc73 100644 --- a/remote-curl.c +++ b/remote-curl.c @@ -345,8 +345,14 @@ static void free_discovery(struct discovery *d) static int show_http_message(struct strbuf *type, struct strbuf *charset, struct strbuf *msg) { + static const char *output_encoding; const char *p, *eol; + if (!output_encoding) { + const char *tmp = get_log_output_encoding(); + output_encoding = tmp; + } + /* * We only show text/plain parts, as other types are likely * to be ugly to look at on the user's terminal. @@ -354,7 +360,7 @@ static int show_http_message(struct strbuf *type, struct strbuf *charset, if (strcmp(type->buf, "text/plain")) return -1; if (charset->len) - strbuf_reencode(msg, charset->buf, get_log_output_encoding()); + strbuf_reencode(msg, charset->buf, output_encoding); strbuf_trim(msg); if (!msg->len) diff --git a/revision.c b/revision.c index ab7c1358042..0b2ad87f28e 100644 --- a/revision.c +++ b/revision.c @@ -2866,7 +2866,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED, &revs->grep_filter); - if (!is_encoding_utf8(get_log_output_encoding())) + revs->log_output_encoding = get_log_output_encoding(); + if (!is_encoding_utf8(revs->log_output_encoding)) revs->grep_filter.ignore_locale = 1; compile_grep_patterns(&revs->grep_filter); diff --git a/revision.h b/revision.h index 5578bb4720a..bf9dbca727e 100644 --- a/revision.h +++ b/revision.h @@ -237,6 +237,7 @@ struct rev_info { struct string_list *ref_message_ids; int add_signoff; const char *extra_headers; + const char *log_output_encoding; const char *log_reencode; const char *subject_prefix; int patch_name_max; diff --git a/shortlog.h b/shortlog.h index 3f7e9aabcae..f809617f8a0 100644 --- a/shortlog.h +++ b/shortlog.h @@ -30,7 +30,9 @@ struct shortlog { void shortlog_init(struct shortlog *log); -void shortlog_add_commit(struct shortlog *log, struct commit *commit); +struct rev_info; +void shortlog_add_commit(struct rev_info *rev, struct shortlog *log, + struct commit *commit); void shortlog_output(struct shortlog *log); diff --git a/submodule.c b/submodule.c index c6890705241..f10e9c34ff6 100644 --- a/submodule.c +++ b/submodule.c @@ -501,7 +501,7 @@ static void print_submodule_diff_summary(struct repository *r, struct rev_info * while ((commit = get_revision(rev))) { struct pretty_print_context ctx = {0}; ctx.date_mode = rev->date_mode; - ctx.output_encoding = get_log_output_encoding(); + ctx.output_encoding = rev->log_output_encoding; strbuf_setlen(&sb, 0); repo_format_commit_message(r, commit, format, &sb, &ctx); ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-29 10:47 ` Ævar Arnfjörð Bjarmason @ 2021-10-29 20:40 ` Jeff King 2021-10-29 20:45 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Jeff King @ 2021-10-29 20:40 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Fri, Oct 29, 2021 at 12:47:36PM +0200, Ævar Arnfjörð Bjarmason wrote: > > The other issue is that it is assuming UTF-8 on one end of the > > conversion. But we aren't necessarily doing such a conversion; it > > depends on the commit's on-disk encoding, and the requested output > > encoding. In particular: > > > > - if both of those match, we do not need to call iconv at all (see the > > same_encoding() check in repo_logmsg_reencode()). With the patch > > above, the NO_ICONV case would start to die() when both are say > > iso8859-1, even though it currently works. > > > > - likewise, even if you have iconv support, it's possible that your > > preferred encoding is not compatible with utf8. In which case > > iconv_open() may complain, even though the actual conversion we'd > > ask it to do would succeed. > > > > I.e., I don't think there's a way to just ask iconv "does this encoding > > name by itself make any sense". You can only ask it about to/from > > combos. > > Yes, I'm not saying it covers the general problem, but that it covers > the specific complained-about issue of a completely nonsensical encoding > like "HTML". We should simply error on that on command startup, whether > or not we have any commits to visit. I definitely agree with you on the direction, and I don't mind if we don't cover every case. What I was trying to point out above though is that the patch you showed actually _regresses_ some cases, and it's hard to robustly avoid that. > So per <87ily7m1mv.fsf@evledraar.gmail.com> why can't we just revert the > warning(), and then consider a good way forward that covers some/all of > these cases we've noted? Right, I agreed with that in the other thread. You may need to convince Junio. ;) TBH I am not even sure it is worth spending a lot of brain cells on the "and then consider..." part. Over all these years, we've had one report, and it simply misunderstand what "--encoding" was for. I thought it was something we could fix up easily by checking a return value, but IMHO doing it right is quite tricky because of iconv()'s limited interface, and the risk of regression outweighs the potential benefit. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-29 20:40 ` Jeff King @ 2021-10-29 20:45 ` Junio C Hamano 2021-10-29 20:52 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Junio C Hamano @ 2021-10-29 20:45 UTC (permalink / raw) To: Jeff King Cc: Ævar Arnfjörð Bjarmason, Krzysztof Żelechowski, git, Hamza Mahfooz Jeff King <peff@peff.net> writes: > TBH I am not even sure it is worth spending a lot of brain cells on the > "and then consider..." part. Over all these years, we've had one report, > and it simply misunderstand what "--encoding" was for. I thought it was > something we could fix up easily by checking a return value, but IMHO > doing it right is quite tricky because of iconv()'s limited interface, > and the risk of regression outweighs the potential benefit. I tend to agree with the above. Let's not over-engineer things. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-29 20:45 ` Junio C Hamano @ 2021-10-29 20:52 ` Junio C Hamano 2021-10-29 21:10 ` Jeff King 0 siblings, 1 reply; 33+ messages in thread From: Junio C Hamano @ 2021-10-29 20:52 UTC (permalink / raw) To: Jeff King Cc: Ævar Arnfjörð Bjarmason, Krzysztof Żelechowski, git, Hamza Mahfooz Junio C Hamano <gitster@pobox.com> writes: > Jeff King <peff@peff.net> writes: > >> TBH I am not even sure it is worth spending a lot of brain cells on the >> "and then consider..." part. Over all these years, we've had one report, >> and it simply misunderstand what "--encoding" was for. I thought it was >> something we could fix up easily by checking a return value, but IMHO >> doing it right is quite tricky because of iconv()'s limited interface, >> and the risk of regression outweighs the potential benefit. > > I tend to agree with the above. Let's not over-engineer things. ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----- From: Junio C Hamano <gitster@pobox.com> Date: Fri, 29 Oct 2021 13:48:58 -0700 Subject: [PATCH] Revert "logmsg_reencode(): warn when iconv() fails" This reverts commit fd680bc5 (logmsg_reencode(): warn when iconv() fails, 2021-08-27). Throwing a warning for each and every commit that gets reencoded, without allowing a way to squelch, would make it unpleasant for folks who have to deal with an ancient part of the history in an old project that used wrong encoding in the commits. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Documentation/pretty-options.txt | 4 +--- pretty.c | 6 +----- t/t4210-log-i18n.sh | 7 ------- 3 files changed, 2 insertions(+), 15 deletions(-) diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt index b3af850608..54d8bb3db0 100644 --- a/Documentation/pretty-options.txt +++ b/Documentation/pretty-options.txt @@ -40,9 +40,7 @@ people using 80-column terminals. defaults to UTF-8. Note that if an object claims to be encoded in `X` and we are outputting in `X`, we will output the object verbatim; this means that invalid sequences in the original - commit may be copied to the output. Likewise, if iconv(3) fails - to convert the commit, we will output the original object - verbatim, along with a warning. + commit may be copied to the output. --expand-tabs=<n>:: --expand-tabs:: diff --git a/pretty.c b/pretty.c index 73b5ead509..9631529c10 100644 --- a/pretty.c +++ b/pretty.c @@ -671,11 +671,7 @@ const char *repo_logmsg_reencode(struct repository *r, * If the re-encoding failed, out might be NULL here; in that * case we just return the commit message verbatim. */ - if (!out) { - warning("unable to reencode commit to '%s'", output_encoding); - return msg; - } - return out; + return out ? out : msg; } static int mailmap_name(const char **email, size_t *email_len, diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 0141f36e33..d2dfcf164e 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -131,11 +131,4 @@ do fi done -test_expect_success 'log shows warning when conversion fails' ' - enc=this-encoding-does-not-exist && - git log -1 --encoding=$enc 2>err && - echo "warning: unable to reencode commit to ${SQ}${enc}${SQ}" >expect && - test_cmp expect err -' - test_done -- 2.33.1-1007-g607b33ccc6 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-29 20:52 ` Junio C Hamano @ 2021-10-29 21:10 ` Jeff King 0 siblings, 0 replies; 33+ messages in thread From: Jeff King @ 2021-10-29 21:10 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, Krzysztof Żelechowski, git, Hamza Mahfooz On Fri, Oct 29, 2021 at 01:52:02PM -0700, Junio C Hamano wrote: > > I tend to agree with the above. Let's not over-engineer things. > > ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----- > From: Junio C Hamano <gitster@pobox.com> > Date: Fri, 29 Oct 2021 13:48:58 -0700 > Subject: [PATCH] Revert "logmsg_reencode(): warn when iconv() fails" > > This reverts commit fd680bc5 (logmsg_reencode(): warn when iconv() > fails, 2021-08-27). Throwing a warning for each and every commit > that gets reencoded, without allowing a way to squelch, would make > it unpleasant for folks who have to deal with an ancient part of the > history in an old project that used wrong encoding in the commits. Thanks for tying this up. I do think fd680bc5's documentation was good, though. So I'd suggest squashing this in, or applying it on top. -- >8 -- Subject: [PATCH] log: document --encoding behavior on iconv() failure We already note that we may produce invalid output when we skip calling iconv() altogether. But we may also do so if iconv() fails, and we have no good alternative. Let's document this to avoid surprising users. Signed-off-by: Jeff King <peff@peff.net> --- Documentation/pretty-options.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt index 54d8bb3db0..dc685be363 100644 --- a/Documentation/pretty-options.txt +++ b/Documentation/pretty-options.txt @@ -40,7 +40,9 @@ people using 80-column terminals. defaults to UTF-8. Note that if an object claims to be encoded in `X` and we are outputting in `X`, we will output the object verbatim; this means that invalid sequences in the original - commit may be copied to the output. + commit may be copied to the output. Likewise, if iconv(3) fails + to convert the commit, we will quietly output the original + object verbatim. --expand-tabs=<n>:: --expand-tabs:: -- 2.33.1.1446.g3ed047b518 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 2:42 ` Jeff King 2021-10-09 13:47 ` Ævar Arnfjörð Bjarmason @ 2021-10-22 22:58 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-22 22:58 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Fri, Oct 08 2021, Jeff King wrote: > On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote: > >> If that were coupled with, say, an advise() call to explain that output >> and matching might be inaccurate (and show that _once_), that might >> might it more clear what's going on. >> >> Now I am sympathetic to flooding the user with too many messages, and >> maybe reducing this to a single instance of "some commit messages could >> not be re-encoded; output and matching might be inaccurate" is the right >> thing. But in a sense, it's also working as designed: what you asked for >> is producing wrong output over and over, and Git is saying so. > > The single-output version would perhaps be something like this: > > diff --git a/pretty.c b/pretty.c > index 708b618cfe..c86f41bae7 100644 > --- a/pretty.c > +++ b/pretty.c > @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding) > return strbuf_detach(&tmp, NULL); > } > > +static void show_encoding_warning(const char *output_encoding) > +{ > + static int seen_warning; > + > + if (seen_warning) > + return; > + > + seen_warning = 1; > + warning("one or more commits could not be re-encoded to '%s'", > + output_encoding); > + advise("When re-encoding fails, some output may be in an unexpected\n" > + "encoding, and pattern matches against commit data may be\n" > + "inaccurate."); > +} > + > const char *repo_logmsg_reencode(struct repository *r, > const struct commit *commit, > char **commit_encoding, > @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r, > * case we just return the commit message verbatim. > */ > if (!out) { > - warning("unable to reencode commit to '%s'", output_encoding); > + show_encoding_warning(output_encoding); > return msg; > } > return out; *Poke* about this. We're getting pretty close to release. I think the WIP hunk I posted in https://lore.kernel.org/git/871r4umfnm.fsf@evledraar.gmail.com/ presents a good way forward with this. I.e. aside from how wise it is to warn about this in general, I think there's a pretty bad bug in your implementation where what should effectively be a parse_options() or git_config()-time one-off warning is being fired off for every single commit in "git log" in some cases. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-09 2:36 ` Jeff King 2021-10-09 2:42 ` Jeff King @ 2021-10-10 13:53 ` Johannes Sixt 2021-10-10 15:43 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 33+ messages in thread From: Johannes Sixt @ 2021-10-10 13:53 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz, Ævar Arnfjörð Bjarmason Am 09.10.21 um 04:36 schrieb Jeff King: > On Sat, Oct 09, 2021 at 02:58:10AM +0200, Ævar Arnfjörð Bjarmason wrote: > >> I ran into this while testing the grep coloring patch[1] (but it's >> unrelated). Before this commit e.g.: >> >> LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l >> 28333 >> >> So ~3k lines for my last 100 commits, but then: >> >> $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning >> 299 >> >> At first I thought it was spewing warnings for every failed re-encoded >> line in some cases, because I get hundreds at a time sometimes, but it's >> because stderr and stdout I/O buffering is different (a common >> case). Adding a "fflush(stderr)" "fixes" that. > > I don't think the buffering is the issue. By default stderr flushes on > lines, and we flush commits after showing them. If you take away "-P" > (or look at the combined 2>&1 output in order), you'll see that they are > grouped. > > Now one thing you might notice is that there may be multiple warnings > between output commits. But that's because we really are re-encoding > each of those intermediate commits to do your --author grep. And if that > re-encoding fails, we may well be producing the wrong output, because > the matching won't be correct (in your case, presumably the correct > output should be _nothing_, because Æ is not an ascii character). I don't understand why i18n.commitEncoding plays a role here. Isn't it an instruction "when you make a commit, mark the commit message having this encoding". But grep does not make a commit. If this were i18n.logOuputEncoding it would make much more sense. Have I misunderstood the meaning of the two options? -- Hannes ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: *Really* noisy encoding warnings post-v2.33.0 2021-10-10 13:53 ` Johannes Sixt @ 2021-10-10 15:43 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-10-10 15:43 UTC (permalink / raw) To: Johannes Sixt Cc: Jeff King, Junio C Hamano, Krzysztof Żelechowski, git, Hamza Mahfooz On Sun, Oct 10 2021, Johannes Sixt wrote: > Am 09.10.21 um 04:36 schrieb Jeff King: >> On Sat, Oct 09, 2021 at 02:58:10AM +0200, Ævar Arnfjörð Bjarmason wrote: >> >>> I ran into this while testing the grep coloring patch[1] (but it's >>> unrelated). Before this commit e.g.: >>> >>> LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100|wc -l >>> 28333 >>> >>> So ~3k lines for my last 100 commits, but then: >>> >>> $ LC_ALL=C ~/g/git/git -P -c i18n.commitEncoding=ascii log --author=Ævar -100 2>&1|grep -c ^warning >>> 299 >>> >>> At first I thought it was spewing warnings for every failed re-encoded >>> line in some cases, because I get hundreds at a time sometimes, but it's >>> because stderr and stdout I/O buffering is different (a common >>> case). Adding a "fflush(stderr)" "fixes" that. >> >> I don't think the buffering is the issue. By default stderr flushes on >> lines, and we flush commits after showing them. If you take away "-P" >> (or look at the combined 2>&1 output in order), you'll see that they are >> grouped. >> >> Now one thing you might notice is that there may be multiple warnings >> between output commits. But that's because we really are re-encoding >> each of those intermediate commits to do your --author grep. And if that >> re-encoding fails, we may well be producing the wrong output, because >> the matching won't be correct (in your case, presumably the correct >> output should be _nothing_, because Æ is not an ascii character). > > I don't understand why i18n.commitEncoding plays a role here. Isn't it > an instruction "when you make a commit, mark the commit message having > this encoding". But grep does not make a commit. > > If this were i18n.logOuputEncoding it would make much more sense. > > Have I misunderstood the meaning of the two options? It doesn't, see my later <871r4umfnm.fsf@evledraar.gmail.com> for when I got it right. For the amount of warnings etc. it's the same, whether we call iconv because it's e.g. ascii->utf-8 and that triggers iconv() issues, or (with i18n.logOuputEncoding) utf-8->ascii. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 0:57 ` Jeff King 2021-08-25 16:31 ` Junio C Hamano @ 2021-08-25 23:00 ` Krzysztof Żelechowski 2021-08-27 18:33 ` Jeff King 2021-08-25 23:28 ` Krzysztof Żelechowski 2 siblings, 1 reply; 33+ messages in thread From: Krzysztof Żelechowski @ 2021-08-25 23:00 UTC (permalink / raw) To: Jeff King; +Cc: git Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze: > As far as what you're trying to accomplish, HTML-escaping isn't > something Git supports. You'll have to run the output through an > external escaping mechanism. Have you looked at the format? It is a HTML fragment with placeholders to be filled by git log. I cannot run the output through an external escaping mechanism because it will kill the markup that is already there. Chris ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 23:00 ` git log --encoding=HTML is not supported Krzysztof Żelechowski @ 2021-08-27 18:33 ` Jeff King 0 siblings, 0 replies; 33+ messages in thread From: Jeff King @ 2021-08-27 18:33 UTC (permalink / raw) To: Krzysztof Żelechowski; +Cc: git On Thu, Aug 26, 2021 at 01:00:44AM +0200, Krzysztof Żelechowski wrote: > Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze: > > As far as what you're trying to accomplish, HTML-escaping isn't > > something Git supports. You'll have to run the output through an > > external escaping mechanism. > > Have you looked at the format? It is a HTML fragment with placeholders to be > filled by git log. I cannot run the output through an external escaping > mechanism because it will kill the markup that is already there. Right, what I mean is that you'd have to pull the output out of Git, and then format (and escape) it separately. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 0:57 ` Jeff King 2021-08-25 16:31 ` Junio C Hamano 2021-08-25 23:00 ` git log --encoding=HTML is not supported Krzysztof Żelechowski @ 2021-08-25 23:28 ` Krzysztof Żelechowski 2021-08-25 23:47 ` Bryan Turner 2 siblings, 1 reply; 33+ messages in thread From: Krzysztof Żelechowski @ 2021-08-25 23:28 UTC (permalink / raw) To: Jeff King; +Cc: git Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze: > diff --git a/pretty.c b/pretty.c Please fix the manual for git log. It should say what encoding is recognised (namely if supported by iconv(1), except that POSIX character maps of iconv(1p) are not supported), and that an unrecognised encoding is ignored. I would also like to see the HTML encoding supported independently of iconv, which seems like a pretty easy thing to do. Dream on, I guess? Chris ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 23:28 ` Krzysztof Żelechowski @ 2021-08-25 23:47 ` Bryan Turner 2021-08-26 15:37 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Bryan Turner @ 2021-08-25 23:47 UTC (permalink / raw) To: Krzysztof Żelechowski; +Cc: Jeff King, Git Users On Wed, Aug 25, 2021 at 4:29 PM Krzysztof Żelechowski <giecrilj@stegny.2a.pl> wrote: > > Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze: > > diff --git a/pretty.c b/pretty.c > > Please fix the manual for git log. It should say what encoding is recognised > (namely if supported by iconv(1), except that POSIX character maps of > iconv(1p) are not supported), and that an unrecognised encoding is ignored. > > I would also like to see the HTML encoding supported independently of iconv, > which seems like a pretty easy thing to do. Dream on, I guess? I suspect the answer is less "Dream on" and more "Patches welcome." > > Chris > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-25 23:47 ` Bryan Turner @ 2021-08-26 15:37 ` Junio C Hamano 2021-08-26 20:52 ` Krzysztof Żelechowski 0 siblings, 1 reply; 33+ messages in thread From: Junio C Hamano @ 2021-08-26 15:37 UTC (permalink / raw) To: Bryan Turner; +Cc: Krzysztof Żelechowski, Jeff King, Git Users Bryan Turner <bturner@atlassian.com> writes: > On Wed, Aug 25, 2021 at 4:29 PM Krzysztof Żelechowski > <giecrilj@stegny.2a.pl> wrote: >> >> Dnia środa, 25 sierpnia 2021 02:57:47 CEST Jeff King pisze: >> > diff --git a/pretty.c b/pretty.c >> >> Please fix the manual for git log. It should say what encoding is recognised >> (namely if supported by iconv(1), except that POSIX character maps of >> iconv(1p) are not supported), and that an unrecognised encoding is ignored. >> >> I would also like to see the HTML encoding supported independently of iconv, >> which seems like a pretty easy thing to do. Dream on, I guess? > > I suspect the answer is less "Dream on" and more "Patches welcome." Patches are welcomed but not before a proposed design is freshed out. I am sure people do welcome the design discussion. Pieces taken from the contents stored in Git (like "the title of the commit", "the name of the author of the commit") may need quoting and/or escaping when they are incorporated into a string to become parts of "output", and the way the quoting/escaping must be done would depend on the "host" language/format. HTML has its own requirements for how these pieces coming from Git contents are quoted, but it will not be the only "host" language that needs quoting. The requirement for the feature we are "Dreaming on" may be much closer to the "host language" options (e.g. --tcl, --perl ...) the "git for-each-ref" command has. These options tells us to format each piece of information (e.g. "%(subject)") taken from Git as a natural 'string' constant in the host language, so that git for-each-ref --shell \ --format='do_something %(authorname) %(authoremail)' would write a shell script that calls "do_something" command with two arguments for each ref enumerated by the command, without having to worry about whitespaces and quote characters that may appear in the interpolated pieces. It is immediately obvious that within the context of the for-each-ref command, the follwoing would equally be useful (note: this is already "dreaming on" and does not exist yet): echo "<ul>" git for-each-ref --html \ --format='<li>%(authoremail)</li>' echo "</ul>" As we have been seeing efforts to port features around the --format option between the for-each-ref family of commands and the log family of commands, I would also imagine that it would be natural future direction to extend it to the latter and eventually allow git log --html \ --format='<tr><td>%h</td><td>%s</td>...</tr>' to format each commit into a single row in HTML table, and things like that. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-26 15:37 ` Junio C Hamano @ 2021-08-26 20:52 ` Krzysztof Żelechowski 2021-08-27 15:59 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Krzysztof Żelechowski @ 2021-08-26 20:52 UTC (permalink / raw) To: Bryan Turner, Junio C Hamano; +Cc: Jeff King, Git Users Dnia czwartek, 26 sierpnia 2021 17:37:40 CEST Junio C Hamano pisze: > git log --html \ > --format='<tr><td>%h</td><td>%s</td>...</tr>' I would like to be able to say: { git config i18n.logOutputEscape HTML; } What do you think? Chris ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-26 20:52 ` Krzysztof Żelechowski @ 2021-08-27 15:59 ` Junio C Hamano 2021-08-27 18:37 ` Jeff King 0 siblings, 1 reply; 33+ messages in thread From: Junio C Hamano @ 2021-08-27 15:59 UTC (permalink / raw) To: Krzysztof Żelechowski; +Cc: Bryan Turner, Jeff King, Git Users Krzysztof Żelechowski <giecrilj@stegny.2a.pl> writes: > Dnia czwartek, 26 sierpnia 2021 17:37:40 CEST Junio C Hamano pisze: >> git log --html \ >> --format='<tr><td>%h</td><td>%s</td>...</tr>' > > I would like to be able to say: > { git config i18n.logOutputEscape HTML; } > > What do you think? It depends on what it does. If the configuration means that "git log" output (with any supported options, like "-p") will be given with '<' written to '^lt;' etc. so that it becomes safe to dump it in HTML, it fails to interest me at all. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-27 15:59 ` Junio C Hamano @ 2021-08-27 18:37 ` Jeff King 2021-08-27 21:51 ` Junio C Hamano 0 siblings, 1 reply; 33+ messages in thread From: Jeff King @ 2021-08-27 18:37 UTC (permalink / raw) To: Junio C Hamano; +Cc: Krzysztof Żelechowski, Bryan Turner, Git Users On Fri, Aug 27, 2021 at 08:59:40AM -0700, Junio C Hamano wrote: > Krzysztof Żelechowski <giecrilj@stegny.2a.pl> writes: > > > Dnia czwartek, 26 sierpnia 2021 17:37:40 CEST Junio C Hamano pisze: > >> git log --html \ > >> --format='<tr><td>%h</td><td>%s</td>...</tr>' > > > > I would like to be able to say: > > { git config i18n.logOutputEscape HTML; } > > > > What do you think? > > It depends on what it does. > > If the configuration means that "git log" output (with any supported > options, like "-p") will be given with '<' written to '^lt;' etc. so > that it becomes safe to dump it in HTML, it fails to interest me at > all. Yeah, I think things get pretty weird when you start thinking about dumping whole filenames and diff contents. I wouldn't be opposed to an option for the pretty formatter to have encodings. Something like: git log --format='%(authorname:quote=html)' I'd probably put off implementing that until we actually unify the for-each-ref and pretty formats, though (we do not even have %(authorname) at this point!). The latter already has a quoting mechanism for shell/perl/python/tcl (though it is not per-atom, and I wouldn't be opposed to a --format-quote option that quoted all pretty.c placeholders). -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: git log --encoding=HTML is not supported 2021-08-27 18:37 ` Jeff King @ 2021-08-27 21:51 ` Junio C Hamano 0 siblings, 0 replies; 33+ messages in thread From: Junio C Hamano @ 2021-08-27 21:51 UTC (permalink / raw) To: Jeff King; +Cc: Krzysztof Żelechowski, Bryan Turner, Git Users Jeff King <peff@peff.net> writes: > I wouldn't be opposed to an option for the pretty formatter to have > encodings. Something like: > > git log --format='%(authorname:quote=html)' > > I'd probably put off implementing that until we actually unify the > for-each-ref and pretty formats, though (we do not even have > %(authorname) at this point!). The latter already has a quoting > mechanism for shell/perl/python/tcl (though it is not per-atom, and I > wouldn't be opposed to a --format-quote option that quoted all pretty.c > placeholders). Yeah, per-atom would be nice, as we can specify which piece needs what kind of quoting, e.g. git log --format=' if test %(authoremail:quote=shell) != "gitster@pobox.com" then echo %(authorname:quote=html+shell) fi ' | sh can be used to write a script to produce an "echo" command with a shell literal string as its argument, where that literal string writes author's name in a way that can be inserted in an HTML document, but omitting the commits by me. ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2021-10-29 21:10 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-08-24 9:00 git log --encoding=HTML is not supported Krzysztof Żelechowski 2021-08-24 10:31 ` Bagas Sanjaya 2021-08-24 10:33 ` Krzysztof Żelechowski 2021-08-24 10:46 ` Bagas Sanjaya 2021-08-24 19:11 ` Junio C Hamano 2021-08-25 0:57 ` Jeff King 2021-08-25 16:31 ` Junio C Hamano 2021-08-27 18:30 ` Jeff King 2021-08-27 18:32 ` Jeff King 2021-08-27 19:47 ` Junio C Hamano 2021-10-09 0:58 ` *Really* noisy encoding warnings post-v2.33.0 Ævar Arnfjörð Bjarmason 2021-10-09 1:29 ` Ævar Arnfjörð Bjarmason 2021-10-09 2:36 ` Jeff King 2021-10-09 2:42 ` Jeff King 2021-10-09 13:47 ` Ævar Arnfjörð Bjarmason 2021-10-27 11:03 ` Jeff King 2021-10-29 10:47 ` Ævar Arnfjörð Bjarmason 2021-10-29 20:40 ` Jeff King 2021-10-29 20:45 ` Junio C Hamano 2021-10-29 20:52 ` Junio C Hamano 2021-10-29 21:10 ` Jeff King 2021-10-22 22:58 ` Ævar Arnfjörð Bjarmason 2021-10-10 13:53 ` Johannes Sixt 2021-10-10 15:43 ` Ævar Arnfjörð Bjarmason 2021-08-25 23:00 ` git log --encoding=HTML is not supported Krzysztof Żelechowski 2021-08-27 18:33 ` Jeff King 2021-08-25 23:28 ` Krzysztof Żelechowski 2021-08-25 23:47 ` Bryan Turner 2021-08-26 15:37 ` Junio C Hamano 2021-08-26 20:52 ` Krzysztof Żelechowski 2021-08-27 15:59 ` Junio C Hamano 2021-08-27 18:37 ` Jeff King 2021-08-27 21:51 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).