git@vger.kernel.org mailing list mirror (one of many)
* [PATCH] gitweb: apply fallback encoding before highlight
@ 2016-04-20 11:32 Shin Kojima
  2016-05-02 17:49 ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Shin Kojima @ 2016-04-20 11:32 UTC
  To: git; +Cc: Christopher Wilson, Jakub Narebski, Shin Kojima

Some multi-byte character encodings (such as Shift_JIS and GBK) have
characters whose final byte is an ASCII '\' (0x5c), and they
will be displayed as garbled characters even if $fallback_encoding is
correct.  This is because the `highlight` command always expects UTF-8
encoded strings from STDIN.

    $ echo 'my $v = "申";' | highlight --syntax perl | w3m -T text/html -dump
    my $v = "申";

    $ echo 'my $v = "申";' | iconv -f UTF-8 -t Shift_JIS | highlight \
        --syntax perl | iconv -f Shift_JIS -t UTF-8 | w3m -T text/html -dump

    iconv: (stdin):9:135: cannot convert
    my $v = "

This patch converts git blob objects to UTF-8 before
highlighting, in the manner of the `to_utf8` subroutine.
---
 gitweb/gitweb.perl | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 05d7910..2fddf75 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -3935,6 +3935,9 @@ sub run_highlighter {
 
 	close $fd;
 	open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ".
+	          quote_command($^X, '-CO', '-MEncode=decode,FB_DEFAULT', '-pse',
+	            '$_ = decode($fe, $_, FB_DEFAULT) if !utf8::decode($_);',
+	            '--', "-fe=$fallback_encoding")." | ".
 	          quote_command($highlight_bin).
 	          " --replace-tabs=8 --fragment --syntax $syntax |"
 		or die_error(500, "Couldn't open file or run syntax highlighter");
-- 
2.8.1


* Re: [PATCH] gitweb: apply fallback encoding before highlight
  2016-04-20 11:32 [PATCH] gitweb: apply fallback encoding before highlight Shin Kojima
@ 2016-05-02 17:49 ` Junio C Hamano
  2016-05-02 18:12   ` Jakub Narębski
  2016-05-03 13:00   ` [PATCH v2] " Shin Kojima
  0 siblings, 2 replies; 8+ messages in thread
From: Junio C Hamano @ 2016-05-02 17:49 UTC
  To: Shin Kojima; +Cc: git, Christopher Wilson, Jakub Narebski

Shin Kojima <shin@kojima.org> writes:

> Some multi-byte character encodings (such as Shift_JIS and GBK) have
> characters whose final byte is an ASCII '\' (0x5c), and they
> will be displayed as garbled characters even if $fallback_encoding is
> correct.  This is because the `highlight` command always expects UTF-8
> encoded strings from STDIN.
>
>     $ echo 'my $v = "申";' | highlight --syntax perl | w3m -T text/html -dump
>     my $v = "申";
>
>     $ echo 'my $v = "申";' | iconv -f UTF-8 -t Shift_JIS | highlight \
>         --syntax perl | iconv -f Shift_JIS -t UTF-8 | w3m -T text/html -dump
>
>     iconv: (stdin):9:135: cannot convert
>     my $v = "
>
> This patch converts git blob objects to UTF-8 before
> highlighting, in the manner of the `to_utf8` subroutine.
> ---

The Perl one-liner invoked from the script felt a bit too dense
for my taste, but other than that I have no complaints about what the
patched code does.

Jakub, does it look good to you, too?

Please sign-off your patch (see Documentation/SubmittingPatches).

Thanks.


>  gitweb/gitweb.perl | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 05d7910..2fddf75 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -3935,6 +3935,9 @@ sub run_highlighter {
>  
>  	close $fd;
>  	open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ".
> +	          quote_command($^X, '-CO', '-MEncode=decode,FB_DEFAULT', '-pse',
> +	            '$_ = decode($fe, $_, FB_DEFAULT) if !utf8::decode($_);',
> +	            '--', "-fe=$fallback_encoding")." | ".
>  	          quote_command($highlight_bin).
>  	          " --replace-tabs=8 --fragment --syntax $syntax |"
>  		or die_error(500, "Couldn't open file or run syntax highlighter");


* Re: [PATCH] gitweb: apply fallback encoding before highlight
  2016-05-02 17:49 ` Junio C Hamano
@ 2016-05-02 18:12   ` Jakub Narębski
  2016-05-03 13:00   ` [PATCH v2] " Shin Kojima
  1 sibling, 0 replies; 8+ messages in thread
From: Jakub Narębski @ 2016-05-02 18:12 UTC
  To: Junio C Hamano; +Cc: Shin Kojima, git, Christopher Wilson

On Mon, May 2, 2016 at 7:49 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Shin Kojima <shin@kojima.org> writes:
>
>> This patch converts git blob objects to UTF-8 before
>> highlighting, in the manner of the `to_utf8` subroutine.
>> ---
>
> The Perl one-liner invoked from the script felt a bit too dense
> for my taste, but other than that I have no complaints about what the
> patched code does.
>
> Jakub, does it look good to you, too?

Yes, it looks all right to me. $^X is the current Perl interpreter.
-CO means that the output is UTF-8 (for the `highlight` command), -p
means read each line and print it (it could be replaced by an explicit
"print" in the one-liner), -s is there to pass $fallback_encoding as
$fe (it could be replaced, but that would require some fiddling with
quoting $s), and -e '...' means execute the given one-liner.
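
The decoding logic of the one-liner can also be sketched outside Perl.
The following Python is only an illustration, not the code gitweb runs;
the function name `to_utf8_text` and the sample fallback encoding are
assumptions made for this example. It tries strict UTF-8 first and
otherwise decodes with the fallback encoding, substituting replacement
characters for undecodable bytes, which is roughly what
`decode($fe, $_, FB_DEFAULT) if !utf8::decode($_)` does for each line.

```python
def to_utf8_text(raw: bytes, fallback_encoding: str = "shift_jis") -> str:
    """Decode one line: strict UTF-8 first, else the fallback encoding.

    Hypothetical helper for illustration; gitweb itself does this in Perl.
    """
    try:
        # Valid UTF-8 input is taken as-is.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the fallback encoding; bytes that cannot be
        # decoded become U+FFFD instead of raising an error.
        return raw.decode(fallback_encoding, errors="replace")


if __name__ == "__main__":
    source = 'my $v = "申";\n'
    sjis_line = source.encode("shift_jis")
    utf8_line = source.encode("utf-8")
    # Both inputs end up as the same text, ready to be written out as UTF-8.
    print(to_utf8_text(sjis_line) == to_utf8_text(utf8_line))
```

With the fallback set to the configured $fallback_encoding, the Perl
one-liner plays the same role in front of `highlight`.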

> Please sign-off your patch (see Documentation/SubmittingPatches).
>
> Thanks.
>
>
>>  gitweb/gitweb.perl | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 05d7910..2fddf75 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -3935,6 +3935,9 @@ sub run_highlighter {
>>
>>       close $fd;
>>       open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ".
>> +               quote_command($^X, '-CO', '-MEncode=decode,FB_DEFAULT', '-pse',
>> +                 '$_ = decode($fe, $_, FB_DEFAULT) if !utf8::decode($_);',
>> +                 '--', "-fe=$fallback_encoding")." | ".
>>                 quote_command($highlight_bin).
>>                 " --replace-tabs=8 --fragment --syntax $syntax |"
>>               or die_error(500, "Couldn't open file or run syntax highlighter");



-- 
Jakub Narebski


* [PATCH v2] gitweb: apply fallback encoding before highlight
  2016-05-02 17:49 ` Junio C Hamano
  2016-05-02 18:12   ` Jakub Narębski
@ 2016-05-03 13:00   ` Shin Kojima
  2016-05-03 18:33     ` Junio C Hamano
  1 sibling, 1 reply; 8+ messages in thread
From: Shin Kojima @ 2016-05-03 13:00 UTC
  To: git; +Cc: Christopher Wilson, Jakub Narebski, Shin Kojima

Some multi-byte character encodings (such as Shift_JIS and GBK) have
characters whose final byte is an ASCII '\' (0x5c), and they
will be displayed as garbled characters even if $fallback_encoding is
correct.  This is because the `highlight` command always expects UTF-8
encoded strings from STDIN.

    $ echo 'my $v = "申";' | highlight --syntax perl | w3m -T text/html -dump
    my $v = "申";

    $ echo 'my $v = "申";' | iconv -f UTF-8 -t Shift_JIS | highlight \
        --syntax perl | iconv -f Shift_JIS -t UTF-8 | w3m -T text/html -dump

    iconv: (stdin):9:135: cannot convert
    my $v = "

This patch converts git blob objects to UTF-8 before
highlighting, in the manner of the `to_utf8` subroutine.

Signed-off-by: Shin Kojima <shin@kojima.org>
---

Changes for v2:
    - Add Signed-off-by

Thanks,
Shin Kojima

 gitweb/gitweb.perl | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 05d7910..2fddf75 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -3935,6 +3935,9 @@ sub run_highlighter {
 
 	close $fd;
 	open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ".
+	          quote_command($^X, '-CO', '-MEncode=decode,FB_DEFAULT', '-pse',
+	            '$_ = decode($fe, $_, FB_DEFAULT) if !utf8::decode($_);',
+	            '--', "-fe=$fallback_encoding")." | ".
 	          quote_command($highlight_bin).
 	          " --replace-tabs=8 --fragment --syntax $syntax |"
 		or die_error(500, "Couldn't open file or run syntax highlighter");
-- 
2.8.2


* Re: [PATCH v2] gitweb: apply fallback encoding before highlight
  2016-05-03 13:00   ` [PATCH v2] " Shin Kojima
@ 2016-05-03 18:33     ` Junio C Hamano
  2016-05-04  8:34       ` Shin Kojima
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2016-05-03 18:33 UTC
  To: Shin Kojima; +Cc: git, Christopher Wilson, Jakub Narebski

Shin Kojima <shin@kojima.org> writes:

> Some multi-byte character encodings (such as Shift_JIS and GBK) have
> characters whose final byte is an ASCII '\' (0x5c), and they
> will be displayed as garbled characters even if $fallback_encoding is
> correct.

Just out of curiosity, do people still use Shift_JIS aka MS-Kanji?
It feels so last-decade, if not last-century ;-)


* Re: [PATCH v2] gitweb: apply fallback encoding before highlight
  2016-05-03 18:33     ` Junio C Hamano
@ 2016-05-04  8:34       ` Shin Kojima
  2016-05-04 19:34         ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Shin Kojima @ 2016-05-04  8:34 UTC
  To: Junio C Hamano; +Cc: Shin Kojima, git, Christopher Wilson, Jakub Narebski

On Tue, May 03, 2016 at 11:33:42AM -0700, Junio C Hamano wrote:
> Shin Kojima <shin@kojima.org> writes:
> 
> > Some multi-byte character encodings (such as Shift_JIS and GBK) have
> > characters whose final byte is an ASCII '\' (0x5c), and they
> > will be displayed as garbled characters even if $fallback_encoding is
> > correct.
> 
> Just out of curiosity, do people still use Shift_JIS aka MS-Kanji?
> It feels so last-decade, if not last-century ;-)

Yes, they do. There is still tons of code from the '90s lying around.

I keep failing to persuade my boss to migrate our codebase away from
cp932 (Windows-31J/MS-Kanji); the answer is always that there is no
incentive to do so.

I would say this patch, which takes $fallback_encoding into account
while highlighting, is fairly reasonable.  But I also feel it is a lot
of machinery just for a few outdated character encodings; it is
completely useless for most gitweb users in the world.

If this patch is not acceptable, I would rather collect feedback from
you all to help convince our management.


* Re: [PATCH v2] gitweb: apply fallback encoding before highlight
  2016-05-04  8:34       ` Shin Kojima
@ 2016-05-04 19:34         ` Junio C Hamano
  2016-05-05 10:22           ` Shin Kojima
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2016-05-04 19:34 UTC
  To: Shin Kojima; +Cc: git, Christopher Wilson, Jakub Narebski

Shin Kojima <shin@kojima.org> writes:

> I would say this patch, which takes $fallback_encoding into account
> while highlighting, is fairly reasonable.  But I also feel it is a lot
> of machinery just for a few outdated character encodings; it is
> completely useless for most gitweb users in the world.

Oh, don't get me wrong. I do think what the patch does is very
sensible and have no intention of rejecting it.

Unless somebody finds a new bug in it, but in that case, we won't be
rejecting it but would be improving on it.

As I said, the question was "Just out of curiosity", since it's been
so long since I was in any part of software work done in Japan.


* Re: [PATCH v2] gitweb: apply fallback encoding before highlight
  2016-05-04 19:34         ` Junio C Hamano
@ 2016-05-05 10:22           ` Shin Kojima
  0 siblings, 0 replies; 8+ messages in thread
From: Shin Kojima @ 2016-05-05 10:22 UTC
  To: Junio C Hamano; +Cc: Shin Kojima, git, Christopher Wilson, Jakub Narebski

> Oh, don't get me wrong. I do think what the patch does is very
> sensible and have no intention of rejecting it.

I'm sorry for making you worry; my poor English caused a
misunderstanding.  I raised this Shift_JIS-related problem (a.k.a.
"ダメ文字" in Japanese) knowing that it might attract your interest.
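
To see why such characters trip up byte-oriented tools, one can look
at their encoded bytes. This is a minimal Python sketch for
illustration only; the helper name `has_backslash_trail_byte` and the
sample characters are made up for this example.

```python
def has_backslash_trail_byte(ch: str, encoding: str = "shift_jis") -> bool:
    """Return True if ch is a two-byte character whose last byte is 0x5c.

    Hypothetical helper for illustration; not part of gitweb.
    """
    raw = ch.encode(encoding)
    return len(raw) == 2 and raw[1] == 0x5C


if __name__ == "__main__":
    # A few sample characters; "A" is a single-byte control case.
    for ch in "申ソA":
        raw = ch.encode("shift_jis")
        print(ch, raw.hex(), has_backslash_trail_byte(ch))
```

A tool that scans the raw bytes without knowing the encoding sees that
trailing 0x5c as an ordinary backslash, for example as an escape
character inside a string literal.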

I would like to hear frank opinions from engineers who have high
abilities like you. ;)

> As I said, the question was "Just out of curiosity", since it's been
> so long since I was in any part of software work done in Japan.

Having said that, many people in Japan are still suffering from these
character encoding barriers.  This was a huge shock for me, since
I was studying information science in China as a Japanese foreign
student.

