git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Shin Kojima <shin@kojima.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: Shin Kojima <shin@kojima.org>, git@vger.kernel.org, jnareb@gmail.com
Subject: Re: [PATCH] gitweb: Measure offsets against UTF-8 flagged string
Date: Wed, 2 May 2018 20:47:45 +0900	[thread overview]
Message-ID: <20180502114743.cmetkapax2y5mipj@skmbp> (raw)
In-Reply-To: <xmqqtvrqv632.fsf@gitster-ct.c.googlers.com>

> ideally we should be able to say "function X takes non-UTF8 and
> works on it", "function Y takes UTF8 and works on it", and "function
> Z takes non-UTF8 and gives UTF8 data back" for each functions
> clearly, not "function W can take either UTF8 or any other garbage
> and tries to return UTF8".

Yes, I totally agree with that.


> Some codepaths in the resulting codeflow look even harder than they
> already are.  For example, format_rem_add_lines_pair() calls
> untabify() and then feeds its result to esc_html().

Honestly, I discovered this problem exactly from
format_rem_add_lines_pair().  In my environment($fallback_encoding
= 'cp932'), some commitdiff shows garbled text only inside color-words
portions.

I added a reproduce process at the end of this message.

After my investigation, I thought format_rem_add_lines_pair() tries to
use split()/index()/substr()/etc against raw blob contents and that
produces funny characters.  These builtin functions should be used to
decoded string.

untabify() looks proper place for me to decode blob contents
beforehand, as it definitely is not to be used for binary contests
like images and compressed snapshot files.

I'm sure using to_utf8() in untabify() fixes this problem.  In fact,
there is also another similar problem in blame function that assumes
blob contents as if utf8 encoded:

    binmode $fd, ':utf8';

I personally consider text blob contents should be decoded as soon as
possible and esc_html() should not contain to_utf8(), but the
codeflow is slightly vast and I couldn't eliminate the possibility of
calls from somewhere else that does not care character encodings.

So yes, I would appreciate hearing your thoughts.


> Also, does it even "fix" the problem to use to_utf8() here in the
> first place?  Untabify is about aligning the character after a HT to
> multiple of 8 display position, so we'd want measure display width,
> which is not the same as either byte count or char count.

Following is a reproduce process:

    $ git --version
        git version 2.17.0

    $ mkdir test
    $ cd test
    $ git init
    $ echo 'モバイル' | iconv -f UTF-8 -t Shift_JIS > dummy
    $ git add .
    $ git commit -m 'init'
    $ echo 'インスタント' | iconv -f UTF-8 -t Shift_JIS > dummy
    $ git commit -am 'change'
    $ git instaweb
    $ echo 'our $fallback_encoding = "cp932";' >> .git/gitweb/gitweb_config.perl
    $ w3m -dump 'http://127.0.0.1:1234/?p=.git;a=commitdiff'

What I got:

    gitprojects / .git / commitdiff
    [commit   ] ? search: [                    ] [ ]re
    summary | shortlog | log | commit | commitdiff | tree
    raw | patch | inline | side by side (parent: 79e26fe)
    change master

    author    Shin Kojima <shin@kojima.org>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    committer Shin Kojima <shin@kojima.org>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)

    dummy  patch | blob | history


    diff --git a/dummy b/dummy
    index ac37f38..31fb96a 100644 (file)
    --- a/dummy
    +++ b/dummy
    @@ -1 +1 @@
    -coイル
    +Cンスタント
    Unnamed repository; edit this file 'description' to name the repository.
    RSS Atom

What I expected:

    gitprojects / .git / commitdiff
    [commit   ] ? search: [                    ] [ ]re
    summary | shortlog | log | commit | commitdiff | tree
    raw | patch | inline | side by side (parent: 79e26fe)
    change master

    author    Shin Kojima <shin@kojima.org>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    committer Shin Kojima <shin@kojima.org>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)

    dummy  patch | blob | history


    diff --git a/dummy b/dummy
    index ac37f38..31fb96a 100644 (file)
    --- a/dummy
    +++ b/dummy
    @@ -1 +1 @@
    -モバイル
    +インスタント
    Unnamed repository; edit this file 'description' to name the repository.
    RSS Atom


-- 
Shin Kojima

  reply	other threads:[~2018-05-02 11:47 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-01  6:40 [PATCH] gitweb: Measure offsets against UTF-8 flagged string Shin Kojima
2018-05-02  8:01 ` Junio C Hamano
2018-05-02 11:47   ` Shin Kojima [this message]
2018-05-03 12:40   ` Jakub Narebski
2018-05-03 15:16     ` Shin Kojima
2018-05-04  2:38     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180502114743.cmetkapax2y5mipj@skmbp \
    --to=shin@kojima.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).