git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [bug] git diff --word-diff gives wrong result for utf-8 chinese
@ 2022-11-29  3:46 Ping Yin
  2022-11-29  3:49 ` Ping Yin
  2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 12+ messages in thread
From: Ping Yin @ 2022-11-29  3:46 UTC (permalink / raw)
  To: mailinggit list

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]

Result of "git diff"

-  为1
+  为2

or (if chinese can not be displayed correctly)

-  <E4><B8><BA>1
+  <E4><B8><BA>2

Actual result of "git diff --color-words"

<E4><B8>[-<BA>1-]{+<BA>2+}

Expected result of "git diff --color-words"

为[-1-]{+2+}

or (if chinese can not be displayed correctly)

<E4><B8><BA>[-1-]{+2+}


Ping Yin

[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 5682 bytes --]

[-- Attachment #3: image.png --]
[-- Type: image/png, Size: 9992 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29  3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
@ 2022-11-29  3:49 ` Ping Yin
  2022-11-29  8:18   ` Bagas Sanjaya
  2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 12+ messages in thread
From: Ping Yin @ 2022-11-29  3:49 UTC (permalink / raw)
  To: mailinggit list

sorry, typo, s/--color-words/--word-diff/g

Ping Yin

On Tue, Nov 29, 2022 at 11:46 AM Ping Yin <pkufranky@gmail.com> wrote:
>
> Result of "git diff"
>
> -  为1
> +  为2
>
> or (if chinese can not be displayed correctly)
>
> -  <E4><B8><BA>1
> +  <E4><B8><BA>2
>
> Actual result of "git diff --color-words"
>
> <E4><B8>[-<BA>1-]{+<BA>2+}
>
> Expected result of "git diff --color-words"
>
> 为[-1-]{+2+}
>
> or (if chinese can not be displayed correctly)
>
> <E4><B8><BA>[-1-]{+2+}
>
>
> Ping Yin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29  3:49 ` Ping Yin
@ 2022-11-29  8:18   ` Bagas Sanjaya
  0 siblings, 0 replies; 12+ messages in thread
From: Bagas Sanjaya @ 2022-11-29  8:18 UTC (permalink / raw)
  To: Ping Yin, mailinggit list

On 11/29/22 10:49, Ping Yin wrote:
> sorry, typo, s/--color-words/--word-diff/g
> 

Hi, welcome to Git mailing list!

Please remind yourself:

  * Do not send HTML mails, send plain-text ones instead. Many mailing
    lists (including vger.kernel.org that powers Git ML) reject HTML
    emails for these are likely spam. Make sure your email isn't mangled
    (tabs and spaces as-is, no line wrapping).

  * Do not top-post, reply inline with appropriate context instead. I
    have to cut the reply context as a result.

  * When you submit a patch and people reply with their reviews, engage
    with them (either sending revised patch addressing the reviews or
    reply with justification). They will ignore you if you ignore them.

Thanks.

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29  3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
  2022-11-29  3:49 ` Ping Yin
@ 2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
  2022-11-29 11:32   ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-11-29 10:52 UTC (permalink / raw)
  To: Ping Yin; +Cc: mailinggit list


On Tue, Nov 29 2022, Ping Yin wrote:

> Result of "git diff"
>
> -  为1
> +  为2
>
> or (if chinese can not be displayed correctly)
>
> -  <E4><B8><BA>1
> +  <E4><B8><BA>2
>
> Actual result of "git diff --color-words"
>
> <E4><B8>[-<BA>1-]{+<BA>2+}
>
> Expected result of "git diff --color-words"
>
> 为[-1-]{+2+}
>
> or (if chinese can not be displayed correctly)

I think we could provide new ways to do per-language diffs, right now
you can use --word-diff-regex, but it would be handy to e.g. have a
built-in collection of those (or other non-regex boundary algorithms)
for Chinese etc.

But as for considering this a bug, or changing the existing behavior I
think we'd need to deal with:

 * We (approximately) split on space now, which is certainly
   ASCII-biased, and outside of CJK fairly somewhat universal.

 * If we're going to split on "real words" in some cross-language aware
   way, are we going to run into conflicts between what different
   languages would consider sensible rules?

 * We probably don't want to make the "diff" dependent on the user's
   locale, but e.g. saying "I want a Chinese diff" via a CLI option
   would be OK.

 * Even for say Chinese, there's probably interesting edge cases when
   it's combined with other languages or character sets (e.g. Chinese +
   HTML).




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
@ 2022-11-29 11:32   ` Junio C Hamano
  2022-11-29 18:23     ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2022-11-29 11:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Ping Yin, mailinggit list

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> or (if chinese can not be displayed correctly)
>>
>> -  <E4><B8><BA>1
>> +  <E4><B8><BA>2
>>
>> Actual result of "git diff --color-words"
>>
>> <E4><B8>[-<BA>1-]{+<BA>2+}
>> ...
> I think we could provide new ways to do per-language diffs, right now
> you can use --word-diff-regex, but it would be handy to e.g. have a
> built-in collection of those (or other non-regex boundary algorithms)
> for Chinese etc.

I think you are thinking it with unnecessaarily complexity.  

The only thing that needs noticing in the above example, I think is,
that the three-byte sequence E4-B8-BA in the example is supposed to
be a single unicode character, and the actual result depicted can
happen only if we (incorrectly) chomp that single character in the
middle.

No matter what language we are using, we shouldn't do that.

I suspect that "--word-diff" internal is not even aware what a
character is, but if you assume UTF-8 (precomposed), then you should
be able to tell where the character boundary is by only looking at
the high-bit patterns to avoid producing such an output.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29 11:32   ` Junio C Hamano
@ 2022-11-29 18:23     ` Jeff King
  2022-11-29 18:54       ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2022-11-29 18:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Ping Yin, mailinggit list

On Tue, Nov 29, 2022 at 08:32:58PM +0900, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> >> or (if chinese can not be displayed correctly)
> >>
> >> -  <E4><B8><BA>1
> >> +  <E4><B8><BA>2
> >>
> >> Actual result of "git diff --color-words"
> >>
> >> <E4><B8>[-<BA>1-]{+<BA>2+}
> >> ...
> > I think we could provide new ways to do per-language diffs, right now
> > you can use --word-diff-regex, but it would be handy to e.g. have a
> > built-in collection of those (or other non-regex boundary algorithms)
> > for Chinese etc.
> 
> I think you are thinking it with unnecessaarily complexity.  
> 
> The only thing that needs noticing in the above example, I think is,
> that the three-byte sequence E4-B8-BA in the example is supposed to
> be a single unicode character, and the actual result depicted can
> happen only if we (incorrectly) chomp that single character in the
> middle.
> 
> No matter what language we are using, we shouldn't do that.
> 
> I suspect that "--word-diff" internal is not even aware what a
> character is, but if you assume UTF-8 (precomposed), then you should
> be able to tell where the character boundary is by only looking at
> the high-bit patterns to avoid producing such an output.

Agreed that we should probably avoid breaking characters. But what
puzzles me more is that we break it between B8 and BA, and not
elsewhere. Why not between E4 and B8? Why not between BA and "1"?

If the rule is "break on ascii whitespace", then I'd have expected the
whole four-character sequence to be taken as a unit. In other words, it
does should not have to care that a character is, as long as the bytes
for space characters cannot appear inside other characters (which is
true of utf8).

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29 18:23     ` Jeff King
@ 2022-11-29 18:54       ` Jeff King
  2022-12-01  7:08         ` Ping Yin
  2022-12-01  7:33         ` Ping Yin
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2022-11-29 18:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Ping Yin, mailinggit list

On Tue, Nov 29, 2022 at 01:23:27PM -0500, Jeff King wrote:

> > I suspect that "--word-diff" internal is not even aware what a
> > character is, but if you assume UTF-8 (precomposed), then you should
> > be able to tell where the character boundary is by only looking at
> > the high-bit patterns to avoid producing such an output.
> 
> Agreed that we should probably avoid breaking characters. But what
> puzzles me more is that we break it between B8 and BA, and not
> elsewhere. Why not between E4 and B8? Why not between BA and "1"?
> 
> If the rule is "break on ascii whitespace", then I'd have expected the
> whole four-character sequence to be taken as a unit. In other words, it
> does should not have to care that a character is, as long as the bytes
> for space characters cannot appear inside other characters (which is
> true of utf8).

Even more puzzling is that it produces the expected output for me:

  [note that \x is a bash-ism]
  $ printf '\xe4\xb8\xba1' >one
  $ printf '\xe4\xb8\xba2' >two
  $ git diff --no-index --word-diff one two
  diff --git a/one b/two
  index 9ae469fc41..576e6e32d8 100644
  --- a/one
  +++ b/two
  @@ -1 +1 @@
  [-为1-]{+为2+}

I wonder if OP has diff.wordRegex config (or attributes triggering a
diff.*.wordRegex) that is doing something else.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29 18:54       ` Jeff King
@ 2022-12-01  7:08         ` Ping Yin
  2022-12-01  7:33         ` Ping Yin
  1 sibling, 0 replies; 12+ messages in thread
From: Ping Yin @ 2022-12-01  7:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	mailinggit list

> I wonder if OP has diff.wordRegex config (or attributes triggering a
> diff.*.wordRegex) that is doing something else.

Wow, you are right, sorry for the noise.

$ git config -l | grep word
diff.wordregex=[[:alnum:]_]+|[^[:space:]]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-11-29 18:54       ` Jeff King
  2022-12-01  7:08         ` Ping Yin
@ 2022-12-01  7:33         ` Ping Yin
  2022-12-01 14:51           ` Phillip Wood
  1 sibling, 1 reply; 12+ messages in thread
From: Ping Yin @ 2022-12-01  7:33 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	mailinggit list

> > If the rule is "break on ascii whitespace",

Is there a way to achieve this: break english by word, and break
chinese by utf-8 character

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-12-01  7:33         ` Ping Yin
@ 2022-12-01 14:51           ` Phillip Wood
  2022-12-01 15:51             ` Ping Yin
  2022-12-01 20:06             ` Jeff King
  0 siblings, 2 replies; 12+ messages in thread
From: Phillip Wood @ 2022-12-01 14:51 UTC (permalink / raw)
  To: Ping Yin, Jeff King
  Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason,
	mailinggit list

Hi Ping

On 01/12/2022 07:33, Ping Yin wrote:
>>> If the rule is "break on ascii whitespace",
> 
> Is there a way to achieve this: break english by word, and break
> chinese by utf-8 character

You could extend your current regex so that it matches whole utf-8 
codepoints which is what git does for the builtin userdiff regexes. I've 
not tested it but I think

git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf 
'[\xc0-\xff][\x80-\xbf]+')"

should work. The downside is that you end up with a .gitconfig that is 
not valid utf-8. Perhaps someone else has a clever idea to get around that.

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-12-01 14:51           ` Phillip Wood
@ 2022-12-01 15:51             ` Ping Yin
  2022-12-01 20:06             ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Ping Yin @ 2022-12-01 15:51 UTC (permalink / raw)
  To: phillip.wood
  Cc: Jeff King, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	mailinggit list

Ping Yin

On Thu, Dec 1, 2022 at 10:51 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Ping
>
> On 01/12/2022 07:33, Ping Yin wrote:
>
> git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf
> '[\xc0-\xff][\x80-\xbf]+')"
>
> should work. The downside is that you end up with a .gitconfig that is
> not valid utf-8. Perhaps someone else has a clever idea to get around that.

Wow, it works. Thanks very much.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
  2022-12-01 14:51           ` Phillip Wood
  2022-12-01 15:51             ` Ping Yin
@ 2022-12-01 20:06             ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2022-12-01 20:06 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ping Yin, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	mailinggit list

On Thu, Dec 01, 2022 at 02:51:29PM +0000, Phillip Wood wrote:

> On 01/12/2022 07:33, Ping Yin wrote:
> > > > If the rule is "break on ascii whitespace",
> > 
> > Is there a way to achieve this: break english by word, and break
> > chinese by utf-8 character
> 
> You could extend your current regex so that it matches whole utf-8
> codepoints which is what git does for the builtin userdiff regexes. I've not
> tested it but I think
> 
> git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf
> '[\xc0-\xff][\x80-\xbf]+')"
> 
> should work. The downside is that you end up with a .gitconfig that is not
> valid utf-8. Perhaps someone else has a clever idea to get around that.

I think in more advanced regular expression engines you can do stuff
like matching "[\x{4e00}-\x{9fcc}]", or even "\p{Han}". But I don't know
that the stock libc regex is capable of anything like this, even with
EREs. That's the only option Git provides for matching word regexes, but
in theory we could support libpcre. We already can optionally build
against it; we would just need config/plumbing to get it into
diff.c:find_word_boundaries().

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-12-01 20:07 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-29  3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
2022-11-29  3:49 ` Ping Yin
2022-11-29  8:18   ` Bagas Sanjaya
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
2022-11-29 11:32   ` Junio C Hamano
2022-11-29 18:23     ` Jeff King
2022-11-29 18:54       ` Jeff King
2022-12-01  7:08         ` Ping Yin
2022-12-01  7:33         ` Ping Yin
2022-12-01 14:51           ` Phillip Wood
2022-12-01 15:51             ` Ping Yin
2022-12-01 20:06             ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).