[BUG] Unicode filenames handling in `git log --stat`

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [BUG] Unicode filenames handling in `git log --stat`
@ 2022-08-09 13:11 Alexander Meshcheryakov
  2022-08-09 18:20 ` Calvin Wan
                   ` (7 more replies)
  0 siblings, 8 replies; 42+ messages in thread
From: Alexander Meshcheryakov @ 2022-08-09 13:11 UTC (permalink / raw)
  To: git

Thank you for filling out a Git bug report!
Please answer the following questions to help us understand your issue.

What did you do before the bug happened? (Steps to reproduce your issue)
touch Kyiv.txt Odesa.txt
git add -A
git commit -m 'Proper column widths'
touch Київ.txt Одеса.txt
git add -A
git commit -m 'Improper unicode width'
git log --stat

What did you expect to happen? (Expected behavior)
Stats column for added/removed lines should be properly aligned.

What happened instead? (Actual behavior)
Only changes for ASCII filenames are properly aligned.
Here is how stats look for ASCII filenames:
Kyiv.txt  | 0
Odesa.txt | 0

Compare with unicode filenames. I change actual letters to ASCII X to
avoid the issue in letter formatting:
XXXX.txt   | 0
XXXXX.txt | 0

What's different between what you expected and what actually happened?
See above

Anything else you want to add:
Looks like width of unicode strings is incorrectly calculated when
formatting log --stat output. It considers number of bytes as number
of characters in the string, but this is not correct for unicode
strings.

Please review the rest of the bug report below.
You can delete any lines you don't wish to share.

[System Info]
git version:
git version 2.34.1
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.15.0-40-generic #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC
2022 x86_64
compiler info: gnuc: 11.2
libc info: glibc: 2.35
$SHELL (typically, interactive shell): /bin/bash

[Enabled Hooks]
not run from a git repository - no hooks to show

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
@ 2022-08-09 18:20 ` Calvin Wan
  2022-08-09 19:03   ` Alexander Meshcheryakov
  2022-08-10  5:55   ` Junio C Hamano
  2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 42+ messages in thread
From: Calvin Wan @ 2022-08-09 18:20 UTC (permalink / raw)
  To: Alexander Meshcheryakov; +Cc: Calvin Wan, git

Hi Alexander,

Thank you for the report! I attempted to reproduce with the steps you
provided, but was unable to do so. What commands would I have to run
on a clean git repository to reproduce this?

- Calvin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-09 18:20 ` Calvin Wan
@ 2022-08-09 19:03   ` Alexander Meshcheryakov
  2022-08-09 21:36     ` Calvin Wan
  2022-08-10  5:55   ` Junio C Hamano
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Meshcheryakov @ 2022-08-09 19:03 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git

Hi Calvin,

Sure, let me demonstrate with clean git repo:

mkdir git_test; cd git_test
git init
touch Київ.txt Kyiv.txt Маріуполь.txt Mariupol.txt
git add -A
git commit -m 'foobar'

Now let's check with GNU awk `git log --stat` strings width in bytes:
$ git log --stat | LC_ALL=C awk '/txt/{print length($0), $0}'
27  Kyiv.txt               | 0
27  Mariupol.txt           | 0
27  Київ.txt           | 0
27  Маріуполь.txt | 0

And strings width in unicode characters:
$ git log --stat | LC_ALL=en_US.UTF-8 awk '/txt/{print length($0), $0}'
27  Kyiv.txt               | 0
27  Mariupol.txt           | 0
23  Київ.txt           | 0
18  Маріуполь.txt | 0

See, all lines are aligned to have length 27 bytes. But on the screen
this looks distorted because length in characters differs.

On Tue, 9 Aug 2022 at 22:20, Calvin Wan <calvinwan@google.com> wrote:
>
> Hi Alexander,
>
> Thank you for the report! I attempted to reproduce with the steps you
> provided, but was unable to do so. What commands would I have to run
> on a clean git repository to reproduce this?
>
> - Calvin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-09 19:03   ` Alexander Meshcheryakov
@ 2022-08-09 21:36     ` Calvin Wan
  0 siblings, 0 replies; 42+ messages in thread
From: Calvin Wan @ 2022-08-09 21:36 UTC (permalink / raw)
  To: Alexander Meshcheryakov; +Cc: git

Thank you for the additional information! I have reproduced the
issue after setting 'core.quotepath' to false. Are you interested in
submitting a patch to fix this? No pressure, but if you are I can
walk you through the process. If not I can take it from here :)

Calvin

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-09 18:20 ` Calvin Wan
  2022-08-09 19:03   ` Alexander Meshcheryakov
@ 2022-08-10  5:55   ` Junio C Hamano
  2022-08-10  8:40     ` Torsten Bögershausen
  1 sibling, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-08-10  5:55 UTC (permalink / raw)
  To: Calvin Wan; +Cc: Alexander Meshcheryakov, git

Calvin Wan <calvinwan@google.com> writes:

> Hi Alexander,
>
> Thank you for the report! I attempted to reproduce with the steps you
> provided, but was unable to do so. What commands would I have to run
> on a clean git repository to reproduce this?

Sounds like a symptom observable when the width computed by
utf8.c::git_gcwidth(), using the width table imported from
unicode.org, and the width the terminal thinks each of the displayed
character has, do not match (e.g. seen when ambiguous characters are
involved, https://unicode.org/reports/tr11/#Ambiguous).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-10  5:55   ` Junio C Hamano
@ 2022-08-10  8:40     ` Torsten Bögershausen
  2022-08-10  8:56       ` Alexander Meshcheryakov
  2022-08-10 15:53       ` Junio C Hamano
  0 siblings, 2 replies; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-10  8:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Calvin Wan, Alexander Meshcheryakov, git

On Tue, Aug 09, 2022 at 10:55:31PM -0700, Junio C Hamano wrote:
> Calvin Wan <calvinwan@google.com> writes:
>
> > Hi Alexander,
> >
> > Thank you for the report! I attempted to reproduce with the steps you
> > provided, but was unable to do so. What commands would I have to run
> > on a clean git repository to reproduce this?
>
> Sounds like a symptom observable when the width computed by
> utf8.c::git_gcwidth(), using the width table imported from
> unicode.org, and the width the terminal thinks each of the displayed
> character has, do not match (e.g. seen when ambiguous characters are
> involved, https://unicode.org/reports/tr11/#Ambiguous).
>

I am not fully sure about that - I can reproduce it with Latin based
file names as well:

 git log --stat
[snip]
 Arger.txt  | 1 +
 Ärger.txt | 1 +
   2 files changed, 2 insertions(+)

From this very first experiment I would suspect that we use
strlen() somewhere rather then utf8.c::git_gcwidth()

More digging needed (but I don't promise anything today)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-10  8:40     ` Torsten Bögershausen
@ 2022-08-10  8:56       ` Alexander Meshcheryakov
  2022-08-10  9:51         ` Torsten Bögershausen
  2022-08-10 15:53       ` Junio C Hamano
  1 sibling, 1 reply; 42+ messages in thread
From: Alexander Meshcheryakov @ 2022-08-10  8:56 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Junio C Hamano, Calvin Wan, git

I believe I have found exact place where strlen is used incorrectly
This is at diff.c:show_stats

https://github.com/git/git/blob/c50926e1f48891e2671e1830dbcd2912a4563450/diff.c#L2623

It probably should be replaced with one of utf8_width, utf8_strnwidth
or utf8_strwidth from utf8.c


On Wed, 10 Aug 2022 at 12:40, Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Tue, Aug 09, 2022 at 10:55:31PM -0700, Junio C Hamano wrote:
> > Calvin Wan <calvinwan@google.com> writes:
> >
> > > Hi Alexander,
> > >
> > > Thank you for the report! I attempted to reproduce with the steps you
> > > provided, but was unable to do so. What commands would I have to run
> > > on a clean git repository to reproduce this?
> >
> > Sounds like a symptom observable when the width computed by
> > utf8.c::git_gcwidth(), using the width table imported from
> > unicode.org, and the width the terminal thinks each of the displayed
> > character has, do not match (e.g. seen when ambiguous characters are
> > involved, https://unicode.org/reports/tr11/#Ambiguous).
> >
>
> I am not fully sure about that - I can reproduce it with Latin based
> file names as well:
>
>  git log --stat
> [snip]
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>    2 files changed, 2 insertions(+)
>
> From this very first experiment I would suspect that we use
> strlen() somewhere rather then utf8.c::git_gcwidth()
>
> More digging needed (but I don't promise anything today)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-10  8:56       ` Alexander Meshcheryakov
@ 2022-08-10  9:51         ` Torsten Bögershausen
  2022-08-10 11:41           ` Torsten Bögershausen
  0 siblings, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-10  9:51 UTC (permalink / raw)
  To: Alexander Meshcheryakov; +Cc: Junio C Hamano, Calvin Wan, git

On Wed, Aug 10, 2022 at 12:56:11PM +0400, Alexander Meshcheryakov wrote:

Thanks for digging.

(And please, try to avoid top-posting here in this list)

> I believe I have found exact place where strlen is used incorrectly
> This is at diff.c:show_stats
>
> https://github.com/git/git/blob/c50926e1f48891e2671e1830dbcd2912a4563450/diff.c#L2623
>
> It probably should be replaced with one of utf8_width, utf8_strnwidth
> or utf8_strwidth from utf8.c

That did not help here. If I understand it right, this function is not at all involved
in our `git log --stat` ?

I tried this patch (not 100% git-style) and didn't see any print.


--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
                        continue;
			                }
					                fill_print_name(file);
							-               len = strlen(file->print_name);
							+               {
							+                       const char *cp = file->print_name;
							+                       size_t l = strlen(file->print_name);
							+                       len = utf8_width(&cp, &l);
							+                       fprintf(stderr, "%s/%s:%d file->print_name='%s' len=%lu\n",
							+                               __FILE__, __FUNCTION__, __LINE__,
							+                               file->print_name, (unsigned long)len);
							+               }
							                if (max_len < len)
									                        max_len = len;


And looking here, it seems as we are calculating max_len here.
Still more digging needed (but I don't promise anything today)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-10  9:51         ` Torsten Bögershausen
@ 2022-08-10 11:41           ` Torsten Bögershausen
  0 siblings, 0 replies; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-10 11:41 UTC (permalink / raw)
  To: Alexander Meshcheryakov; +Cc: Junio C Hamano, Calvin Wan, git

On Wed, Aug 10, 2022 at 11:51:58AM +0200, Torsten Bögershausen wrote:
> On Wed, Aug 10, 2022 at 12:56:11PM +0400, Alexander Meshcheryakov wrote:
>
> Thanks for digging.
>
> (And please, try to avoid top-posting here in this list)
>
> > I believe I have found exact place where strlen is used incorrectly
> > This is at diff.c:show_stats
> >
> > https://github.com/git/git/blob/c50926e1f48891e2671e1830dbcd2912a4563450/diff.c#L2623
> >
> > It probably should be replaced with one of utf8_width, utf8_strnwidth
> > or utf8_strwidth from utf8.c
>

Please forget what I wrote earlier - I was running the wrong `git` binary :-(
Sorry for the noise.

I can probably do more testing soonish.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-10  8:40     ` Torsten Bögershausen
  2022-08-10  8:56       ` Alexander Meshcheryakov
@ 2022-08-10 15:53       ` Junio C Hamano
  2022-08-10 17:35         ` Torsten Bögershausen
  1 sibling, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-08-10 15:53 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Calvin Wan, Alexander Meshcheryakov, git

Torsten Bögershausen <tboegi@web.de> writes:

>  git log --stat
> [snip]
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>    2 files changed, 2 insertions(+)
>
> From this very first experiment I would suspect that we use
> strlen() somewhere rather then utf8.c::git_gcwidth()

Yeah, that does sound like the case, and quite honestly, knowing
that the diffstat code is way older than unicode-width code, which
was added by you in mid 2014, I am not all that surprised if we used
to use strlen() throughout and we still do by mistake.

Thanks for a doze of sanity.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [BUG] Unicode filenames handling in `git log --stat`
  2022-08-10 15:53       ` Junio C Hamano
@ 2022-08-10 17:35         ` Torsten Bögershausen
  0 siblings, 0 replies; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-10 17:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Calvin Wan, Alexander Meshcheryakov, git

On Wed, Aug 10, 2022 at 08:53:28AM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> >  git log --stat
> > [snip]
> >  Arger.txt  | 1 +
> >  Ärger.txt | 1 +
> >    2 files changed, 2 insertions(+)
> >
> > From this very first experiment I would suspect that we use
> > strlen() somewhere rather then utf8.c::git_gcwidth()
>
> Yeah, that does sound like the case, and quite honestly, knowing
> that the diffstat code is way older than unicode-width code, which
> was added by you in mid 2014, I am not all that surprised if we used
> to use strlen() throughout and we still do by mistake.
>
> Thanks for a doze of sanity.

Some 2 updates here:
- The strlen() needs a replacement.
  It looks as if the following patch helps:

/* somewhere in diff.c */
static size_t screen_utf8_width(const char *start)
{
       const char *cp = start;
       size_t remain = strlen(start);
       size_t width = 0;

       while (remain) {
               int n = utf8_width(&cp, &remain);
               if (n < 0)
                       return strlen(start); /* not UTF-8 ? Use strlen() */
               width += n;
       }
       return width;
}

@@ -2620,7 +2635,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
                        continue;
			                }
					                fill_print_name(file);
							-               len = strlen(file->print_name);
							+               len = screen_utf8_width(file->print_name);
							                if (max_len < len)
									                        max_len = len;

@@ -2743,7 +2758,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
                 * "scale" the filename
		                  */
				                  len = name_width;
						  -               name_len = strlen(name);
						  +               name_len = screen_utf8_width(name);
						                  if (name_width < name_len) {


=====================================
Let's see if I can make a proper patch out of it.

The second problem, and I hoped it wasn't, seems to be related to what
you had digged out earlier.

>Sounds like a symptom observable when the width computed by
>utf8.c::git_gcwidth(), using the width table imported from
>unicode.org, and the width the terminal thinks each of the displayed
>character has, do not match (e.g. seen when ambiguous characters are
>involved, https://unicode.org/reports/tr11/#Ambiguous).

That needs a second patch, probably after some more digging,
how unicode is rendedered on the different systems

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
  2022-08-09 18:20 ` Calvin Wan
@ 2022-08-14 13:35 ` tboegi
  2022-08-14 23:12   ` Junio C Hamano
  2022-08-27  8:50 ` [PATCH v2 " tboegi
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: tboegi @ 2022-08-14 13:35 UTC (permalink / raw)
  To: git, alexander.s.m; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

$ git log --stat

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exist for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

And then there is another problem: code like this
strbuf_addf(&out, "%-*s", len, name);

(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count, if name is UTF-8 encoded.

We could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The choosen solution is to split code in diff.c like this

strbuf_addf(&out, "%-*s", len, name);

into 2 calls, like this:

strbuf_addf(&out, "%s", name);
if (len > utf8_strwidth(name))
    strbuf_addchars(&out, ' ', len - utf8_strwidth(name));

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..7fb254c545 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2734,6 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
+		size_t num_padding_spaces = 0;
 		int name_len;

 		if (!file->is_interesting && (added + deleted == 0))
@@ -2743,7 +2744,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
@@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		if (len > utf8_strwidth(name))
+			num_padding_spaces = len - utf8_strwidth(name);

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addf(&out, "| %*s", number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addstr(&out, "| Unmerged\n");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
+		strbuf_addf(&out, " %s%s ", prefix,  name);
+		if (num_padding_spaces)
+			strbuf_addchars(&out, ' ', num_padding_spaces);
+		strbuf_addf(&out, "| %*"PRIuMAX"%s",
 			number_width, added + deleted,
 			added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
@ 2022-08-14 23:12   ` Junio C Hamano
  2022-08-15  6:34     ` Torsten Bögershausen
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-08-14 23:12 UTC (permalink / raw)
  To: tboegi; +Cc: git, alexander.s.m

tboegi@web.de writes:

> The choosen solution is to split code in diff.c like this
>
> strbuf_addf(&out, "%-*s", len, name);
>
> into 2 calls, like this:
>
> strbuf_addf(&out, "%s", name);
> if (len > utf8_strwidth(name))
>     strbuf_addchars(&out, ' ', len - utf8_strwidth(name));

Makes sense.  Is utf8_strwidth(name) cheap enough that we can call
it twice in a row on the same string casually, or should we avoid it
with a new variable?

It might be worth doing a helper function, even?

	static inline strbuf_pad(struct strbuf *out, const char *s, size_t width)
	{
		size_t w = utf8_strwidth(s);

		strbuf_addstr(out, s);
		if (w < width)
			strbuf_addchars(out, ' ', width - w);
	}

Other than that, sounds very sensible.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-14 23:12   ` Junio C Hamano
@ 2022-08-15  6:34     ` Torsten Bögershausen
  2022-08-18 21:00       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-15  6:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, alexander.s.m

On Sun, Aug 14, 2022 at 04:12:12PM -0700, Junio C Hamano wrote:
> tboegi@web.de writes:
>
> > The choosen solution is to split code in diff.c like this
> >
> > strbuf_addf(&out, "%-*s", len, name);
> >
> > into 2 calls, like this:
> >
> > strbuf_addf(&out, "%s", name);
> > if (len > utf8_strwidth(name))
> >     strbuf_addchars(&out, ' ', len - utf8_strwidth(name));
>
> Makes sense.  Is utf8_strwidth(name) cheap enough that we can call
> it twice in a row on the same string casually, or should we avoid it
> with a new variable?
>
> It might be worth doing a helper function, even?
>
> 	static inline strbuf_pad(struct strbuf *out, const char *s, size_t width)
> 	{
> 		size_t w = utf8_strwidth(s);
>
> 		strbuf_addstr(out, s);
> 		if (w < width)
> 			strbuf_addchars(out, ' ', width - w);
> 	}
>
> Other than that, sounds very sensible.
>

Thanks for the review.

Actually, the commit message is wrong - after writing it, the code
was changed into

if (len > utf8_strwidth(name))
        num_padding_spaces = len - utf8_strwidth(name);

and later

if (num_padding_spaces)
	strbuf_addchars(&out, ' ', num_padding_spaces);

(And having written this, there is probably room for test cases,
IOW: a V2 will come the next days)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-15  6:34     ` Torsten Bögershausen
@ 2022-08-18 21:00       ` Junio C Hamano
  0 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2022-08-18 21:00 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m

Torsten Bögershausen <tboegi@web.de> writes:

> (And having written this, there is probably room for test cases,
> IOW: a V2 will come the next days)

Yeah, that all sounds sensible.

Thanks for working on this.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
  2022-08-09 18:20 ` Calvin Wan
  2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
@ 2022-08-27  8:50 ` tboegi
  2022-08-27  8:54   ` Torsten Bögershausen
  2022-08-29 12:04   ` Johannes Schindelin
  2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 42+ messages in thread
From: tboegi @ 2022-08-27  8:50 UTC (permalink / raw)
  To: git, alexander.s.m; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

$ git log --stat

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exist for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

And then there is another problem: code like this
strbuf_addf(&out, "%-*s", len, name);

(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count, if name is UTF-8 encoded.

We could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The choosen solution is to split code in diff.c like this

strbuf_addf(&out, "%-*s", len, name);

into something like this:

size_t num_padding_spaces = 0;
// [snip]
if (len > utf8_strwidth(name))
    num_padding_spaces = len - utf8_strwidth(name);
strbuf_addf(&out, "%s", name);
if (num_padding_spaces)
    strbuf_addchars(&out, ' ', num_padding_spaces);

Tests:
Two things need to be tested:
- The calculation of the maximum width
- The calculation of num_padding_spaces

The name "textfile" is changed into "textfilë", both have a width of 8.
If strlen() was used, to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile   |  [snip]
 textfilë | [snip]

If only "binfile" would be renamed into "binfilë":
 binfilë |  [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
"textfile" is renamed into "textfilë", 1 byte more in strlen(),
and the updated t4012-diff-binary.sh checks the correct aligment:
 binfïlë  | [snip]
 textfilë | [snip]

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 37 +++++++++++++++++++++++--------------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..cf38e1dc88 100644
--- a/diff.c
+++ b/diff.c
@@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
 static void show_stats(struct diffstat_t *data, struct diff_options *options)
 {
 	int i, len, add, del, adds = 0, dels = 0;
-	uintmax_t max_change = 0, max_len = 0;
+	uintmax_t max_change = 0, max_width = 0;
 	int total_files = data->nr, count;
 	int width, name_width, graph_width, number_width = 0, bin_width = 0;
 	const char *reset, *add_c, *del_c;
@@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
-		if (max_len < len)
-			max_len = len;
+		len = utf8_strwidth(file->print_name);
+		if (max_width < len)
+			max_width = len;

 		if (file->is_unmerged) {
 			/* "Unmerged" is 8 characters */
@@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)

 	/*
 	 * We have width = stat_width or term_columns() columns total.
-	 * We want a maximum of min(max_len, stat_name_width) for the name part.
+	 * We want a maximum of min(max_width, stat_name_width) for the name part.
 	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
 	 * We also need 1 for " " and 4 + decimal_width(max_change)
 	 * for " | NNNN " and one the empty column at the end, altogether
@@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		graph_width = options->stat_graph_width;

 	name_width = (options->stat_name_width > 0 &&
-		      options->stat_name_width < max_len) ?
-		options->stat_name_width : max_len;
+		      options->stat_name_width < max_width) ?
+		options->stat_name_width : max_width;

 	/*
 	 * Adjust adjustable widths not to exceed maximum width
@@ -2734,6 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
+		size_t num_padding_spaces = 0;
 		int name_len;

 		if (!file->is_interesting && (added + deleted == 0))
@@ -2743,7 +2744,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
@@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		if (len > utf8_strwidth(name))
+			num_padding_spaces = len - utf8_strwidth(name);

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addf(&out, "| %*s", number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addstr(&out, "| Unmerged\n");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
+		strbuf_addf(&out, " %s%s ", prefix,  name);
+		if (num_padding_spaces)
+			strbuf_addchars(&out, ' ', num_padding_spaces);
+		strbuf_addf(&out, "| %*"PRIuMAX"%s",
 			number_width, added + deleted,
 			added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
index c509143c81..2d49de01c8 100755
--- a/t/t4012-diff-binary.sh
+++ b/t/t4012-diff-binary.sh
@@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
 '

 cat >expect <<EOF
- binfile  |   Bin 0 -> 1026 bytes
- textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ binfïlë  |   Bin 0 -> 1026 bytes
+ textfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 EOF

 test_expect_success 'diff --stat with binary files and big change count' '
-	printf "\01\00%1024d" 1 >binfile &&
-	git add binfile &&
+	printf "\01\00%1024d" 1 >binfïlë &&
+	git add binfïlë &&
 	i=0 &&
 	while test $i -lt 10000; do
 		echo $i &&
 		i=$(($i + 1)) || return 1
-	done >textfile &&
-	git add textfile &&
-	git diff --cached --stat binfile textfile >output &&
+	done >textfilë &&
+	git add textfilë &&
+	git -c core.quotepath=false diff --cached --stat binfïlë textfilë >output &&
 	grep " | " output >actual &&
 	test_cmp expect actual
 '
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-27  8:50 ` [PATCH v2 " tboegi
@ 2022-08-27  8:54   ` Torsten Bögershausen
  2022-08-27  9:50     ` Eric Sunshine
  2022-08-29 12:04   ` Johannes Schindelin
  1 sibling, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-27  8:54 UTC (permalink / raw)
  To: git, alexander.s.m

On Sat, Aug 27, 2022 at 10:50:07AM +0200, tboegi@web.de wrote:
> From: Torsten Bögershausen <tboegi@web.de>
>


> b) The same problem exist for all unicode code points (so we

That should be "exists". Let's see if there are more comments,
before sending a new patch.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-27  8:54   ` Torsten Bögershausen
@ 2022-08-27  9:50     ` Eric Sunshine
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Sunshine @ 2022-08-27  9:50 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Git List, alexander.s.m

On Sat, Aug 27, 2022 at 4:58 AM Torsten Bögershausen <tboegi@web.de> wrote:
> On Sat, Aug 27, 2022 at 10:50:07AM +0200, tboegi@web.de wrote:
> > From: Torsten Bögershausen <tboegi@web.de>
> > b) The same problem exist for all unicode code points (so we
>
> That should be "exists". Let's see if there are more comments,
> before sending a new patch.

Here's one:

> The choosen solution is to split code in diff.c like this

s/choosen/chosen/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-27  8:50 ` [PATCH v2 " tboegi
  2022-08-27  8:54   ` Torsten Bögershausen
@ 2022-08-29 12:04   ` Johannes Schindelin
  2022-08-29 17:54     ` Torsten Bögershausen
  1 sibling, 1 reply; 42+ messages in thread
From: Johannes Schindelin @ 2022-08-29 12:04 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m

[-- Attachment #1: Type: text/plain, Size: 10474 bytes --]

Hi Torsten,

On Sat, 27 Aug 2022, tboegi@web.de wrote:

> From: Torsten Bögershausen <tboegi@web.de>
>
> When unicode filenames (encoded in UTF-8) are used, the visible width
> on the screen is not the same as strlen(filename).
>
> For example, `git log --stat` may produce an output like this:
>
> $ git log --stat
>
> [snip the header]
>
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>  2 files changed, 2 insertions(+)
>
> A side note: the original report was about cyrillic filenames.
> After some investigations it turned out that
> a) This is not a problem with "ambiguous characters" in unicode
> b) The same problem exist for all unicode code points (so we
>   can use Latin based Umlauts for demonstrations below)
>
> The 'Ä' takes the same space on the screen as the 'A'.
> But needs one more byte in memory, so the the `git log --stat` output
> for "Arger.txt" (!) gets mis-aligned:
> The maximum length is derived from "Ärger.txt", 10 bytes in memory,
> 9 positions on the screen. That is why "Arger.txt" gets one extra ' '
> for aligment, it needs 9 bytes in memory.
> If there was a file "Ö", it would be correctly aligned by chance,
> but "Öhö" would not.
>
> The solution is of course, to use utf8_strwidth() instead of strlen()
> when dealing with the width on screen.
>
> And then there is another problem: code like this
> strbuf_addf(&out, "%-*s", len, name);
>
> (or using the underlying snprintf() function) does not align the
> buffer to a minimum of len measured in screen-width, but uses the
> memory count, if name is UTF-8 encoded.
>
> We could be tempted to wish that snprintf() was UTF-8 aware.
> That doesn't seem to be the case anywhere (tested on Linux and Mac),
> probably snprintf() uses the "bytes in memory"/strlen() approach to be
> compatible with older versions and this will never change.

An interesting read so, far, but...

>
> The choosen solution is to split code in diff.c like this
>
> strbuf_addf(&out, "%-*s", len, name);
>
> into something like this:
>
> size_t num_padding_spaces = 0;
> // [snip]
> if (len > utf8_strwidth(name))
>     num_padding_spaces = len - utf8_strwidth(name);
> strbuf_addf(&out, "%s", name);
> if (num_padding_spaces)
>     strbuf_addchars(&out, ' ', num_padding_spaces);

... this sounds like it would benefit from beinv refactored into a
separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
readability as well as for self-documentation.

Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
_twice_ and why we do not assign the result to a variable called `width`
and then have a conditional like

	if (width < len) /* pad to `len` columns */
		strbuf_addchars(&out, ' ' , len - width);

instead. That would sound more logical to me.

Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
so different from changing `strbuf_addf(...)`, I would prefer to see them
split into two patches.

>
> Tests:
> Two things need to be tested:
> - The calculation of the maximum width
> - The calculation of num_padding_spaces
>
> The name "textfile" is changed into "textfilë", both have a width of 8.
> If strlen() was used, to get the maximum width, the shorter "binfile" would
> have been mis-aligned:
>  binfile   |  [snip]
>  textfilë | [snip]
>
> If only "binfile" would be renamed into "binfilë":
>  binfilë |  [snip]
>  textfile | [snip]
>
> In order to verify that the width is calculated correctly everywhere,
> "binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
> "textfile" is renamed into "textfilë", 1 byte more in strlen(),
> and the updated t4012-diff-binary.sh checks the correct aligment:
>  binfïlë  | [snip]
>  textfilë | [snip]

I wonder whether you can change only _one_ name and still verify the
correctness. When you make two changes at the same time, it is always
possible for one change to "cancel out" the other one, and therefore it is
harder to reason about the correctness of your patch.

Better keep it simple and change only one instance (personally, I would
have changed two letters in the longer one).

>
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c                 | 37 +++++++++++++++++++++++--------------
>  t/t4012-diff-binary.sh | 14 +++++++-------
>  2 files changed, 30 insertions(+), 21 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 974626a621..cf38e1dc88 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
>  static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  {
>  	int i, len, add, del, adds = 0, dels = 0;
> -	uintmax_t max_change = 0, max_len = 0;
> +	uintmax_t max_change = 0, max_width = 0;

Why rename `max_len`, but not `len`? I would have expected (and agreed
with seeing) `len` to be renamed to `width`, too.

>  	int total_files = data->nr, count;
>  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
>  	const char *reset, *add_c, *del_c;
> @@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> -		if (max_len < len)
> -			max_len = len;
> +		len = utf8_strwidth(file->print_name);
> +		if (max_width < len)
> +			max_width = len;
>
>  		if (file->is_unmerged) {
>  			/* "Unmerged" is 8 characters */
> @@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>
>  	/*
>  	 * We have width = stat_width or term_columns() columns total.
> -	 * We want a maximum of min(max_len, stat_name_width) for the name part.
> +	 * We want a maximum of min(max_width, stat_name_width) for the name part.
>  	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
>  	 * We also need 1 for " " and 4 + decimal_width(max_change)
>  	 * for " | NNNN " and one the empty column at the end, altogether
> @@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		graph_width = options->stat_graph_width;
>
>  	name_width = (options->stat_name_width > 0 &&
> -		      options->stat_name_width < max_len) ?
> -		options->stat_name_width : max_len;
> +		      options->stat_name_width < max_width) ?
> +		options->stat_name_width : max_width;

It is a bit sad that the diff lines regarding the renamed variable drown
out the actual change (`strlen()` -> `utf8_strwidth()`). But the end
result is nicer.

Thank you for working on this!
Dscho

>
>  	/*
>  	 * Adjust adjustable widths not to exceed maximum width
> @@ -2734,6 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		char *name = file->print_name;
>  		uintmax_t added = file->added;
>  		uintmax_t deleted = file->deleted;
> +		size_t num_padding_spaces = 0;
>  		int name_len;
>
>  		if (!file->is_interesting && (added + deleted == 0))
> @@ -2743,7 +2744,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> @@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		if (len > utf8_strwidth(name))
> +			num_padding_spaces = len - utf8_strwidth(name);
>
>  		if (file->is_binary) {
> -			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -			strbuf_addf(&out, " %*s", number_width, "Bin");
> +			strbuf_addf(&out, " %s%s ", prefix,  name);
> +			if (num_padding_spaces)
> +				strbuf_addchars(&out, ' ', num_padding_spaces);
> +			strbuf_addf(&out, "| %*s", number_width, "Bin");
>  			if (!added && !deleted) {
>  				strbuf_addch(&out, '\n');
>  				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
> @@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		else if (file->is_unmerged) {
> -			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -			strbuf_addstr(&out, " Unmerged\n");
> +			strbuf_addf(&out, " %s%s ", prefix,  name);
> +			if (num_padding_spaces)
> +				strbuf_addchars(&out, ' ', num_padding_spaces);
> +			strbuf_addstr(&out, "| Unmerged\n");
>  			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
>  					 out.buf, out.len, 0);
>  			strbuf_reset(&out);
> @@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  				add = total - del;
>  			}
>  		}
> -		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -		strbuf_addf(&out, " %*"PRIuMAX"%s",
> +		strbuf_addf(&out, " %s%s ", prefix,  name);
> +		if (num_padding_spaces)
> +			strbuf_addchars(&out, ' ', num_padding_spaces);
> +		strbuf_addf(&out, "| %*"PRIuMAX"%s",
>  			number_width, added + deleted,
>  			added + deleted ? " " : "");
>  		show_graph(&out, '+', add, add_c, reset);
> diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
> index c509143c81..2d49de01c8 100755
> --- a/t/t4012-diff-binary.sh
> +++ b/t/t4012-diff-binary.sh
> @@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
>  '
>
>  cat >expect <<EOF
> - binfile  |   Bin 0 -> 1026 bytes
> - textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> + binfïlë  |   Bin 0 -> 1026 bytes
> + textfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  EOF
>
>  test_expect_success 'diff --stat with binary files and big change count' '
> -	printf "\01\00%1024d" 1 >binfile &&
> -	git add binfile &&
> +	printf "\01\00%1024d" 1 >binfïlë &&
> +	git add binfïlë &&
>  	i=0 &&
>  	while test $i -lt 10000; do
>  		echo $i &&
>  		i=$(($i + 1)) || return 1
> -	done >textfile &&
> -	git add textfile &&
> -	git diff --cached --stat binfile textfile >output &&
> +	done >textfilë &&
> +	git add textfilë &&
> +	git -c core.quotepath=false diff --cached --stat binfïlë textfilë >output &&
>  	grep " | " output >actual &&
>  	test_cmp expect actual
>  '
> --
> 2.34.0
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-29 12:04   ` Johannes Schindelin
@ 2022-08-29 17:54     ` Torsten Bögershausen
  2022-08-29 18:37       ` Junio C Hamano
  2022-09-02  9:47       ` Johannes Schindelin
  0 siblings, 2 replies; 42+ messages in thread
From: Torsten Bögershausen @ 2022-08-29 17:54 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, alexander.s.m

On Mon, Aug 29, 2022 at 02:04:42PM +0200, Johannes Schindelin wrote:
> Hi Torsten,
> >
> > The choosen solution is to split code in diff.c like this
> >
> > strbuf_addf(&out, "%-*s", len, name);
> >
> > into something like this:
> >
> > size_t num_padding_spaces = 0;
> > // [snip]
> > if (len > utf8_strwidth(name))
> >     num_padding_spaces = len - utf8_strwidth(name);
> > strbuf_addf(&out, "%s", name);
> > if (num_padding_spaces)
> >     strbuf_addchars(&out, ' ', num_padding_spaces);
>
> ... this sounds like it would benefit from beinv refactored into a
> separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
> readability as well as for self-documentation.

Yes, but:
All (tm) strbuf() functions use an unsigned size_t, and are not
tolerant against passing 0 as "do nothing".
A nicer solution (for this patch) could be a change like this:
Instead of

void strbuf_addchars(struct strbuf *sb, int c, size_t n)
{
        strbuf_grow(sb, n);
	memset(sb->buf + sb->len, c, n);
	strbuf_setlen(sb, sb->len + n);
}

We would find:
void strbuf_addchars(struct strbuf *sb, int c, ssize_t n)
{
        if (n <= 0)
	       return;
        strbuf_grow(sb, (size_t)n);
	memset(sb->buf + sb->len, c, (size_t)n);
	strbuf_setlen(sb, sb->len + (size_t)n);
}

I couldn't convince myself to do so.
Since it is mainly diff.c that needs this adjustment/padding of strings,
I coulnd't convince myself to write another function in strbuf.c


>
> Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
> _twice_ and why we do not assign the result to a variable called `width`
> and then have a conditional like
>
> 	if (width < len) /* pad to `len` columns */
> 		strbuf_addchars(&out, ' ' , len - width);
>
> instead. That would sound more logical to me.

This is caused by the logic in diff.c:
  /*
   * Find the longest filename and max number of changes
   */
   for (i = 0; (i < count) && (i < data->nr); i++) {
       struct diffstat_file *file = data->files[i];
       [snip]
       len = utf8_strwidth(file->print_name);
       if (max_width < len)
          max_width = len;
// and later
    /*
     * From here name_width is the width of the name area,
     * and graph_width is the width of the graph area.
     * max_change is used to scale graph properly.
     */
    for (i = 0; i < count; i++) {
    /*
     * "scale" the filename
     */
     // TB: Which means either shortening it with ...
     // Or padding it, if needed, and here we need
     // another
     name_len = utf8_strwidth(name);

>
> Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
> so different from changing `strbuf_addf(...)`, I would prefer to see them
> split into two patches.

Hm, that is a possiblity. Seems to ease the burden for reviewers.

>
> >
> > Tests:
> > Two things need to be tested:
> > - The calculation of the maximum width
> > - The calculation of num_padding_spaces
> >
> > The name "textfile" is changed into "textfilë", both have a width of 8.
> > If strlen() was used, to get the maximum width, the shorter "binfile" would
> > have been mis-aligned:
> >  binfile   |  [snip]
> >  textfilë | [snip]
> >
> > If only "binfile" would be renamed into "binfilë":
> >  binfilë |  [snip]
> >  textfile | [snip]
> >
> > In order to verify that the width is calculated correctly everywhere,
> > "binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
> > "textfile" is renamed into "textfilë", 1 byte more in strlen(),
> > and the updated t4012-diff-binary.sh checks the correct aligment:
> >  binfïlë  | [snip]
> >  textfilë | [snip]
>
> I wonder whether you can change only _one_ name and still verify the
> correctness. When you make two changes at the same time, it is always
> possible for one change to "cancel out" the other one, and therefore it is
> harder to reason about the correctness of your patch.

Nee, I have a hard time to see how a +/- 1 can "cancel out" a +/- 2.
But I may improve the commit message, to make that more clear.

>
> Better keep it simple and change only one instance (personally,
> I would have changed two letters in the longer one).

That is certainly doable.


>
> >
> > Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> > Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> > ---
> >  diff.c                 | 37 +++++++++++++++++++++++--------------
> >  t/t4012-diff-binary.sh | 14 +++++++-------
> >  2 files changed, 30 insertions(+), 21 deletions(-)
> >
> > diff --git a/diff.c b/diff.c
> > index 974626a621..cf38e1dc88 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
> >  static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  {
> >  	int i, len, add, del, adds = 0, dels = 0;
> > -	uintmax_t max_change = 0, max_len = 0;
> > +	uintmax_t max_change = 0, max_width = 0;
>
> Why rename `max_len`, but not `len`? I would have expected (and agreed
> with seeing) `len` to be renamed to `width`, too.

That is a valid point.
There is, however, already a variable called "width".
And renaming this one into a new one as well ?

>
> >  	int total_files = data->nr, count;
> >  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
> >  	const char *reset, *add_c, *del_c;
> > @@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  			continue;
> >  		}
> >  		fill_print_name(file);
> > -		len = strlen(file->print_name);
> > -		if (max_len < len)
> > -			max_len = len;
> > +		len = utf8_strwidth(file->print_name);
> > +		if (max_width < len)
> > +			max_width = len;
> >
> >  		if (file->is_unmerged) {
> >  			/* "Unmerged" is 8 characters */
> > @@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >
> >  	/*
> >  	 * We have width = stat_width or term_columns() columns total.
> > -	 * We want a maximum of min(max_len, stat_name_width) for the name part.
> > +	 * We want a maximum of min(max_width, stat_name_width) for the name part.
> >  	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
> >  	 * We also need 1 for " " and 4 + decimal_width(max_change)
> >  	 * for " | NNNN " and one the empty column at the end, altogether
> > @@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  		graph_width = options->stat_graph_width;
> >
> >  	name_width = (options->stat_name_width > 0 &&
> > -		      options->stat_name_width < max_len) ?
> > -		options->stat_name_width : max_len;
> > +		      options->stat_name_width < max_width) ?
> > +		options->stat_name_width : max_width;
>
> It is a bit sad that the diff lines regarding the renamed variable drown
> out the actual change (`strlen()` -> `utf8_strwidth()`). But the end
> result is nicer.
>
> Thank you for working on this!
> Dscho

Thanks so much for the review - let's see if I can make a better patch
the next days (better say weeks)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-29 17:54     ` Torsten Bögershausen
@ 2022-08-29 18:37       ` Junio C Hamano
  2022-09-02  9:47       ` Johannes Schindelin
  1 sibling, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2022-08-29 18:37 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Johannes Schindelin, git, alexander.s.m

Torsten Bögershausen <tboegi@web.de> writes:

> This is caused by the logic in diff.c:
>   /*
>    * Find the longest filename and max number of changes
>    */
>    for (i = 0; (i < count) && (i < data->nr); i++) {
>        struct diffstat_file *file = data->files[i];
>        [snip]
>        len = utf8_strwidth(file->print_name);
>        if (max_width < len)
>           max_width = len;
> // and later
>     /*
>      * From here name_width is the width of the name area,
>      * and graph_width is the width of the graph area.
>      * max_change is used to scale graph properly.
>      */
>     for (i = 0; i < count; i++) {
>     /*
>      * "scale" the filename
>      */
>      // TB: Which means either shortening it with ...
>      // Or padding it, if needed, and here we need
>      // another
>      name_len = utf8_strwidth(name);
>
>>
>> Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
>> so different from changing `strbuf_addf(...)`, I would prefer to see them
>> split into two patches.
>
> Hm, that is a possiblity. Seems to ease the burden for reviewers.

Another thing I remembered (this is a comment primarily on the
original I wrote based on 'all world is ASCII' mindset that led to
the use of strlen() as a display-width indicator) in the code is
that we "abbreviate" an overly long pathname and transform renames
that originally is in the a/b/c -> a/B/c form into a/{b->B}/c form,
and IIRC they are all byte based.  The latter may be OK because the
transformation is limited to '/' boundary, but the former may chomp
a single multi-byte letter in the middle, which would need to be
corrected as a part of this change.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
                   ` (2 preceding siblings ...)
  2022-08-27  8:50 ` [PATCH v2 " tboegi
@ 2022-09-02  4:21 ` tboegi
  2022-09-02  9:39   ` Johannes Schindelin
  2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: tboegi @ 2022-09-02  4:21 UTC (permalink / raw)
  To: git, alexander.s.m, Johannes.Schindelin; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exists for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

Side note 1:
Needed changes for this fix are split into 2 commits:
This commit only changes strlen() into utf8_strwidth() in diff.c:
The next commit will add tests and further needed changes.

Side note 2:
Junio C Hamano suspects that there is probably more work to be done,
in a separate commit:
Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
and "transforms" renames lines like
"a/b/c -> a/B/c" into the shorter
"a/{b->B}/c" form, and IIRC this is all byte based.

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..b5df464de5 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth()
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
                   ` (3 preceding siblings ...)
  2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
@ 2022-09-02  4:21 ` tboegi
  2022-09-02 10:12   ` Johannes Schindelin
  2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 42+ messages in thread
From: tboegi @ 2022-09-02  4:21 UTC (permalink / raw)
  To: git, alexander.s.m, Johannes.Schindelin; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

The last commit uses utf8_strwidth() instead of strlen() in diff.c
and it is time to test the change.

And now we detect another problem that is fixed here: code like this
strbuf_addf(&out, "%-*s", len, name);
(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count.

One could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The chosen solution is to split code in diff.c like this
strbuf_addf(&out, "%-*s", len, name);

into something like this:
size_t num_padding_spaces = 0;
// [snip]
if (len > utf8_strwidth(name))
    num_padding_spaces = len - utf8_strwidth(name);
strbuf_addf(&out, "%s", name);
if (num_padding_spaces)
    strbuf_addchars(&out, ' ', num_padding_spaces);

I couldn't convince myself to write a wrapper here that is
"easy to read and understandable" and would fit nicely into the chain of
strbuf_addX() calls used in diff.c

Tests:
Two things need to be tested:
 - The calculation of the maximum width
 - The calculation of num_padding_spaces

The name "textfile" is changed into "tëxtfilë", both have a width of 8.
If strlen() was used, to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile    |  [snip]
 tëxtfilë | [snip]

If only "binfile" would be renamed into "binfilë":
 binfilë |  [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfilë", giving 1 bytes more in strlen()
"tëxtfile" is renamed into "tëxtfilë", 2 byte more in strlen().

The updated t4012-diff-binary.sh checks the correct aligment:
 binfilë  | [snip]
 tëxtfilë | [snip]

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 33 +++++++++++++++++++++------------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/diff.c b/diff.c
index b5df464de5..cf38e1dc88 100644
--- a/diff.c
+++ b/diff.c
@@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
 static void show_stats(struct diffstat_t *data, struct diff_options *options)
 {
 	int i, len, add, del, adds = 0, dels = 0;
-	uintmax_t max_change = 0, max_len = 0;
+	uintmax_t max_change = 0, max_width = 0;
 	int total_files = data->nr, count;
 	int width, name_width, graph_width, number_width = 0, bin_width = 0;
 	const char *reset, *add_c, *del_c;
@@ -2621,8 +2621,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		}
 		fill_print_name(file);
 		len = utf8_strwidth(file->print_name);
-		if (max_len < len)
-			max_len = len;
+		if (max_width < len)
+			max_width = len;

 		if (file->is_unmerged) {
 			/* "Unmerged" is 8 characters */
@@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)

 	/*
 	 * We have width = stat_width or term_columns() columns total.
-	 * We want a maximum of min(max_len, stat_name_width) for the name part.
+	 * We want a maximum of min(max_width, stat_name_width) for the name part.
 	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
 	 * We also need 1 for " " and 4 + decimal_width(max_change)
 	 * for " | NNNN " and one the empty column at the end, altogether
@@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		graph_width = options->stat_graph_width;

 	name_width = (options->stat_name_width > 0 &&
-		      options->stat_name_width < max_len) ?
-		options->stat_name_width : max_len;
+		      options->stat_name_width < max_width) ?
+		options->stat_name_width : max_width;

 	/*
 	 * Adjust adjustable widths not to exceed maximum width
@@ -2734,6 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
+		size_t num_padding_spaces = 0;
 		int name_len;

 		if (!file->is_interesting && (added + deleted == 0))
@@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		if (len > utf8_strwidth(name))
+			num_padding_spaces = len - utf8_strwidth(name);

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addf(&out, "| %*s", number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addstr(&out, "| Unmerged\n");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
+		strbuf_addf(&out, " %s%s ", prefix,  name);
+		if (num_padding_spaces)
+			strbuf_addchars(&out, ' ', num_padding_spaces);
+		strbuf_addf(&out, "| %*"PRIuMAX"%s",
 			number_width, added + deleted,
 			added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
index c509143c81..c64d9d2f40 100755
--- a/t/t4012-diff-binary.sh
+++ b/t/t4012-diff-binary.sh
@@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
 '

 cat >expect <<EOF
- binfile  |   Bin 0 -> 1026 bytes
- textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ binfilë  |   Bin 0 -> 1026 bytes
+ tëxtfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 EOF

 test_expect_success 'diff --stat with binary files and big change count' '
-	printf "\01\00%1024d" 1 >binfile &&
-	git add binfile &&
+	printf "\01\00%1024d" 1 >binfilë &&
+	git add binfilë &&
 	i=0 &&
 	while test $i -lt 10000; do
 		echo $i &&
 		i=$(($i + 1)) || return 1
-	done >textfile &&
-	git add textfile &&
-	git diff --cached --stat binfile textfile >output &&
+	done >tëxtfilë &&
+	git add tëxtfilë &&
+	git -c core.quotepath=false diff --cached --stat binfilë tëxtfilë >output &&
 	grep " | " output >actual &&
 	test_cmp expect actual
 '
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
  2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
@ 2022-09-02  9:39   ` Johannes Schindelin
  0 siblings, 0 replies; 42+ messages in thread
From: Johannes Schindelin @ 2022-09-02  9:39 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m

Hi Torsten,

On Fri, 2 Sep 2022, tboegi@web.de wrote:

> diff --git a/diff.c b/diff.c
> index 974626a621..b5df464de5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);

So this is no longer a length (in bytes) but a width (in columns).

In 2/2, a similar change incurs renaming `max_len` to `max_width`.

I would prefer for 1/2 and 2/2 to be on the same page here: either they
both rename variables that have `len` in their name but are actually about
a width (in columns), or neither of the patches rename these variables.

Thanks,
Dscho

>  		if (max_len < len)
>  			max_len = len;
>
> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> --
> 2.34.0
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-29 17:54     ` Torsten Bögershausen
  2022-08-29 18:37       ` Junio C Hamano
@ 2022-09-02  9:47       ` Johannes Schindelin
  1 sibling, 0 replies; 42+ messages in thread
From: Johannes Schindelin @ 2022-09-02  9:47 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m

[-- Attachment #1: Type: text/plain, Size: 2915 bytes --]

Hi Torsten,

On Mon, 29 Aug 2022, Torsten Bögershausen wrote:

> On Mon, Aug 29, 2022 at 02:04:42PM +0200, Johannes Schindelin wrote:
> > >
> > > The choosen solution is to split code in diff.c like this
> > >
> > > strbuf_addf(&out, "%-*s", len, name);
> > >
> > > into something like this:
> > >
> > > size_t num_padding_spaces = 0;
> > > // [snip]
> > > if (len > utf8_strwidth(name))
> > >     num_padding_spaces = len - utf8_strwidth(name);
> > > strbuf_addf(&out, "%s", name);
> > > if (num_padding_spaces)
> > >     strbuf_addchars(&out, ' ', num_padding_spaces);
> >
> > ... this sounds like it would benefit from beinv refactored into a
> > separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
> > readability as well as for self-documentation.
>
> Yes, but:
> All (tm) strbuf() functions use an unsigned size_t, and are not
> tolerant against passing 0 as "do nothing".

I am missing something, as this seems not to contradict the idea of
`strbuf_add_padded()`. Simply provide the desired width as a `size_t`,
compare the width of the actual added string, and if it is shorter, pad
with spaces. At no stage does this require a signed type, all involved
values are strictly non-negative.

> >
> > Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
> > _twice_ and why we do not assign the result to a variable called `width`
> > and then have a conditional like
> >
> > 	if (width < len) /* pad to `len` columns */
> > 		strbuf_addchars(&out, ' ' , len - width);
> >
> > instead. That would sound more logical to me.
>
> This is caused by the logic in diff.c:
>   /*
>    * Find the longest filename and max number of changes
>    */
>    for (i = 0; (i < count) && (i < data->nr); i++) {
>        struct diffstat_file *file = data->files[i];
>        [snip]
>        len = utf8_strwidth(file->print_name);
>        if (max_width < len)
>           max_width = len;
> // and later
>     /*
>      * From here name_width is the width of the name area,
>      * and graph_width is the width of the graph area.
>      * max_change is used to scale graph properly.
>      */
>     for (i = 0; i < count; i++) {
>     /*
>      * "scale" the filename
>      */
>      // TB: Which means either shortening it with ...
>      // Or padding it, if needed, and here we need
>      // another
>      name_len = utf8_strwidth(name);

I was referring to this part of the commit message:

	if (len > utf8_strwidth(name))
		num_padding_spaces = len - utf8_strwidth(name);

Here, we evaluate `utf8_strwidth(name)`, compare it to `len`, and if the
former was smaller, we evaluate the same function call _again_.

What my feedback intended to suggest was to store the result and reuse it:

	name_width = utf8_strwidth(name);
	if (name_width < len)
		num_padding_spaces = len - name_width;

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth()
  2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
@ 2022-09-02 10:12   ` Johannes Schindelin
  0 siblings, 0 replies; 42+ messages in thread
From: Johannes Schindelin @ 2022-09-02 10:12 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m

Hi Torsten,

On Fri, 2 Sep 2022, tboegi@web.de wrote:

> diff --git a/diff.c b/diff.c
> index b5df464de5..cf38e1dc88 100644
> --- a/diff.c
> +++ b/diff.c
> [...]
> @@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		if (len > utf8_strwidth(name))
> +			num_padding_spaces = len - utf8_strwidth(name);

Here, we determine how many spaces are needed for padding. The value is
later used in three instances, and from the diff it is not immediately
obvious that all code paths are covered. I did verify locally that this is
the case, though, so all is good.

>
>  		if (file->is_binary) {
> -			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -			strbuf_addf(&out, " %*s", number_width, "Bin");

This was already a bit wasteful by calling `strbuf_addf()` twice, where
one time would have sufficed. (This applies to the other two code paths
below, too.)

> +			strbuf_addf(&out, " %s%s ", prefix,  name);
> +			if (num_padding_spaces)
> +				strbuf_addchars(&out, ' ', num_padding_spaces);
> +			strbuf_addf(&out, "| %*s", number_width, "Bin");

Instead of fixing this, we now add yet another `strbuf*()` call.

But this could be done more elegantly, via a single `strbuf_addf()` call:

			strbuf_addf(&out, "%s%s%*s | %*s",
				    prefix, name, num_padding_spaces, "",
				    number_width, "Bin");

By the way, it would flow much better, I think, if we used the
short-and-sweet variable name `padding` instead of `num_padding_spaces`.

>  			if (!added && !deleted) {
>  				strbuf_addch(&out, '\n');
>  				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
> @@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		else if (file->is_unmerged) {
> -			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -			strbuf_addstr(&out, " Unmerged\n");
> +			strbuf_addf(&out, " %s%s ", prefix,  name);
> +			if (num_padding_spaces)
> +				strbuf_addchars(&out, ' ', num_padding_spaces);
> +			strbuf_addstr(&out, "| Unmerged\n");

This can become

			strbuf_addf(&out, " %s%s%*s | Unmerged",
				    prefix, name, padding, "");

instead.

>  			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
>  					 out.buf, out.len, 0);
>  			strbuf_reset(&out);
> @@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  				add = total - del;
>  			}
>  		}
> -		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
> -		strbuf_addf(&out, " %*"PRIuMAX"%s",
> +		strbuf_addf(&out, " %s%s ", prefix,  name);
> +		if (num_padding_spaces)
> +			strbuf_addchars(&out, ' ', num_padding_spaces);
> +		strbuf_addf(&out, "| %*"PRIuMAX"%s",
>  			number_width, added + deleted,
>  			added + deleted ? " " : "");

And this reads better as

		strbuf_addf(&out, " %s%s%*s | %*"PRIuMAX"%s",
			    prefix, name, padding, "",
			    number_width, added + deleted,
			    added + deleted ? " " : "");

If we modify the code in this manner, we avoid repeating a pretty
unreadable pattern three times, using a much more readable pattern
instead.

Random note: The existing code (not your fault) is hard to follow because
it calls `show_graph()` for `add` and `del` always, even if their counts
are zero (in which case `show_graph()` returns early), while the
separating space is appended in the otherwise unrelated `strbuf_addf()`
call before that, but uses the (unscaled) `added + deleted` as condition
for that separator. It would be much easier to follow like this:

		strbuf_addf(&out, " %s%s%*s | %*"PRIuMAX",
			    prefix, name, padding, "",
			    number_width, added + deleted);

		if (add || del) {
			strbuf_addch(&out, ' ');
			show_graph(&out, '+', add, add_c, reset);
			show_graph(&out, '-', del, del_c, reset);
		}

But I consider this #leftoverbits, not something to burden your
contribution with.

Ciao,
Dscho

>  		show_graph(&out, '+', add, add_c, reset);

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
                   ` (4 preceding siblings ...)
  2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
@ 2022-09-03  5:39 ` tboegi
  2022-09-05 20:46   ` Junio C Hamano
  2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
  2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
  7 siblings, 1 reply; 42+ messages in thread
From: tboegi @ 2022-09-03  5:39 UTC (permalink / raw)
  To: git, alexander.s.m, Johannes.Schindelin; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exists for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

Side note 1:
Needed changes for this fix are split into 2 commits:
This commit only changes strlen() into utf8_strwidth() in diff.c:
The next commit will add tests and further needed changes.

Side note 2:
Junio C Hamano suspects that there is probably more work to be done,
in a separate commit:
Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
and "transforms" renames lines like
"a/b/c -> a/B/c" into the shorter
"a/{b->B}/c" form, and IIRC this is all byte based.

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..b5df464de5 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth()
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
                   ` (5 preceding siblings ...)
  2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
@ 2022-09-03  5:39 ` tboegi
  2022-09-05 10:13   ` Johannes Schindelin
  2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
  7 siblings, 1 reply; 42+ messages in thread
From: tboegi @ 2022-09-03  5:39 UTC (permalink / raw)
  To: git, alexander.s.m, Johannes.Schindelin; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

The last commit uses utf8_strwidth() instead of strlen() in diff.c
and it is time to test the change.

And now we detect another problem that is fixed here: code like this
strbuf_addf(&out, "%-*s", len, name);
(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count.

One could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The basic idea is to change code in diff.c like this
strbuf_addf(&out, "%-*s", len, name);

into something like this:
int padding = len - utf8_strwidth(name);
if (padding < 0)
	padding = 0;
strbuf_addf(&out, " %s%*s", name, padding, "");

The real change is slighty bigger, as it, as well, integrates two calls
of strbuf_addf() into one.

Tests:
Two things need to be tested:
 - The calculation of the maximum width
 - The calculation of padding

The name "textfile" is changed into "tëxtfilë", both have a width of 8.
If strlen() was used, to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile    | [snip]
 tëxtfilë | [snip]

If only "binfile" would be renamed into "binfilë":
 binfilë | [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfilë", giving 1 bytes more in strlen()
"tëxtfile" is renamed into "tëxtfilë", 2 byte more in strlen().

The updated t4012-diff-binary.sh checks the correct aligment:
 binfilë  | [snip]
 tëxtfilë | [snip]

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 23 ++++++++++++++---------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/diff.c b/diff.c
index b5df464de5..35b9da90fe 100644
--- a/diff.c
+++ b/diff.c
@@ -2734,7 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
-		int name_len;
+		int name_len, padding;

 		if (!file->is_interesting && (added + deleted == 0))
 			continue;
@@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		padding = len - utf8_strwidth(name);
+		if (padding < 0)
+			padding = 0;

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s%*s | %*s",
+				    prefix, name, padding, "",
+				    number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2780,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s%*s | %*s",
+				    prefix, name, padding, "",
+				    number_width, "Unmerged");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,10 +2808,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
-			number_width, added + deleted,
-			added + deleted ? " " : "");
+		strbuf_addf(&out, " %s%s%*s | %*"PRIuMAX"%s",
+			    prefix, name, padding, "",
+			    number_width, added + deleted,
+			    added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
 		show_graph(&out, '-', del, del_c, reset);
 		strbuf_addch(&out, '\n');
diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
index c509143c81..c64d9d2f40 100755
--- a/t/t4012-diff-binary.sh
+++ b/t/t4012-diff-binary.sh
@@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
 '

 cat >expect <<EOF
- binfile  |   Bin 0 -> 1026 bytes
- textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ binfilë  |   Bin 0 -> 1026 bytes
+ tëxtfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 EOF

 test_expect_success 'diff --stat with binary files and big change count' '
-	printf "\01\00%1024d" 1 >binfile &&
-	git add binfile &&
+	printf "\01\00%1024d" 1 >binfilë &&
+	git add binfilë &&
 	i=0 &&
 	while test $i -lt 10000; do
 		echo $i &&
 		i=$(($i + 1)) || return 1
-	done >textfile &&
-	git add textfile &&
-	git diff --cached --stat binfile textfile >output &&
+	done >tëxtfilë &&
+	git add tëxtfilë &&
+	git -c core.quotepath=false diff --cached --stat binfilë tëxtfilë >output &&
 	grep " | " output >actual &&
 	test_cmp expect actual
 '
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth()
  2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
@ 2022-09-05 10:13   ` Johannes Schindelin
  0 siblings, 0 replies; 42+ messages in thread
From: Johannes Schindelin @ 2022-09-05 10:13 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m

Hi Torsten,

thank you for working on a new iteration!

On Sat, 3 Sep 2022, tboegi@web.de wrote:

> [...]
> diff --git a/diff.c b/diff.c
> index b5df464de5..35b9da90fe 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2734,7 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		char *name = file->print_name;
>  		uintmax_t added = file->added;
>  		uintmax_t deleted = file->deleted;
> -		int name_len;
> +		int name_len, padding;

I had a look and `len` is also declard as an `int`.

>
>  		if (!file->is_interesting && (added + deleted == 0))
>  			continue;
> @@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		padding = len - utf8_strwidth(name);
> +		if (padding < 0)
> +			padding = 0;

I would have had a slight preference for something like this:

		int name_len = utf8_strwidth(name);
		int padding = name_len < len ? len - name_len : 0;

i.e. avoid the potentially negative difference. (Ideally, I would have
liked the type to be changed to `size_t`, but that is impractical due to
the variables' use in `%.*s` formats.)

But it is not worth a new iteration on its own, and I am very happy with
the current iteration.

Thanks!
Dscho

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
  2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
@ 2022-09-05 20:46   ` Junio C Hamano
  2022-09-07  4:30     ` Torsten Bögershausen
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-09-05 20:46 UTC (permalink / raw)
  To: tboegi; +Cc: git, alexander.s.m, Johannes.Schindelin

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
> Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1

Given 2/2 does not share a similar title, "part1" sounds somewhat
strange.  In any case, 'when appropriate,' is probalby best unsaid,
as it is almost a given.  We won't deliberately use something that
is not appropriate on purpose anyway.  Even if we =were to keep that
word, downcase "When".


> When unicode filenames (encoded in UTF-8) are used, the visible width
> on the screen is not the same as strlen(filename).
>
> For example, `git log --stat` may produce an output like this:
>
> [snip the header]
>
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>  2 files changed, 2 insertions(+)
>
> A side note: the original report was about cyrillic filenames.
> After some investigations it turned out that
> a) This is not a problem with "ambiguous characters" in unicode
> b) The same problem exists for all unicode code points (so we
>   can use Latin based Umlauts for demonstrations below)
>
> The 'Ä' takes the same space on the screen as the 'A'.
> But needs one more byte in memory, so the the `git log --stat` output
> for "Arger.txt" (!) gets mis-aligned:
> The maximum length is derived from "Ärger.txt", 10 bytes in memory,
> 9 positions on the screen. That is why "Arger.txt" gets one extra ' '
> for aligment, it needs 9 bytes in memory.
> If there was a file "Ö", it would be correctly aligned by chance,
> but "Öhö" would not.
>
> The solution is of course, to use utf8_strwidth() instead of strlen()
> when dealing with the width on screen.
>
> Side note 1:
> Needed changes for this fix are split into 2 commits:
> This commit only changes strlen() into utf8_strwidth() in diff.c:
> The next commit will add tests and further needed changes.

I am not sure if it makes sense to split them into two.  It is hard
for us to demonistrate the need for this step if it does not come
with its own test.

> Side note 2:
> Junio C Hamano suspects that there is probably more work to be done,
> in a separate commit:
> Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
> and "transforms" renames lines like
> "a/b/c -> a/B/c" into the shorter
> "a/{b->B}/c" form, and IIRC this is all byte based.

I already said that I suspect {b->B} conversion is OK, so the side
note is probably more noise than being useful.
>
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 974626a621..b5df464de5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;
>
> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> --
> 2.34.0

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
  2022-09-05 20:46   ` Junio C Hamano
@ 2022-09-07  4:30     ` Torsten Bögershausen
  2022-09-07 18:31       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-09-07  4:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, alexander.s.m, Johannes.Schindelin

On Mon, Sep 05, 2022 at 01:46:57PM -0700, Junio C Hamano wrote:
> tboegi@web.de writes:
>
> > From: Torsten Bögershausen <tboegi@web.de>
> > Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
>
> Given 2/2 does not share a similar title, "part1" sounds somewhat
> strange.  In any case, 'when appropriate,' is probalby best unsaid,
> as it is almost a given.  We won't deliberately use something that
> is not appropriate on purpose anyway.  Even if we =were to keep that
> word, downcase "When".

Yes, agreed. In short: I will make a new patch the next weeks,
in one commit (again). (That can take some days or weeks)

Thanks to Dscho for his patience with the strbuf() improvements.
I think that I tried a "%*s" version, but couldn't get that to work.

> > Side note 2:
> > Junio C Hamano suspects that there is probably more work to be done,
> > in a separate commit:
> > Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
> > and "transforms" renames lines like
> > "a/b/c -> a/B/c" into the shorter
> > "a/{b->B}/c" form, and IIRC this is all byte based.
>
> I already said that I suspect {b->B} conversion is OK, so the side
> note is probably more noise than being useful.

OK - the comment can be removed.

I didn't know how to read this comment:
>...but the former may chomp a single multi-byte letter in the middle,
> which would need to be corrected as a part of this change.

After diffing into the code some more times, I think that we don't
chomp a single byte out of an UTF-8 sequence.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
  2022-09-07  4:30     ` Torsten Bögershausen
@ 2022-09-07 18:31       ` Junio C Hamano
  0 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2022-09-07 18:31 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m, Johannes.Schindelin

Torsten Bögershausen <tboegi@web.de> writes:

>
> OK - the comment can be removed.
>
> I didn't know how to read this comment:
>>...but the former may chomp a single multi-byte letter in the middle,
>> which would need to be corrected as a part of this change.
>
> After diffing into the code some more times, I think that we don't
> chomp a single byte out of an UTF-8 sequence.

When turning a/b/c vs a/B/c into a/{b->B}/c, two steps are involved.
Take common prefix and suffix (in this case 'a' and 'c') and turn
'b' vs 'B' into {b->B} is one step.  The other is what to do when
prefix and suffix are long.  After turning aaaaa/b/c vs aaaaa/B/c
into aaaaa/{b->B}/c, if the result is overly long, how we shorten
the prefix (i.e. aaaaa) and the suffix?

I knew the code that produces {b->B} honored '/' boundary, but I
just did not remember offhand what diff.c::pprint_rename() did in
its latter half, specifically, if it just chomped pfx and sfx as a
sequence of bytes (which would have been wrong) or insisted that the
common sequence search honors '/' boundary (which would be OK, as
byte '/' will not appear in the middle of a single multi-byte UTF-8
"letter").  I think iti s doing the latter, so it should be fine.

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
                   ` (6 preceding siblings ...)
  2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
@ 2022-09-14 15:13 ` tboegi
  2022-09-14 16:40   ` Junio C Hamano
  2022-09-15  2:57   ` Junio C Hamano
  7 siblings, 2 replies; 42+ messages in thread
From: tboegi @ 2022-09-14 15:13 UTC (permalink / raw)
  To: git, alexander.s.m, Johannes.Schindelin; +Cc: Torsten Bögershausen

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen().

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exists for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

And then there is another problem, code like this:
strbuf_addf(&out, "%-*s", len, name);
(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count.

One could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The basic idea is to change code in diff.c like this
strbuf_addf(&out, "%-*s", len, name);

into something like this:
int padding = len - utf8_strwidth(name);
if (padding < 0)
	padding = 0;
strbuf_addf(&out, " %s%*s", name, padding, "");

The real change is slighty bigger, as it, as well, integrates two calls
of strbuf_addf() into one.

Tests:
Two things need to be tested:
 - The calculation of the maximum width
 - The calculation of padding

The name "textfile" is changed into "tëxtfilë", both have a width of 8.
If strlen() was used, to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile    | [snip]
 tëxtfilë | [snip]

If only "binfile" would be renamed into "binfilë":
 binfilë | [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfilë", giving 1 bytes more in strlen()
"tëxtfile" is renamed into "tëxtfilë", 2 byte more in strlen().

The updated t4012-diff-binary.sh checks the correct aligment:
 binfilë  | [snip]
 tëxtfilë | [snip]

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 27 ++++++++++++++++-----------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..35b9da90fe 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2734,7 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
-		int name_len;
+		int name_len, padding;

 		if (!file->is_interesting && (added + deleted == 0))
 			continue;
@@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
@@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		padding = len - utf8_strwidth(name);
+		if (padding < 0)
+			padding = 0;

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s%*s | %*s",
+				    prefix, name, padding, "",
+				    number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2780,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s%*s | %*s",
+				    prefix, name, padding, "",
+				    number_width, "Unmerged");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,10 +2808,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
-			number_width, added + deleted,
-			added + deleted ? " " : "");
+		strbuf_addf(&out, " %s%s%*s | %*"PRIuMAX"%s",
+			    prefix, name, padding, "",
+			    number_width, added + deleted,
+			    added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
 		show_graph(&out, '-', del, del_c, reset);
 		strbuf_addch(&out, '\n');
diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
index c509143c81..c64d9d2f40 100755
--- a/t/t4012-diff-binary.sh
+++ b/t/t4012-diff-binary.sh
@@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
 '

 cat >expect <<EOF
- binfile  |   Bin 0 -> 1026 bytes
- textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ binfilë  |   Bin 0 -> 1026 bytes
+ tëxtfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 EOF

 test_expect_success 'diff --stat with binary files and big change count' '
-	printf "\01\00%1024d" 1 >binfile &&
-	git add binfile &&
+	printf "\01\00%1024d" 1 >binfilë &&
+	git add binfilë &&
 	i=0 &&
 	while test $i -lt 10000; do
 		echo $i &&
 		i=$(($i + 1)) || return 1
-	done >textfile &&
-	git add textfile &&
-	git diff --cached --stat binfile textfile >output &&
+	done >tëxtfilë &&
+	git add tëxtfilë &&
+	git -c core.quotepath=false diff --cached --stat binfilë tëxtfilë >output &&
 	grep " | " output >actual &&
 	test_cmp expect actual
 '
--
2.34.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
@ 2022-09-14 16:40   ` Junio C Hamano
  2022-09-26 18:43     ` Torsten Bögershausen
  2022-09-15  2:57   ` Junio C Hamano
  1 sibling, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-09-14 16:40 UTC (permalink / raw)
  To: tboegi; +Cc: git, alexander.s.m, Johannes.Schindelin

> The basic idea is to change code in diff.c like this
> strbuf_addf(&out, "%-*s", len, name);
>
> into something like this:
> int padding = len - utf8_strwidth(name);
> if (padding < 0)
> 	padding = 0;
> strbuf_addf(&out, " %s%*s", name, padding, "");
> ...
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c                 | 27 ++++++++++++++++-----------
>  t/t4012-diff-binary.sh | 14 +++++++-------
>  2 files changed, 23 insertions(+), 18 deletions(-)

> diff --git a/diff.c b/diff.c
> index 974626a621..35b9da90fe 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;

The changes in this patch are isolated to the show_stats() helper
function, and looking at use of "len", "max_len", and "name_len", it
may be a good clean-up to make them based on "width".  A bit of care
needs to be taken because the way existing variables are used is a
bit convoluted at times:

 - "width" already exists.  "len" and "max_len" are used in an early
   loop to eventually derive "name_width".

 - "len" is later used in the loop for each pathname to hold a copy
   of "name_width" that can locally be adjusted to accomodate "..."
   abbreviation/munging of the pathname.

 - "name_width" already exists in addition to "name_len".  The
   former holds how many display columns a pathname can occupy in the
   diffstat output, while the latter is used in a loop to hold the
   display columns of the pathname each iteration is looking at, to
   see if it is wider than "name_width" (in which case there is the
   "..." abbreviation that is NOT UTF-8 aware even after this patch)
   or narrower (in which case we'd do the padding).  As the existing
   "name_width" is how we want to name our variables (i.e. the width
   allocated for names), the "name_len", if we were to follow "len
   misleads us to think it is byte length, so use width instead",
   would need to become something like "this_name_width" (i.e. the
   width of the name of the pathname in this iteration of the loop).

But I am OK to do WITHOUT any such renaming, and I do not want to
see such renaming in the same patch ("preliminary clean-up" or
"clean-up after the dust settles" are good, thoguh).  Counting
display columns correctly is more important.

I think I spotted two remaining "bugs" that are left unfixed with
this patch..

There is "stat_width is -1 (auto)" case, which reads like so:

	if (options->stat_width == -1)
		width = term_columns() - strlen(line_prefix);
	else
		width = options->stat_width ? options->stat_width : 80;

Here line_prefix eventually comes from the "git log --graph" and
shows the colored graph segments on the same output line as the
diffstat.

This patch is probably not making anything worse, but by leaving it
strlen(), it is likely overcounting the width of it.  We can
presumably use utf8_strnwidth() that can optionally be told to be
aware of the ANSI color sequence to count its width correctly to fix
it.

> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";

The code around here between this and the next hunk needs cleaning up.

		if (name_width < name_len) {
			char *slash;
			prefix = "...";
			len -= 3;
			name += name_len - len;
			slash = strchr(name, '/');
			if (slash)
				name = slash;
		}

We found the display columns of the current item "name_len" is wider
than what we allocated "name_width" for the names.  We are going to
chomp as many pathname components from the front as needed, at '/'
boundary, to turn "aaaa/bbbb/cccc/dddd.txt" into ".../cccc/dddd.txt"
to make the result fit.

But the way to ensure that '/' before "cccc" is the one we want (as
opposed to the one before "bbbb" or "dddd") is initially based on
columns (i.e. because we want "...", we first subtract 3 from len
which is a local synonym for name_width and then subtract that from
"name_len", i.e. we ask: "how many display columns do we have in the
current pathname that is excess of what we can afford to allocate?"
The intention is to skip that many columns from the beginning of "name"
and start looking for '/' from there.

But we move "name" pointer by that many *bytes*!  We end up scanning
starting at a middle of a character.  What we look for is '/' and when
we find it we know the byte is a standalone character, so we do not
chomp a character in the middle, but it is very likely that we find
a slash that leaves the remaining string still too long, because
skipping say 2 columns may need skipping 4 bytes, but we only
skipped the same number of bytes as the number of columns we need to
skip.

This is the other remaining bug.

I think this needs to become a loop that loops while the width of
the current suffix is still wider than we can afford, discarding one
leading pathname component at a time at '/', measuring the resulting
width, or something like that.  Something along the lines of this
not-even-compile-tested sketch:

        /* we assume strlen(prefix) == utf8_strwidth(prefix) */
	while (name_width < utf8_strwidth(name) + strlen(prefix)) {
		char *slash;
		if (name[0] == '/')
			name++;
                slash = strchr(name);
		if (slash)
			name = slash;
		else
			break; /* Give Up */
		prefix = "...";
	}

> @@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		padding = len - utf8_strwidth(name);
> +		if (padding < 0)
> +			padding = 0;

Here "len" cannot become "name_length" because the former has to be
narrower than the latter by the display width of "prefix"
(i.e. "..."), so while this looks "strange", it is correct.

I think the remainder of the patch I did not quote looked quite
straight-forward and correct.

Thanks for working on this topic.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
  2022-09-14 16:40   ` Junio C Hamano
@ 2022-09-15  2:57   ` Junio C Hamano
  1 sibling, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2022-09-15  2:57 UTC (permalink / raw)
  To: tboegi; +Cc: git, alexander.s.m, Johannes.Schindelin

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
> Subject: Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()

Let's retitle it to "diff.c: use utf8_strwidth() to count display width".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-09-14 16:40   ` Junio C Hamano
@ 2022-09-26 18:43     ` Torsten Bögershausen
  2022-10-10 21:58       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-09-26 18:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, alexander.s.m, Johannes.Schindelin

On Wed, Sep 14, 2022 at 09:40:04AM -0700, Junio C Hamano wrote:

[]

> I think I spotted two remaining "bugs" that are left unfixed with
> this patch..
>
> There is "stat_width is -1 (auto)" case, which reads like so:
>
> 	if (options->stat_width == -1)
> 		width = term_columns() - strlen(line_prefix);
> 	else
> 		width = options->stat_width ? options->stat_width : 80;
>
> Here line_prefix eventually comes from the "git log --graph" and
> shows the colored graph segments on the same output line as the
> diffstat.
>
> This patch is probably not making anything worse, but by leaving it
> strlen(), it is likely overcounting the width of it.  We can
> presumably use utf8_strnwidth() that can optionally be told to be
> aware of the ANSI color sequence to count its width correctly to fix
> it.

[]
> This is the other remaining bug.

[]

> I think the remainder of the patch I did not quote looked quite
> straight-forward and correct.
>
> Thanks for working on this topic.

How should we proceed here ?
This patch fixes one, and only one, reported bug,
which is now verfied by a test case using unicode instead of ASCII.
Fixing additional bugs in diff.c (or anywhere else) had never been
part of this.

Things that needs more fixing and cleanups had been layed out as the
result of a review, that is good.

"git log --graph" was mentioned.
Do we have test cases, that test this ?
How easy are they converted into unicode instead of ASCII ?

I am not even sure, if I ever used "git log --graph" myself.
Digging further here, is somewhat out of my scope.
At least for the moment.






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-09-26 18:43     ` Torsten Bögershausen
@ 2022-10-10 21:58       ` Junio C Hamano
  2022-10-20 15:46         ` Torsten Bögershausen
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-10-10 21:58 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m, Johannes.Schindelin

Torsten Bögershausen <tboegi@web.de> writes:

> On Wed, Sep 14, 2022 at 09:40:04AM -0700, Junio C Hamano wrote:
>
> []
>
>> I think I spotted two remaining "bugs" that are left unfixed with
>> this patch..
>> ...
> How should we proceed here ?
> This patch fixes one, and only one, reported bug,

But then two more were reported in the message you are responding
to, and they stem from the same underlying logic bug where byte
count and display columns are mixed interchangeably.

> "git log --graph" was mentioned.
> Do we have test cases, that test this ?
> How easy are they converted into unicode instead of ASCII ?

The graph stuff pushes your "start of line" to the right, making the
available screen real estate narrower.  I do not think in the
current code we need to worry about unicode vs ascii (IIRC, we stick
to ASCII graphics while drawing lines), but we do need to take into
account the fact that ANSI COLOR escape sequences have non-zero byte
count while occupying zero display columns.

The other bug about the code that finds which / to use to abbreviate
a long pathname on diffstat lines does involve byte vs column that
comes from unicode.  From the bug description in the message you are
responding to, if we have a directory name whose display columns and
byte count are significantly different, the end result by chopping
with the current code would end up wider than it should be, which
sounds like a recipe to cook up a test case to me.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-10-10 21:58       ` Junio C Hamano
@ 2022-10-20 15:46         ` Torsten Bögershausen
  2022-10-20 17:43           ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-10-20 15:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, alexander.s.m, Johannes.Schindelin

On Mon, Oct 10, 2022 at 02:58:26PM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > On Wed, Sep 14, 2022 at 09:40:04AM -0700, Junio C Hamano wrote:
> >
> > []
> >
> >> I think I spotted two remaining "bugs" that are left unfixed with
> >> this patch..
> >> ...
> > How should we proceed here ?
> > This patch fixes one, and only one, reported bug,
>
> But then two more were reported in the message you are responding
> to, and they stem from the same underlying logic bug where byte
> count and display columns are mixed interchangeably.
>
> > "git log --graph" was mentioned.
> > Do we have test cases, that test this ?
> > How easy are they converted into unicode instead of ASCII ?
>
> The graph stuff pushes your "start of line" to the right, making the
> available screen real estate narrower.  I do not think in the
> current code we need to worry about unicode vs ascii (IIRC, we stick
> to ASCII graphics while drawing lines), but we do need to take into
> account the fact that ANSI COLOR escape sequences have non-zero byte
> count while occupying zero display columns.
>
> The other bug about the code that finds which / to use to abbreviate
> a long pathname on diffstat lines does involve byte vs column that
> comes from unicode.  From the bug description in the message you are
> responding to, if we have a directory name whose display columns and
> byte count are significantly different, the end result by chopping
> with the current code would end up wider than it should be, which
> sounds like a recipe to cook up a test case to me.
>


I couldn't find how to trigger this code path.
The `git log --graph` help says:
--graph
    Draw a text-based graphical representation of the commit history
    on the left hand side of the output.
    This may cause extra lines to be printed in between commits,
    in order for the graph history to be drawn properly.
    Cannot be combined with --no-walk.

There is no indication about filenames or diffs in the
resultet output.
If someone has time and knowledge to cook up a test case,
that would help.

For the moment, I don't have enough spare time to spend on digging
how to write this test case, that's the sad part of the story.
And that is probably a good start, or, to be more strict,
an absolute precondition, if I need to change another single line
in diff.c

I still haven't understood why the current patch can not move forward
on its own ?
There is a bug report, patch, a test case that verifies the fix.

What more is needed ?
To fix all other bugs/issues/limitations in diff.c ?
If yes, they need to go in separate commits anyway, or do I miss
something ?

Can we dampen the expectations a little bit ?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-10-20 15:46         ` Torsten Bögershausen
@ 2022-10-20 17:43           ` Junio C Hamano
  2022-10-21 15:19             ` Torsten Bögershausen
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-10-20 17:43 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m, Johannes.Schindelin

Torsten Bögershausen <tboegi@web.de> writes:

> What more is needed ?
> To fix all other bugs/issues/limitations in diff.c ?
> If yes, they need to go in separate commits anyway, or do I miss
> something ?

At least leave some NEEDSWORK comment in the code that is known to
need more work, to remind others that the fix in the area of the
code is not done, perhaps.  Otherwise, much of the effort in the
review gets lost.

I offhand recall at least two (please go back to the original thread
to find the details of them).  One that measures the width of
long/path/name in bytes to determine where to start chomping in the
diffstat filename (because it still mixes display columns and
bytes), and the other one that measures the width of leading graph
segment in bytes without ignoring the ANSI color sequence, which
should be using utf8_strnwidth() but is using strlen().

https://lore.kernel.org/git/xmqqpmfx52qj.fsf@gitster.g/

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-10-20 17:43           ` Junio C Hamano
@ 2022-10-21 15:19             ` Torsten Bögershausen
  2022-10-21 21:59               ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Torsten Bögershausen @ 2022-10-21 15:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, alexander.s.m, Johannes.Schindelin

On Thu, Oct 20, 2022 at 10:43:07AM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > What more is needed ?
> > To fix all other bugs/issues/limitations in diff.c ?
> > If yes, they need to go in separate commits anyway, or do I miss
> > something ?
>
> At least leave some NEEDSWORK comment in the code that is known to
> need more work, to remind others that the fix in the area of the
> code is not done, perhaps.  Otherwise, much of the effort in the
> review gets lost.
>
> I offhand recall at least two (please go back to the original thread
> to find the details of them).  One that measures the width of
> long/path/name in bytes to determine where to start chomping in the
> diffstat filename (because it still mixes display columns and
> bytes), and the other one that measures the width of leading graph
> segment in bytes without ignoring the ANSI color sequence, which
> should be using utf8_strnwidth() but is using strlen().
>
> https://lore.kernel.org/git/xmqqpmfx52qj.fsf@gitster.g/
>
> Thanks.

Good, good, good.
For the moment I don't have any spare time to spend on Git.
All your comments are noted, and I hope to get time to address them later.
If you kick out the branch from seen and the whats cooking list,
that would be fine with me.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-10-21 15:19             ` Torsten Bögershausen
@ 2022-10-21 21:59               ` Junio C Hamano
  2022-10-23 20:02                 ` Torsten Bögershausen
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-10-21 21:59 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, alexander.s.m, Johannes.Schindelin

Torsten Bögershausen <tboegi@web.de> writes:

> For the moment I don't have any spare time to spend on Git.
> All your comments are noted, and I hope to get time to address them later.
> If you kick out the branch from seen and the whats cooking list,
> that would be fine with me.

I'd rather not waste the efforts so far.  I am tempted to queue the
following on top or squash it in.

----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
Subject: [PATCH] diff: leave NEEDWORK notes in show_stats() function

The previous step made an attempt to correctly compute display
columns allocated and padded different parts of diffstat output.
There are at least two known codepaths in the function that still
mixes up display widths and byte length that need to be fixed.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diff.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/diff.c b/diff.c
index 2751cae131..1d222d87b2 100644
--- a/diff.c
+++ b/diff.c
@@ -2675,6 +2675,11 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 	 * making the line longer than the maximum width.
 	 */
 
+	/*
+	 * NEEDSWORK: line_prefix is often used for "log --graph" output
+	 * and contains ANSI-colored string.  utf8_strnwidth() should be
+	 * used to correctly count the display width instead of strlen().
+	 */
 	if (options->stat_width == -1)
 		width = term_columns() - strlen(line_prefix);
 	else
@@ -2750,6 +2755,16 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			char *slash;
 			prefix = "...";
 			len -= 3;
+			/*
+			 * NEEDSWORK: (name_len - len) counts the display
+			 * width, which would be shorter than the byte
+			 * length of the corresponding substring.
+			 * Advancing "name" by that number of bytes does
+			 * *NOT* skip over that many columns, so it is
+			 * very likely that chomping the pathname at the
+			 * slash we will find starting from "name" will
+			 * leave the resulting string still too long.
+			 */
 			name += name_len - len;
 			slash = strchr(name, '/');
 			if (slash)
-- 
2.38.1-320-g901e6a2134


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
  2022-10-21 21:59               ` Junio C Hamano
@ 2022-10-23 20:02                 ` Torsten Bögershausen
  0 siblings, 0 replies; 42+ messages in thread
From: Torsten Bögershausen @ 2022-10-23 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, alexander.s.m, Johannes.Schindelin

On Fri, Oct 21, 2022 at 02:59:09PM -0700, Junio C Hamano wrote:
> Torsten Bögershausen <tboegi@web.de> writes:
>
> > For the moment I don't have any spare time to spend on Git.
> > All your comments are noted, and I hope to get time to address them later.
> > If you kick out the branch from seen and the whats cooking list,
> > that would be fine with me.
>
> I'd rather not waste the efforts so far.  I am tempted to queue the
> following on top or squash it in.
>
> ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 -----
> Subject: [PATCH] diff: leave NEEDWORK notes in show_stats() function
>
> The previous step made an attempt to correctly compute display
> columns allocated and padded different parts of diffstat output.
> There are at least two known codepaths in the function that still
> mixes up display widths and byte length that need to be fixed.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>  diff.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/diff.c b/diff.c
> index 2751cae131..1d222d87b2 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2675,6 +2675,11 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  	 * making the line longer than the maximum width.
>  	 */
>
> +	/*
> +	 * NEEDSWORK: line_prefix is often used for "log --graph" output
> +	 * and contains ANSI-colored string.  utf8_strnwidth() should be
> +	 * used to correctly count the display width instead of strlen().
> +	 */
>  	if (options->stat_width == -1)
>  		width = term_columns() - strlen(line_prefix);
>  	else
> @@ -2750,6 +2755,16 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			char *slash;
>  			prefix = "...";
>  			len -= 3;
> +			/*
> +			 * NEEDSWORK: (name_len - len) counts the display
> +			 * width, which would be shorter than the byte
> +			 * length of the corresponding substring.
> +			 * Advancing "name" by that number of bytes does
> +			 * *NOT* skip over that many columns, so it is
> +			 * very likely that chomping the pathname at the
> +			 * slash we will find starting from "name" will
> +			 * leave the resulting string still too long.
> +			 */
>  			name += name_len - len;
>  			slash = strchr(name, '/');
>  			if (slash)


That looks good to me -
my preferred version would be a patch on it's own on top.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2022-10-23 20:08 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
2022-08-09 18:20 ` Calvin Wan
2022-08-09 19:03   ` Alexander Meshcheryakov
2022-08-09 21:36     ` Calvin Wan
2022-08-10  5:55   ` Junio C Hamano
2022-08-10  8:40     ` Torsten Bögershausen
2022-08-10  8:56       ` Alexander Meshcheryakov
2022-08-10  9:51         ` Torsten Bögershausen
2022-08-10 11:41           ` Torsten Bögershausen
2022-08-10 15:53       ` Junio C Hamano
2022-08-10 17:35         ` Torsten Bögershausen
2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-08-14 23:12   ` Junio C Hamano
2022-08-15  6:34     ` Torsten Bögershausen
2022-08-18 21:00       ` Junio C Hamano
2022-08-27  8:50 ` [PATCH v2 " tboegi
2022-08-27  8:54   ` Torsten Bögershausen
2022-08-27  9:50     ` Eric Sunshine
2022-08-29 12:04   ` Johannes Schindelin
2022-08-29 17:54     ` Torsten Bögershausen
2022-08-29 18:37       ` Junio C Hamano
2022-09-02  9:47       ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-02  9:39   ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-02 10:12   ` Johannes Schindelin
2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-05 20:46   ` Junio C Hamano
2022-09-07  4:30     ` Torsten Bögershausen
2022-09-07 18:31       ` Junio C Hamano
2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-05 10:13   ` Johannes Schindelin
2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-09-14 16:40   ` Junio C Hamano
2022-09-26 18:43     ` Torsten Bögershausen
2022-10-10 21:58       ` Junio C Hamano
2022-10-20 15:46         ` Torsten Bögershausen
2022-10-20 17:43           ` Junio C Hamano
2022-10-21 15:19             ` Torsten Bögershausen
2022-10-21 21:59               ` Junio C Hamano
2022-10-23 20:02                 ` Torsten Bögershausen
2022-09-15  2:57   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).