git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: tboegi@web.de
Cc: git@vger.kernel.org, alexander.s.m@gmail.com, Johannes.Schindelin@gmx.de
Subject: Re: [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth()
Date: Wed, 14 Sep 2022 09:40:04 -0700	[thread overview]
Message-ID: <xmqqpmfx52qj.fsf@gitster.g> (raw)
In-Reply-To: <20220914151333.3309-1-tboegi@web.de> (tboegi@web.de's message of "Wed, 14 Sep 2022 17:13:33 +0200")

> The basic idea is to change code in diff.c like this
> strbuf_addf(&out, "%-*s", len, name);
>
> into something like this:
> int padding = len - utf8_strwidth(name);
> if (padding < 0)
> 	padding = 0;
> strbuf_addf(&out, " %s%*s", name, padding, "");
> ...
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c                 | 27 ++++++++++++++++-----------
>  t/t4012-diff-binary.sh | 14 +++++++-------
>  2 files changed, 23 insertions(+), 18 deletions(-)

> diff --git a/diff.c b/diff.c
> index 974626a621..35b9da90fe 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;

The changes in this patch are isolated to the show_stats() helper
function, and looking at use of "len", "max_len", and "name_len", it
may be a good clean-up to make them based on "width".  A bit of care
needs to be taken because the way existing variables are used is a
bit convoluted at times:

 - "width" already exists.  "len" and "max_len" are used in an early
   loop to eventually derive "name_width".

 - "len" is later used in the loop for each pathname to hold a copy
   of "name_width" that can locally be adjusted to accomodate "..."
   abbreviation/munging of the pathname.

 - "name_width" already exists in addition to "name_len".  The
   former holds how many display columns a pathname can occupy in the
   diffstat output, while the latter is used in a loop to hold the
   display columns of the pathname each iteration is looking at, to
   see if it is wider than "name_width" (in which case there is the
   "..." abbreviation that is NOT UTF-8 aware even after this patch)
   or narrower (in which case we'd do the padding).  As the existing
   "name_width" is how we want to name our variables (i.e. the width
   allocated for names), the "name_len", if we were to follow "len
   misleads us to think it is byte length, so use width instead",
   would need to become something like "this_name_width" (i.e. the
   width of the name of the pathname in this iteration of the loop).

But I am OK to do WITHOUT any such renaming, and I do not want to
see such renaming in the same patch ("preliminary clean-up" or
"clean-up after the dust settles" are good, thoguh).  Counting
display columns correctly is more important.

I think I spotted two remaining "bugs" that are left unfixed with
this patch..

There is "stat_width is -1 (auto)" case, which reads like so:

	if (options->stat_width == -1)
		width = term_columns() - strlen(line_prefix);
	else
		width = options->stat_width ? options->stat_width : 80;

Here line_prefix eventually comes from the "git log --graph" and
shows the colored graph segments on the same output line as the
diffstat.

This patch is probably not making anything worse, but by leaving it
strlen(), it is likely overcounting the width of it.  We can
presumably use utf8_strnwidth() that can optionally be told to be
aware of the ANSI color sequence to count its width correctly to fix
it.

> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";

The code around here between this and the next hunk needs cleaning up.

		if (name_width < name_len) {
			char *slash;
			prefix = "...";
			len -= 3;
			name += name_len - len;
			slash = strchr(name, '/');
			if (slash)
				name = slash;
		}

We found the display columns of the current item "name_len" is wider
than what we allocated "name_width" for the names.  We are going to
chomp as many pathname components from the front as needed, at '/'
boundary, to turn "aaaa/bbbb/cccc/dddd.txt" into ".../cccc/dddd.txt"
to make the result fit.

But the way to ensure that '/' before "cccc" is the one we want (as
opposed to the one before "bbbb" or "dddd") is initially based on
columns (i.e. because we want "...", we first subtract 3 from len
which is a local synonym for name_width and then subtract that from
"name_len", i.e. we ask: "how many display columns do we have in the
current pathname that is excess of what we can afford to allocate?"
The intention is to skip that many columns from the beginning of "name"
and start looking for '/' from there.

But we move "name" pointer by that many *bytes*!  We end up scanning
starting at a middle of a character.  What we look for is '/' and when
we find it we know the byte is a standalone character, so we do not
chomp a character in the middle, but it is very likely that we find
a slash that leaves the remaining string still too long, because
skipping say 2 columns may need skipping 4 bytes, but we only
skipped the same number of bytes as the number of columns we need to
skip.

This is the other remaining bug.

I think this needs to become a loop that loops while the width of
the current suffix is still wider than we can afford, discarding one
leading pathname component at a time at '/', measuring the resulting
width, or something like that.  Something along the lines of this
not-even-compile-tested sketch:

        /* we assume strlen(prefix) == utf8_strwidth(prefix) */
	while (name_width < utf8_strwidth(name) + strlen(prefix)) {
		char *slash;
		if (name[0] == '/')
			name++;
                slash = strchr(name);
		if (slash)
			name = slash;
		else
			break; /* Give Up */
		prefix = "...";
	}

> @@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> +		padding = len - utf8_strwidth(name);
> +		if (padding < 0)
> +			padding = 0;

Here "len" cannot become "name_length" because the former has to be
narrower than the latter by the display width of "prefix"
(i.e. "..."), so while this looks "strange", it is correct.

I think the remainder of the patch I did not quote looked quite
straight-forward and correct.

Thanks for working on this topic.

  reply	other threads:[~2022-09-14 16:40 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
2022-08-09 18:20 ` Calvin Wan
2022-08-09 19:03   ` Alexander Meshcheryakov
2022-08-09 21:36     ` Calvin Wan
2022-08-10  5:55   ` Junio C Hamano
2022-08-10  8:40     ` Torsten Bögershausen
2022-08-10  8:56       ` Alexander Meshcheryakov
2022-08-10  9:51         ` Torsten Bögershausen
2022-08-10 11:41           ` Torsten Bögershausen
2022-08-10 15:53       ` Junio C Hamano
2022-08-10 17:35         ` Torsten Bögershausen
2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-08-14 23:12   ` Junio C Hamano
2022-08-15  6:34     ` Torsten Bögershausen
2022-08-18 21:00       ` Junio C Hamano
2022-08-27  8:50 ` [PATCH v2 " tboegi
2022-08-27  8:54   ` Torsten Bögershausen
2022-08-27  9:50     ` Eric Sunshine
2022-08-29 12:04   ` Johannes Schindelin
2022-08-29 17:54     ` Torsten Bögershausen
2022-08-29 18:37       ` Junio C Hamano
2022-09-02  9:47       ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-02  9:39   ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-02 10:12   ` Johannes Schindelin
2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-05 20:46   ` Junio C Hamano
2022-09-07  4:30     ` Torsten Bögershausen
2022-09-07 18:31       ` Junio C Hamano
2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-05 10:13   ` Johannes Schindelin
2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-09-14 16:40   ` Junio C Hamano [this message]
2022-09-26 18:43     ` Torsten Bögershausen
2022-10-10 21:58       ` Junio C Hamano
2022-10-20 15:46         ` Torsten Bögershausen
2022-10-20 17:43           ` Junio C Hamano
2022-10-21 15:19             ` Torsten Bögershausen
2022-10-21 21:59               ` Junio C Hamano
2022-10-23 20:02                 ` Torsten Bögershausen
2022-09-15  2:57   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqpmfx52qj.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=alexander.s.m@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).