git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: tboegi@web.de
To: git@vger.kernel.org, alexander.s.m@gmail.com
Cc: "Torsten Bögershausen" <tboegi@web.de>
Subject: [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth()
Date: Sun, 14 Aug 2022 15:35:31 +0200	[thread overview]
Message-ID: <20220814133531.16952-1-tboegi@web.de> (raw)
In-Reply-To: <CA+VDVVVmi99i6ZY64tg8RkVXDc5gOzQP_SH12zhDKRkUnhWFgw@mail.gmail.com>

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

$ git log --stat

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exist for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

And then there is another problem: code like this
strbuf_addf(&out, "%-*s", len, name);

(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count, if name is UTF-8 encoded.

We could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The choosen solution is to split code in diff.c like this

strbuf_addf(&out, "%-*s", len, name);

into 2 calls, like this:

strbuf_addf(&out, "%s", name);
if (len > utf8_strwidth(name))
    strbuf_addchars(&out, ' ', len - utf8_strwidth(name));

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/diff.c b/diff.c
index 974626a621..7fb254c545 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2734,6 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
+		size_t num_padding_spaces = 0;
 		int name_len;

 		if (!file->is_interesting && (added + deleted == 0))
@@ -2743,7 +2744,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
@@ -2753,10 +2754,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		if (len > utf8_strwidth(name))
+			num_padding_spaces = len - utf8_strwidth(name);

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addf(&out, "| %*s", number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2781,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addstr(&out, "| Unmerged\n");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,8 +2810,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
+		strbuf_addf(&out, " %s%s ", prefix,  name);
+		if (num_padding_spaces)
+			strbuf_addchars(&out, ' ', num_padding_spaces);
+		strbuf_addf(&out, "| %*"PRIuMAX"%s",
 			number_width, added + deleted,
 			added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
--
2.34.0


  parent reply	other threads:[~2022-08-14 13:40 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
2022-08-09 18:20 ` Calvin Wan
2022-08-09 19:03   ` Alexander Meshcheryakov
2022-08-09 21:36     ` Calvin Wan
2022-08-10  5:55   ` Junio C Hamano
2022-08-10  8:40     ` Torsten Bögershausen
2022-08-10  8:56       ` Alexander Meshcheryakov
2022-08-10  9:51         ` Torsten Bögershausen
2022-08-10 11:41           ` Torsten Bögershausen
2022-08-10 15:53       ` Junio C Hamano
2022-08-10 17:35         ` Torsten Bögershausen
2022-08-14 13:35 ` tboegi [this message]
2022-08-14 23:12   ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() Junio C Hamano
2022-08-15  6:34     ` Torsten Bögershausen
2022-08-18 21:00       ` Junio C Hamano
2022-08-27  8:50 ` [PATCH v2 " tboegi
2022-08-27  8:54   ` Torsten Bögershausen
2022-08-27  9:50     ` Eric Sunshine
2022-08-29 12:04   ` Johannes Schindelin
2022-08-29 17:54     ` Torsten Bögershausen
2022-08-29 18:37       ` Junio C Hamano
2022-09-02  9:47       ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-02  9:39   ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-02 10:12   ` Johannes Schindelin
2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-05 20:46   ` Junio C Hamano
2022-09-07  4:30     ` Torsten Bögershausen
2022-09-07 18:31       ` Junio C Hamano
2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-05 10:13   ` Johannes Schindelin
2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-09-14 16:40   ` Junio C Hamano
2022-09-26 18:43     ` Torsten Bögershausen
2022-10-10 21:58       ` Junio C Hamano
2022-10-20 15:46         ` Torsten Bögershausen
2022-10-20 17:43           ` Junio C Hamano
2022-10-21 15:19             ` Torsten Bögershausen
2022-10-21 21:59               ` Junio C Hamano
2022-10-23 20:02                 ` Torsten Bögershausen
2022-09-15  2:57   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220814133531.16952-1-tboegi@web.de \
    --to=tboegi@web.de \
    --cc=alexander.s.m@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).