Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "Torsten Bögershausen" <tboegi@web.de>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: git@vger.kernel.org, alexander.s.m@gmail.com
Subject: Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()
Date: Mon, 29 Aug 2022 19:54:25 +0200	[thread overview]
Message-ID: <20220829175425.cmbwtqpxrq4ppnnk@tb-raspi4> (raw)
In-Reply-To: <0q921n79-sr17-2794-83r0-r59rnqq03pp2@tzk.qr>

On Mon, Aug 29, 2022 at 02:04:42PM +0200, Johannes Schindelin wrote:
> Hi Torsten,
> >
> > The choosen solution is to split code in diff.c like this
> >
> > strbuf_addf(&out, "%-*s", len, name);
> >
> > into something like this:
> >
> > size_t num_padding_spaces = 0;
> > // [snip]
> > if (len > utf8_strwidth(name))
> >     num_padding_spaces = len - utf8_strwidth(name);
> > strbuf_addf(&out, "%s", name);
> > if (num_padding_spaces)
> >     strbuf_addchars(&out, ' ', num_padding_spaces);
>
> ... this sounds like it would benefit from beinv refactored into a
> separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
> readability as well as for self-documentation.

Yes, but:
All (tm) strbuf() functions use an unsigned size_t, and are not
tolerant against passing 0 as "do nothing".
A nicer solution (for this patch) could be a change like this:
Instead of

void strbuf_addchars(struct strbuf *sb, int c, size_t n)
{
        strbuf_grow(sb, n);
	memset(sb->buf + sb->len, c, n);
	strbuf_setlen(sb, sb->len + n);
}

We would find:
void strbuf_addchars(struct strbuf *sb, int c, ssize_t n)
{
        if (n <= 0)
	       return;
        strbuf_grow(sb, (size_t)n);
	memset(sb->buf + sb->len, c, (size_t)n);
	strbuf_setlen(sb, sb->len + (size_t)n);
}

I couldn't convince myself to do so.
Since it is mainly diff.c that needs this adjustment/padding of strings,
I coulnd't convince myself to write another function in strbuf.c


>
> Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
> _twice_ and why we do not assign the result to a variable called `width`
> and then have a conditional like
>
> 	if (width < len) /* pad to `len` columns */
> 		strbuf_addchars(&out, ' ' , len - width);
>
> instead. That would sound more logical to me.

This is caused by the logic in diff.c:
  /*
   * Find the longest filename and max number of changes
   */
   for (i = 0; (i < count) && (i < data->nr); i++) {
       struct diffstat_file *file = data->files[i];
       [snip]
       len = utf8_strwidth(file->print_name);
       if (max_width < len)
          max_width = len;
// and later
    /*
     * From here name_width is the width of the name area,
     * and graph_width is the width of the graph area.
     * max_change is used to scale graph properly.
     */
    for (i = 0; i < count; i++) {
    /*
     * "scale" the filename
     */
     // TB: Which means either shortening it with ...
     // Or padding it, if needed, and here we need
     // another
     name_len = utf8_strwidth(name);

>
> Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
> so different from changing `strbuf_addf(...)`, I would prefer to see them
> split into two patches.

Hm, that is a possiblity. Seems to ease the burden for reviewers.

>
> >
> > Tests:
> > Two things need to be tested:
> > - The calculation of the maximum width
> > - The calculation of num_padding_spaces
> >
> > The name "textfile" is changed into "textfilë", both have a width of 8.
> > If strlen() was used, to get the maximum width, the shorter "binfile" would
> > have been mis-aligned:
> >  binfile   |  [snip]
> >  textfilë | [snip]
> >
> > If only "binfile" would be renamed into "binfilë":
> >  binfilë |  [snip]
> >  textfile | [snip]
> >
> > In order to verify that the width is calculated correctly everywhere,
> > "binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
> > "textfile" is renamed into "textfilë", 1 byte more in strlen(),
> > and the updated t4012-diff-binary.sh checks the correct aligment:
> >  binfïlë  | [snip]
> >  textfilë | [snip]
>
> I wonder whether you can change only _one_ name and still verify the
> correctness. When you make two changes at the same time, it is always
> possible for one change to "cancel out" the other one, and therefore it is
> harder to reason about the correctness of your patch.

Nee, I have a hard time to see how a +/- 1 can "cancel out" a +/- 2.
But I may improve the commit message, to make that more clear.

>
> Better keep it simple and change only one instance (personally,
> I would have changed two letters in the longer one).

That is certainly doable.


>
> >
> > Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> > Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> > ---
> >  diff.c                 | 37 +++++++++++++++++++++++--------------
> >  t/t4012-diff-binary.sh | 14 +++++++-------
> >  2 files changed, 30 insertions(+), 21 deletions(-)
> >
> > diff --git a/diff.c b/diff.c
> > index 974626a621..cf38e1dc88 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
> >  static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  {
> >  	int i, len, add, del, adds = 0, dels = 0;
> > -	uintmax_t max_change = 0, max_len = 0;
> > +	uintmax_t max_change = 0, max_width = 0;
>
> Why rename `max_len`, but not `len`? I would have expected (and agreed
> with seeing) `len` to be renamed to `width`, too.

That is a valid point.
There is, however, already a variable called "width".
And renaming this one into a new one as well ?

>
> >  	int total_files = data->nr, count;
> >  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
> >  	const char *reset, *add_c, *del_c;
> > @@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  			continue;
> >  		}
> >  		fill_print_name(file);
> > -		len = strlen(file->print_name);
> > -		if (max_len < len)
> > -			max_len = len;
> > +		len = utf8_strwidth(file->print_name);
> > +		if (max_width < len)
> > +			max_width = len;
> >
> >  		if (file->is_unmerged) {
> >  			/* "Unmerged" is 8 characters */
> > @@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >
> >  	/*
> >  	 * We have width = stat_width or term_columns() columns total.
> > -	 * We want a maximum of min(max_len, stat_name_width) for the name part.
> > +	 * We want a maximum of min(max_width, stat_name_width) for the name part.
> >  	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
> >  	 * We also need 1 for " " and 4 + decimal_width(max_change)
> >  	 * for " | NNNN " and one the empty column at the end, altogether
> > @@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  		graph_width = options->stat_graph_width;
> >
> >  	name_width = (options->stat_name_width > 0 &&
> > -		      options->stat_name_width < max_len) ?
> > -		options->stat_name_width : max_len;
> > +		      options->stat_name_width < max_width) ?
> > +		options->stat_name_width : max_width;
>
> It is a bit sad that the diff lines regarding the renamed variable drown
> out the actual change (`strlen()` -> `utf8_strwidth()`). But the end
> result is nicer.
>
> Thank you for working on this!
> Dscho

Thanks so much for the review - let's see if I can make a better patch
the next days (better say weeks)

next prev parent reply	other threads:[~2022-08-29 17:54 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
2022-08-09 18:20 ` Calvin Wan
2022-08-09 19:03   ` Alexander Meshcheryakov
2022-08-09 21:36     ` Calvin Wan
2022-08-10  5:55   ` Junio C Hamano
2022-08-10  8:40     ` Torsten Bögershausen
2022-08-10  8:56       ` Alexander Meshcheryakov
2022-08-10  9:51         ` Torsten Bögershausen
2022-08-10 11:41           ` Torsten Bögershausen
2022-08-10 15:53       ` Junio C Hamano
2022-08-10 17:35         ` Torsten Bögershausen
2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-08-14 23:12   ` Junio C Hamano
2022-08-15  6:34     ` Torsten Bögershausen
2022-08-18 21:00       ` Junio C Hamano
2022-08-27  8:50 ` [PATCH v2 " tboegi
2022-08-27  8:54   ` Torsten Bögershausen
2022-08-27  9:50     ` Eric Sunshine
2022-08-29 12:04   ` Johannes Schindelin
2022-08-29 17:54     ` Torsten Bögershausen [this message]
2022-08-29 18:37       ` Junio C Hamano
2022-09-02  9:47       ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-02  9:39   ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-02 10:12   ` Johannes Schindelin
2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-05 20:46   ` Junio C Hamano
2022-09-07  4:30     ` Torsten Bögershausen
2022-09-07 18:31       ` Junio C Hamano
2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-05 10:13   ` Johannes Schindelin
2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-09-14 16:40   ` Junio C Hamano
2022-09-26 18:43     ` Torsten Bögershausen
2022-10-10 21:58       ` Junio C Hamano
2022-10-20 15:46         ` Torsten Bögershausen
2022-10-20 17:43           ` Junio C Hamano
2022-10-21 15:19             ` Torsten Bögershausen
2022-10-21 21:59               ` Junio C Hamano
2022-10-23 20:02                 ` Torsten Bögershausen
2022-09-15  2:57   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220829175425.cmbwtqpxrq4ppnnk@tb-raspi4 \
    --to=tboegi@web.de \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=alexander.s.m@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).