[PATCH] t4205: don't rely on en_US.UTF-8 locale existing

* [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
@ 2013-07-03 20:18 John Keeping
  2013-07-03 20:40 ` Alexey Shumkin
  2013-07-03 21:41 ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: John Keeping @ 2013-07-03 20:18 UTC
  To: git; +Cc: Alexey Shumkin, Junio C Hamano, John Keeping

My system doesn't have the en_US.UTF-8 locale (or plain en_US), which
causes t4205 to fail by counting bytes instead of UTF-8 codepoints.

Instead of using sed for this, use Perl which behaves predictably
whatever locale is in use.

Signed-off-by: John Keeping <john@keeping.me.uk>
---
This patch is on top of 'as/log-output-encoding-in-user-format'.

 t/t4205-log-pretty-formats.sh | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/t/t4205-log-pretty-formats.sh b/t/t4205-log-pretty-formats.sh
index 3cfb744..5864f5b 100755
--- a/t/t4205-log-pretty-formats.sh
+++ b/t/t4205-log-pretty-formats.sh
@@ -20,9 +20,7 @@ commit_msg () {
 		# cut string, replace cut part with two dots
 		# $2 - chars count from the beginning of the string
 		# $3 - "trailing" chars
-		# LC_ALL is set to make `sed` interpret "." as a UTF-8 char not a byte
-		# as it does with C locale
-		msg=$(echo $msg | LC_ALL=en_US.UTF-8 sed -e "s/^\(.\{$2\}\)$3/\1../")
+		msg=$(echo $msg | "$PERL_PATH" -CIO -pe "s/^(.{$2})$3/\1../")
 	fi
 	echo $msg
 }
@@ -205,7 +203,7 @@ test_expect_success 'left alignment formatting with ltrunc' "
 ..sage two
 ..sage one
 add bar  Z
-$(commit_msg "" "0" ".\{11\}")
+$(commit_msg "" "0" ".{11}")
 EOF
 	test_cmp expected actual
 "
@@ -218,7 +216,7 @@ test_expect_success 'left alignment formatting with mtrunc' "
 mess.. two
 mess.. one
 add bar  Z
-$(commit_msg "" "4" ".\{11\}")
+$(commit_msg "" "4" ".{11}")
 EOF
 	test_cmp expected actual
 "
-- 
1.8.3.1.747.g77f7d3a


* Re: [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
  2013-07-03 20:18 [PATCH] t4205: don't rely on en_US.UTF-8 locale existing John Keeping
@ 2013-07-03 20:40 ` Alexey Shumkin
  2013-07-03 20:47   ` Alexey Shumkin
  2013-07-03 21:41 ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: Alexey Shumkin @ 2013-07-03 20:40 UTC
  To: John Keeping; +Cc: git, Junio C Hamano, Johannes Sixt

CCing Johannes Sixt on this.



* Re: [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
  2013-07-03 20:40 ` Alexey Shumkin
@ 2013-07-03 20:47   ` Alexey Shumkin
  0 siblings, 0 replies; 7+ messages in thread
From: Alexey Shumkin @ 2013-07-03 20:47 UTC
  To: John Keeping; +Cc: git, Junio C Hamano, Johannes Sixt

http://thread.gmane.org/gmane.comp.version-control.git/229291

The thread above is why I CCed Johannes Sixt.
> CC this to Johannes Sixt


* Re: [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
  2013-07-03 20:18 [PATCH] t4205: don't rely on en_US.UTF-8 locale existing John Keeping
  2013-07-03 20:40 ` Alexey Shumkin
@ 2013-07-03 21:41 ` Junio C Hamano
  2013-07-03 21:53   ` John Keeping
  2013-07-03 22:25   ` Alexey Shumkin
  1 sibling, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2013-07-03 21:41 UTC
  To: John Keeping; +Cc: git, Alexey Shumkin

John Keeping <john@keeping.me.uk> writes:

> My system doesn't have the en_US.UTF-8 locale (or plain en_US), which
> causes t4205 to fail by counting bytes instead of UTF-8 codepoints.
>
> Instead of using sed for this, use Perl which behaves predictably
> whatever locale is in use.
>
> Signed-off-by: John Keeping <john@keeping.me.uk>
> ---
> This patch is on top of 'as/log-output-encoding-in-user-format'.

Thanks.  I think Alexey is going to send incremental updates to the
topic so I won't interfere by applying this patch on top of the
version I have in my tree.

But I do agree that using Perl may be a workable solution.

An alternative might be not to use this cryptic 3-arg form of
commit_msg at all.  They are used only for these three:

	$(commit_msg "" "8" "..*$")
	$(commit_msg "" "0" ".\{11\}")
	$(commit_msg "" "4" ".\{11\}")

I find them hard to read when trying to figure out what is going
on.

Just using three variables to hold what is expected would be far
more portable and readable.

# "anfänglich" whatever it means.
sample_utf8_part=$(printf "anf\303\244ng")

commit_msg () {
	msg="initial. ${sample_utf8_part}lich"
	if test -n "$1"
	then
		echo "$msg" | iconv -f utf-8 -t "$1"
	else
		echo "$msg"
	fi
}

And then, instead of writing this in the expected test output:

	$(commit_msg "" "8" "..*$")
	$(commit_msg "" "0" ".\{11\}")
	$(commit_msg "" "4" ".\{11\}")

we can just say

	initial...
	..an${sample_utf8_part}lich
	init..lich

It is no worse than those cryptic 0, 4, 8 and 11 magic numbers we
see in the test, no?
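
As a rough illustration of the idea above, the expected lines can be spelled out from a variable with no tool counting characters at all. This is only a sketch under stated assumptions: utf8_part and the expected_* names are hypothetical, and utf8_part holds just "fäng", a narrower split than the "anf\303\244ng" shown above, chosen here so that every expected line can reuse it.

```shell
#!/bin/sh
# Hypothetical variable: "fäng" in UTF-8, written with octal escapes so the
# script itself stays pure ASCII and does not depend on the locale.
utf8_part=$(printf 'f\303\244ng')

# Full sample message: "initial. anfänglich" (19 characters, 20 bytes).
msg="initial. an${utf8_part}lich"

# Expected lines written out directly; nothing here counts characters.
expected_trunc="initial..."
expected_ltrunc="..${utf8_part}lich"
expected_mtrunc="init..lich"

printf '%s\n' "$msg" "$expected_trunc" "$expected_ltrunc" "$expected_mtrunc"
```

For this message the three strings correspond to the three commit_msg calls quoted above: 8 kept characters followed by "..", 11 characters cut from the front, and 11 characters cut after the first 4.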


* Re: [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
  2013-07-03 21:41 ` Junio C Hamano
@ 2013-07-03 21:53   ` John Keeping
  2013-07-03 22:27     ` Alexey Shumkin
  2013-07-03 22:25   ` Alexey Shumkin
  1 sibling, 1 reply; 7+ messages in thread
From: John Keeping @ 2013-07-03 21:53 UTC
  To: Junio C Hamano; +Cc: git, Alexey Shumkin

On Wed, Jul 03, 2013 at 02:41:06PM -0700, Junio C Hamano wrote:
> John Keeping <john@keeping.me.uk> writes:
> 
> > My system doesn't have the en_US.UTF-8 locale (or plain en_US), which
> > causes t4205 to fail by counting bytes instead of UTF-8 codepoints.
> >
> > Instead of using sed for this, use Perl which behaves predictably
> > whatever locale is in use.
> >
> > Signed-off-by: John Keeping <john@keeping.me.uk>
> > ---
> > This patch is on top of 'as/log-output-encoding-in-user-format'.
> 
> Thanks.  I think Alexey is going to send incremental updates to the
> topic so I won't interfere by applying this patch on top of the
> version I have in my tree.
> 
> But I do agree that using Perl may be a workable solution.
> 
> An alternative might be not to use this cryptic 3-arg form of
> commit_msg at all.  They are used only for these three:
> 
> 	$(commit_msg "" "8" "..*$")
> 	$(commit_msg "" "0" ".\{11\}")
> 	$(commit_msg "" "4" ".\{11\}")
> 
> I somehow find them simply not readable, in order to figure out what
> is going on.
> 
> Just using three variables to hold what are expected would be far
> more portable and readable.
> 
> # "anfänglich" whatever it means.
> sample_utf8_part=$(printf "anf\303\244ng")
> 
> commit_msg () {
> 	msg="initial. ${sample_utf8_part}lich";
> 	if test -n "$1"
> 	then
> 		echo "$msg" | iconv -f utf-8 -t "$1"
> 	else
> 		echo "$msg"
>         fi
> }
> 
> And then instead of writing in the expected test output.
> 
> 	$(commit_msg "" "8" "..*$")
> 	$(commit_msg "" "0" ".\{11\}")
> 	$(commit_msg "" "4" ".\{11\}")
> 
> we can just say
> 
> 	initial...
>         ..an${sample_utf8_part}lich
> 	init..lich
> 
> It is no worse than those cryptic 0, 4, 8 and 11 magic numbers we
> see in the test, no?

That's probably better since we don't need to rely on some other tool
getting it right.

Alexey, will you incorporate this change in your incremental updates?


* Re: [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
  2013-07-03 21:41 ` Junio C Hamano
  2013-07-03 21:53   ` John Keeping
@ 2013-07-03 22:25   ` Alexey Shumkin
  1 sibling, 0 replies; 7+ messages in thread
From: Alexey Shumkin @ 2013-07-03 22:25 UTC
  To: Junio C Hamano; +Cc: John Keeping, git, Johannes Sixt

On Wed, Jul 03, 2013 at 02:41:06PM -0700, Junio C Hamano wrote:
> John Keeping <john@keeping.me.uk> writes:
> 
> > My system doesn't have the en_US.UTF-8 locale (or plain en_US), which
> > causes t4205 to fail by counting bytes instead of UTF-8 codepoints.
> >
> > Instead of using sed for this, use Perl which behaves predictably
> > whatever locale is in use.
> >
> > Signed-off-by: John Keeping <john@keeping.me.uk>
> > ---
> > This patch is on top of 'as/log-output-encoding-in-user-format'.
> 
> Thanks.  I think Alexey is going to send incremental updates to the
> topic so I won't interfere by applying this patch on top of the
> version I have in my tree.
> 
> But I do agree that using Perl may be a workable solution.
> 
> An alternative might be not to use this cryptic 3-arg form of
> commit_msg at all.  They are used only for these three:
> 
> 	$(commit_msg "" "8" "..*$")
> 	$(commit_msg "" "0" ".\{11\}")
> 	$(commit_msg "" "4" ".\{11\}")
> 
> I somehow find them simply not readable, in order to figure out what
> is going on.
> 
> Just using three variables to hold what are expected would be far
> more portable and readable.
> 
> # "anfänglich" whatever it means.
> sample_utf8_part=$(printf "anf\303\244ng")
> 
> commit_msg () {
> 	msg="initial. ${sample_utf8_part}lich";
> 	if test -n "$1"
> 	then
> 		echo "$msg" | iconv -f utf-8 -t "$1"
> 	else
> 		echo "$msg"
>         fi
> }
> 
> And then instead of writing in the expected test output.
> 
> 	$(commit_msg "" "8" "..*$")
> 	$(commit_msg "" "0" ".\{11\}")
> 	$(commit_msg "" "4" ".\{11\}")
> 
> we can just say
> 
> 	initial...
>         ..an${sample_utf8_part}lich
> 	init..lich
> 
> It is no worse than those cryptic 0, 4, 8 and 11 magic numbers we
> see in the test, no?
Yep!
While thinking about Johannes's suggestions, I eventually came to a
decision similar to yours.


* Re: [PATCH] t4205: don't rely on en_US.UTF-8 locale existing
  2013-07-03 21:53   ` John Keeping
@ 2013-07-03 22:27     ` Alexey Shumkin
  0 siblings, 0 replies; 7+ messages in thread
From: Alexey Shumkin @ 2013-07-03 22:27 UTC
  To: John Keeping; +Cc: Junio C Hamano, git, Johannes Sixt

On Wed, Jul 03, 2013 at 10:53:03PM +0100, John Keeping wrote:
> On Wed, Jul 03, 2013 at 02:41:06PM -0700, Junio C Hamano wrote:
> > John Keeping <john@keeping.me.uk> writes:
> > 
> > > My system doesn't have the en_US.UTF-8 locale (or plain en_US), which
> > > causes t4205 to fail by counting bytes instead of UTF-8 codepoints.
> > >
> > > Instead of using sed for this, use Perl which behaves predictably
> > > whatever locale is in use.
> > >
> > > Signed-off-by: John Keeping <john@keeping.me.uk>
> > > ---
> > > This patch is on top of 'as/log-output-encoding-in-user-format'.
> > 
> > Thanks.  I think Alexey is going to send incremental updates to the
> > topic so I won't interfere by applying this patch on top of the
> > version I have in my tree.
> > 
> > But I do agree that using Perl may be a workable solution.
> > 
> > An alternative might be not to use this cryptic 3-arg form of
> > commit_msg at all.  They are used only for these three:
> > 
> > 	$(commit_msg "" "8" "..*$")
> > 	$(commit_msg "" "0" ".\{11\}")
> > 	$(commit_msg "" "4" ".\{11\}")
> > 
> > I somehow find them simply not readable, in order to figure out what
> > is going on.
> > 
> > Just using three variables to hold what are expected would be far
> > more portable and readable.
> > 
> > # "anfänglich" whatever it means.
> > sample_utf8_part=$(printf "anf\303\244ng")
> > 
> > commit_msg () {
> > 	msg="initial. ${sample_utf8_part}lich";
> > 	if test -n "$1"
> > 	then
> > 		echo "$msg" | iconv -f utf-8 -t "$1"
> > 	else
> > 		echo "$msg"
> >         fi
> > }
> > 
> > And then instead of writing in the expected test output.
> > 
> > 	$(commit_msg "" "8" "..*$")
> > 	$(commit_msg "" "0" ".\{11\}")
> > 	$(commit_msg "" "4" ".\{11\}")
> > 
> > we can just say
> > 
> > 	initial...
> >         ..an${sample_utf8_part}lich
> > 	init..lich
> > 
> > It is no worse than those cryptic 0, 4, 8 and 11 magic numbers we
> > see in the test, no?
> 
> That's probably better since we don't need to rely on some other tool
> getting it right.
> 
> Alexey, will you incorporate this change in your incremental updates?
Yes, of course!
Thank you for your additions.

