RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
@ 2021-08-13 14:22 Randall S. Becker
  2021-08-13 15:00 ` Junio C Hamano
  2021-08-13 16:06 ` Jeff King
  0 siblings, 2 replies; 17+ messages in thread
From: Randall S. Becker @ 2021-08-13 14:22 UTC (permalink / raw)
  To: 'Junio C Hamano', git

On August 11, 2021 6:31 PM, I wrote:
>To: 'Junio C Hamano' <gitster@pobox.com>; git@vger.kernel.org
>Subject: RE: [ANNOUNCE] Git v2.33.0-rc1 (Build/Test Report)
>
>On August 6, 2021 8:07 PM, Junio C Hamano wrote:
>>
>>A release candidate Git v2.33.0-rc1 is now available for testing at the
>>usual places.  It is comprised of 396 non-merge commits since v2.32.0, contributed by 63 people, 19 of which are new faces [*].
>
>Just a report that t0301.9 hung again on 2.32.0-rc1 on the NonStop ia64 platform, and t5563.8 hung on the x86 platform. These did not
>hang at rc0 or in 2.31.0. We have seen (and reported) these before and they do appear transient. I have been meaning to dig deeper into
>both of these tests to figure out why and what the conditions are, but there are many moving parts on the platform (it is MPP). It probably
>would take some brainstorming - I suspect there is something relating to timing in the message system that is causing this heartburn, but
>the features under test do not appear to be commonly used.

2.32.0-rc2 passes on the NonStop x86 platform. ia64 is slower and still running. The test failure is t9001 because there is no sendmail on the platform. Is there a suitable dependency that I can use add to bypass this test?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-13 14:22 [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report) Randall S. Becker
@ 2021-08-13 15:00 ` Junio C Hamano
  2021-08-13 16:06 ` Jeff King
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2021-08-13 15:00 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: git

"Randall S. Becker" <rsbecker@nexbridge.com> writes:

> 2.32.0-rc2 passes on the NonStop x86 platform. ia64 is slower and
> still running. The test failure is t9001 because there is no
> sendmail on the platform. Is there a suitable dependency that I
> can use add to bypass this test?

I do not think of any offhand.  I usually use GIT_SKIP_TESTS
environment variable for things like this.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-13 14:22 [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report) Randall S. Becker
  2021-08-13 15:00 ` Junio C Hamano
@ 2021-08-13 16:06 ` Jeff King
  2021-08-16 18:08   ` Randall S. Becker
  2021-08-16 18:31   ` Randall S. Becker
  1 sibling, 2 replies; 17+ messages in thread
From: Jeff King @ 2021-08-13 16:06 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Junio C Hamano', git

On Fri, Aug 13, 2021 at 10:22:19AM -0400, Randall S. Becker wrote:

> >Just a report that t0301.9 hung again on 2.32.0-rc1 on the NonStop
> >ia64 platform, and t5563.8 hung on the x86 platform. These did not

We don't seem to have a t5563. Do you mean t5562?

> 2.32.0-rc2 passes on the NonStop x86 platform. ia64 is slower and
> still running. The test failure is t9001 because there is no sendmail
> on the platform. Is there a suitable dependency that I can use add to
> bypass this test?

Hmm, we shouldn't be depending on platform sendmail for the tests (after
all, we do not want to actually send mail!). What does the failure look
like?

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-13 16:06 ` Jeff King
@ 2021-08-16 18:08   ` Randall S. Becker
  2021-08-16 18:35     ` Jeff King
  2021-08-16 18:31   ` Randall S. Becker
  1 sibling, 1 reply; 17+ messages in thread
From: Randall S. Becker @ 2021-08-16 18:08 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 13, 2021 12:07 PM, Jeff King wrote:
>On Fri, Aug 13, 2021 at 10:22:19AM -0400, Randall S. Becker wrote:
>
>> >Just a report that t0301.9 hung again on 2.32.0-rc1 on the NonStop
>> >ia64 platform, and t5563.8 hung on the x86 platform. These did not
>
>We don't seem to have a t5563. Do you mean t5562?
Yes.

>
>> 2.32.0-rc2 passes on the NonStop x86 platform. ia64 is slower and
>> still running. The test failure is t9001 because there is no sendmail
>> on the platform. Is there a suitable dependency that I can use add to
>> bypass this test?
>
>Hmm, we shouldn't be depending on platform sendmail for the tests (after all, we do not want to actually send mail!). What does the
>failure look like?

This looks strange. There is no signal 34 defined by the platform.

test_must_fail: died by signal 34: git send-email --from=Example <nobody@example.com> --to=nobody@example.com --smtp-server=/home/git/git/t/trash directory.t9001-send-email/fake.sendmail --transfer-encoding=8bit 0001-Second.patch longline.patch
not ok 33 - reject long lines
#
#               z8=zzzzzzzz &&
#               z64=$z8$z8$z8$z8$z8$z8$z8$z8 &&
#               z512=$z64$z64$z64$z64$z64$z64$z64$z64 &&
#               clean_fake_sendmail &&
#               cp $patches longline.patch &&
#               cat >>longline.patch <<-EOF &&
#               $z512$z512
#               not a long line
#               $z512$z512
#               EOF
#               test_must_fail git send-email \
#                       --from="Example <nobody@example.com>" \
#                       --to=nobody@example.com \
#                       --smtp-server="$(pwd)/fake.sendmail" \
#                       --transfer-encoding=8bit \
#                       $patches longline.patch \
#                       2>actual &&
#               cat >expect <<-\EOF &&
#               fatal: longline.patch:35 is longer than 998 characters
#               warning: no patches were sent
#               EOF
#               test_cmp expect actual
#
/


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-13 16:06 ` Jeff King
  2021-08-16 18:08   ` Randall S. Becker
@ 2021-08-16 18:31   ` Randall S. Becker
  2021-08-16 18:36     ` Jeff King
  1 sibling, 1 reply; 17+ messages in thread
From: Randall S. Becker @ 2021-08-16 18:31 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 16, 2021 2:09 PM, I wrote:
>To: 'Jeff King' <peff@peff.net>
>Cc: 'Junio C Hamano' <gitster@pobox.com>; 'git@vger.kernel.org' <git@vger.kernel.org>
>Subject: RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
>
>On August 13, 2021 12:07 PM, Jeff King wrote:
>>On Fri, Aug 13, 2021 at 10:22:19AM -0400, Randall S. Becker wrote:
>>
>>> >Just a report that t0301.9 hung again on 2.32.0-rc1 on the NonStop
>>> >ia64 platform, and t5563.8 hung on the x86 platform. These did not
>>
>>We don't seem to have a t5563. Do you mean t5562?
>Yes.
>
>>
>>> 2.32.0-rc2 passes on the NonStop x86 platform. ia64 is slower and
>>> still running. The test failure is t9001 because there is no sendmail
>>> on the platform. Is there a suitable dependency that I can use add to
>>> bypass this test?
>>
>>Hmm, we shouldn't be depending on platform sendmail for the tests
>>(after all, we do not want to actually send mail!). What does the failure look like?
>
>This looks strange. There is no signal 34 defined by the platform.
>
>test_must_fail: died by signal 34: git send-email --from=Example <nobody@example.com> --to=nobody@example.com --smtp-
>server=/home/git/git/t/trash directory.t9001-send-email/fake.sendmail --transfer-encoding=8bit 0001-Second.patch longline.patch not ok
>33 - reject long lines #
>#               z8=zzzzzzzz &&
>#               z64=$z8$z8$z8$z8$z8$z8$z8$z8 &&
>#               z512=$z64$z64$z64$z64$z64$z64$z64$z64 &&
>#               clean_fake_sendmail &&
>#               cp $patches longline.patch &&
>#               cat >>longline.patch <<-EOF &&
>#               $z512$z512
>#               not a long line
>#               $z512$z512
>#               EOF
>#               test_must_fail git send-email \
>#                       --from="Example <nobody@example.com>" \
>#                       --to=nobody@example.com \
>#                       --smtp-server="$(pwd)/fake.sendmail" \
>#                       --transfer-encoding=8bit \
>#                       $patches longline.patch \
>#                       2>actual &&
>#               cat >expect <<-\EOF &&
>#               fatal: longline.patch:35 is longer than 998 characters
>#               warning: no patches were sent
>#               EOF
>#               test_cmp expect actual
>#
>/

I should point out that all 6 failures in t9001 have the same characteristic - signal 34.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 18:08   ` Randall S. Becker
@ 2021-08-16 18:35     ` Jeff King
  2021-08-16 18:54       ` Randall S. Becker
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2021-08-16 18:35 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 02:08:38PM -0400, Randall S. Becker wrote:

> On August 13, 2021 12:07 PM, Jeff King wrote:
> >On Fri, Aug 13, 2021 at 10:22:19AM -0400, Randall S. Becker wrote:
> >
> >> >Just a report that t0301.9 hung again on 2.32.0-rc1 on the NonStop
> >> >ia64 platform, and t5563.8 hung on the x86 platform. These did not
> >
> >We don't seem to have a t5563. Do you mean t5562?
> Yes.

I've seen issues with t5562.8, as well. It doesn't _fail_, but I
occasionally notice that it takes a very long time to complete. If I
instrument the test suite like this:

diff --git a/t/test-lib.sh b/t/test-lib.sh
index abcfbed6d6..45e6062e3a 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -989,10 +989,18 @@ test_run_ () {
 		trace=$trace_tmp
 	fi

+	start=$(date +%s)
 	setup_malloc_check
 	test_eval_ "$1"
 	eval_ret=$?
 	teardown_malloc_check
+	end=$(date +%s)
+
+	len=$((end - start))
+	if test $len -gt 5; then
+		echo >&2 "took $len seconds to finish; failing"
+		eval_ret=1
+	fi

 	if test -z "$immediate" || test $eval_ret = 0 ||
 	   test -n "$expecting_failure" && test "$test_cleanup" != ":"

And then run:

  ./t5562-http-backend-content-length.sh --stress

I get many successful completions, but eventually one fails t5562.8
with:

  took 60 seconds to finish; failing

That 60 seconds is the timeout from t5562/invoke-with-content-length.

So one, are you sure it's hanging forever, and not just for 60 seconds?
And two, it is quite obvious there's some racing here. I'm not sure if
this is indicative of a problem in the test suite, or in http-backend
itself (in which case it could be affecting real users).

> This looks strange. There is no signal 34 defined by the platform.
> 
> test_must_fail: died by signal 34: git send-email --from=Example <nobody@example.com> --to=nobody@example.com --smtp-server=/home/git/git/t/trash directory.t9001-send-email/fake.sendmail --transfer-encoding=8bit 0001-Second.patch longline.patch
> not ok 33 - reject long lines
> [...]

On my system 34 is SIGRTMIN, so I agree that seems unlikely. Keep in
mind, though, that we are getting the exit code from the shell, so we
interpret codes between 129 and 192 as 128+signo, which is the shell's
convention.

If the program actually called exit(162), we'd see that as "signal
34".  Likewise for exit(-94), since the result is cast to unsigned.

Now why send-email would do that, I have no idea. The expected error
from this test (the "... is longer than 998" message) should come from
calling perl's "die", which usually exits with code 255. You can double
check with:

  perl -e 'die "foo"'
  echo $?

but I'd be surprised if it varies between systems (and also I expect
other tests sometimes call die, too).

Beyond that, I expect your best bet is to instrument send-email itself
to print more information about how it's exiting here (or possibly use
your platform's equivalent of strace, if it has one).

-Peff

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 18:31   ` Randall S. Becker
@ 2021-08-16 18:36     ` Jeff King
  2021-08-16 20:43       ` Randall S. Becker
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2021-08-16 18:36 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 02:31:34PM -0400, Randall S. Becker wrote:

> >33 - reject long lines #
> >#               z8=zzzzzzzz &&
> >#               z64=$z8$z8$z8$z8$z8$z8$z8$z8 &&
> >#               z512=$z64$z64$z64$z64$z64$z64$z64$z64 &&
> >#               clean_fake_sendmail &&
> >#               cp $patches longline.patch &&
> >#               cat >>longline.patch <<-EOF &&
> >#               $z512$z512
> >#               not a long line
> >#               $z512$z512
> >#               EOF
> >#               test_must_fail git send-email \
> >#                       --from="Example <nobody@example.com>" \
> >#                       --to=nobody@example.com \
> >#                       --smtp-server="$(pwd)/fake.sendmail" \
> >#                       --transfer-encoding=8bit \
> >#                       $patches longline.patch \
> >#                       2>actual &&
> >#               cat >expect <<-\EOF &&
> >#               fatal: longline.patch:35 is longer than 998 characters
> >#               warning: no patches were sent
> >#               EOF
> >#               test_cmp expect actual
> >#
> >/
> 
> I should point out that all 6 failures in t9001 have the same characteristic - signal 34.

Oh. Then the notion from my other mail of "if it's die(), then other
tests would presumably see similar failures" might be true. ;)

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 18:35     ` Jeff King
@ 2021-08-16 18:54       ` Randall S. Becker
  2021-08-16 21:55         ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Randall S. Becker @ 2021-08-16 18:54 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 16, 2021 2:35 PM, Jeff King wrote:
>On Mon, Aug 16, 2021 at 02:08:38PM -0400, Randall S. Becker wrote:
>
>> On August 13, 2021 12:07 PM, Jeff King wrote:
>> >On Fri, Aug 13, 2021 at 10:22:19AM -0400, Randall S. Becker wrote:
>> >
>> >> >Just a report that t0301.9 hung again on 2.32.0-rc1 on the NonStop
>> >> >ia64 platform, and t5563.8 hung on the x86 platform. These did not
>> >
>> >We don't seem to have a t5563. Do you mean t5562?
>> Yes.
>
>I've seen issues with t5562.8, as well. It doesn't _fail_, but I occasionally notice that it takes a very long time to complete. If I instrument
>the test suite like this:
>
>diff --git a/t/test-lib.sh b/t/test-lib.sh index abcfbed6d6..45e6062e3a 100644
>--- a/t/test-lib.sh
>+++ b/t/test-lib.sh
>@@ -989,10 +989,18 @@ test_run_ () {
> 		trace=$trace_tmp
> 	fi
>
>+	start=$(date +%s)
> 	setup_malloc_check
> 	test_eval_ "$1"
> 	eval_ret=$?
> 	teardown_malloc_check
>+	end=$(date +%s)
>+
>+	len=$((end - start))
>+	if test $len -gt 5; then
>+		echo >&2 "took $len seconds to finish; failing"
>+		eval_ret=1
>+	fi
>
> 	if test -z "$immediate" || test $eval_ret = 0 ||
> 	   test -n "$expecting_failure" && test "$test_cleanup" != ":"
>
>And then run:
>
>  ./t5562-http-backend-content-length.sh --stress
>
>I get many successful completions, but eventually one fails t5562.8
>with:
>
>  took 60 seconds to finish; failing
>
>That 60 seconds is the timeout from t5562/invoke-with-content-length.
>
>So one, are you sure it's hanging forever, and not just for 60 seconds?

Absolutely sure. 48 hours because I forgot to check.

>And two, it is quite obvious there's some racing here. I'm not sure if this is indicative of a problem in the test suite, or in http-backend
>itself (in which case it could be affecting real users).

How can I help track this down?

>> This looks strange. There is no signal 34 defined by the platform.
>>
>> test_must_fail: died by signal 34: git send-email --from=Example
>> <nobody@example.com> --to=nobody@example.com
>> --smtp-server=/home/git/git/t/trash
>> directory.t9001-send-email/fake.sendmail --transfer-encoding=8bit
>> 0001-Second.patch longline.patch not ok 33 - reject long lines [...]
>
>On my system 34 is SIGRTMIN, so I agree that seems unlikely. Keep in mind, though, that we are getting the exit code from the shell, so
>we interpret codes between 129 and 192 as 128+signo, which is the shell's convention.
>
>If the program actually called exit(162), we'd see that as "signal 34".  Likewise for exit(-94), since the result is cast to unsigned.
>
>Now why send-email would do that, I have no idea. The expected error from this test (the "... is longer than 998" message) should come
>from calling perl's "die", which usually exits with code 255. You can double check with:
>
>  perl -e 'die "foo"'
>  echo $?
>
>but I'd be surprised if it varies between systems (and also I expect other tests sometimes call die, too).
>
>Beyond that, I expect your best bet is to instrument send-email itself to print more information about how it's exiting here (or possibly use
>your platform's equivalent of strace, if it has one).

We're getting exit code 255, not 162 in this test. t9001 is the only test where this is happening. The only other test that occasionally hangs is t0301, but not any consistency on which subtest although 6 or 8 is more often than not the offender. Interestingly, perl is not in ${prefix}. It comes from /usr/bin/perl (5.26.3), but PERL_PATH is specified explicitly in config.mak.uname as /usr/bin/perl, so:

use lib (split(/:/, $ENV{GITPERLLIB} || '/usr/local-ssl1.1/share/perl5'));

may or may not be an issue. There is another quirk on the platform. The first output message to stdout determines the encoding. If wchar is used first, and then something else uses char for output, the program behaves strangely - and output disappears randomly; the reverse is also true - and this does apply across a fork. Sadly, my perl is not expert level.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 18:36     ` Jeff King
@ 2021-08-16 20:43       ` Randall S. Becker
  2021-08-16 21:02         ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Randall S. Becker @ 2021-08-16 20:43 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 16, 2021 2:36 PM, Jeff King wrote:
>To: Randall S. Becker <rsbecker@nexbridge.com>
>Cc: 'Junio C Hamano' <gitster@pobox.com>; git@vger.kernel.org
>Subject: Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
>
>On Mon, Aug 16, 2021 at 02:31:34PM -0400, Randall S. Becker wrote:
>
>> >33 - reject long lines #
>> >#               z8=zzzzzzzz &&
>> >#               z64=$z8$z8$z8$z8$z8$z8$z8$z8 &&
>> >#               z512=$z64$z64$z64$z64$z64$z64$z64$z64 &&
>> >#               clean_fake_sendmail &&
>> >#               cp $patches longline.patch &&
>> >#               cat >>longline.patch <<-EOF &&
>> >#               $z512$z512
>> >#               not a long line
>> >#               $z512$z512
>> >#               EOF
>> >#               test_must_fail git send-email \
>> >#                       --from="Example <nobody@example.com>" \
>> >#                       --to=nobody@example.com \
>> >#                       --smtp-server="$(pwd)/fake.sendmail" \
>> >#                       --transfer-encoding=8bit \
>> >#                       $patches longline.patch \
>> >#                       2>actual &&
>> >#               cat >expect <<-\EOF &&
>> >#               fatal: longline.patch:35 is longer than 998 characters
>> >#               warning: no patches were sent
>> >#               EOF
>> >#               test_cmp expect actual
>> >#
>> >/
>>
>> I should point out that all 6 failures in t9001 have the same characteristic - signal 34.
>
>Oh. Then the notion from my other mail of "if it's die(), then other tests would presumably see similar failures" might be true. ;)

When running 

/home/git/git/t/trash directory.t9001-send-email: git send-email --from="Example <nobody@example.com>" --to=nobody@example.com --smtp-server="/home/git/git/t/trash directory.t9001-send-email/fake.sendmail" --transfer-encoding=8bit 0001-Second.patch longline.patch
fatal: longline.patch:35 is longer than 998 characters
warning: no patches were sent
/home/git/git/t/trash directory.t9001-send-email: echo $?
162

So this is strange. Where is perl run? I'd like to catch the completion inside git.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 20:43       ` Randall S. Becker
@ 2021-08-16 21:02         ` Jeff King
  2021-08-16 21:54           ` Randall S. Becker
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2021-08-16 21:02 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 04:43:25PM -0400, Randall S. Becker wrote:

> >Oh. Then the notion from my other mail of "if it's die(), then other tests would presumably see similar failures" might be true. ;)
> 
> When running 
> 
> /home/git/git/t/trash directory.t9001-send-email: git send-email --from="Example <nobody@example.com>" --to=nobody@example.com --smtp-server="/home/git/git/t/trash directory.t9001-send-email/fake.sendmail" --transfer-encoding=8bit 0001-Second.patch longline.patch
> fatal: longline.patch:35 is longer than 998 characters
> warning: no patches were sent
> /home/git/git/t/trash directory.t9001-send-email: echo $?
> 162

Well, that's a promising start to finding the source. :)

> So this is strange. Where is perl run? I'd like to catch the completion inside git.

This will all go through execv_dashed_external() in git.c. So we should
just be exiting with the status code we got from the child via wait().

You could try:

  - running it as git-send-email (with a dash), which will exec the perl
    script directly, rather than going through the main git binary

  - instrumenting run-command.c:wait_or_whine() to see how it interprets
    the value. If perl really is returning 255, then perhaps your
    platform's WIFSIGNALED() is confused by that.

If the problem does only show when going through the git binary, then I
suspect:

  git -c alias.foo='!perl -e die' foo

may be an easier reproduction.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 21:02         ` Jeff King
@ 2021-08-16 21:54           ` Randall S. Becker
  2021-08-16 22:22             ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Randall S. Becker @ 2021-08-16 21:54 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 16, 2021 5:02 PM, Jeff King wrote:
>To: Randall S. Becker <rsbecker@nexbridge.com>
>Cc: 'Junio C Hamano' <gitster@pobox.com>; git@vger.kernel.org
>Subject: Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
>
>On Mon, Aug 16, 2021 at 04:43:25PM -0400, Randall S. Becker wrote:
>
>> >Oh. Then the notion from my other mail of "if it's die(), then other
>> >tests would presumably see similar failures" might be true. ;)
>>
>> When running
>>
>> /home/git/git/t/trash directory.t9001-send-email: git send-email
>> --from="Example <nobody@example.com>" --to=nobody@example.com
>> --smtp-server="/home/git/git/t/trash
>> directory.t9001-send-email/fake.sendmail" --transfer-encoding=8bit
>> 0001-Second.patch longline.patch
>> fatal: longline.patch:35 is longer than 998 characters
>> warning: no patches were sent
>> /home/git/git/t/trash directory.t9001-send-email: echo $?
>> 162
>
>Well, that's a promising start to finding the source. :)
>
>> So this is strange. Where is perl run? I'd like to catch the completion inside git.
>
>This will all go through execv_dashed_external() in git.c. So we should just be exiting with the status code we got from the child via wait().
>
>You could try:
>
>  - running it as git-send-email (with a dash), which will exec the perl
>    script directly, rather than going through the main git binary
>
>  - instrumenting run-command.c:wait_or_whine() to see how it interprets
>    the value. If perl really is returning 255, then perhaps your
>    platform's WIFSIGNALED() is confused by that.
>
>If the problem does only show when going through the git binary, then I
>suspect:
>
>  git -c alias.foo='!perl -e die' foo
>
>may be an easier reproduction.

Running git-send-email reports completion 162. The code variable is optimized out but looks like it also is 162 when returning. The WIFEXITED(status) code did not appear to execute, although I think that also was optimized out. finish_command ret is 162. So perl looks like it is completing with a bad completion code. This percolates up to git, which also reports the same value.

I went to the perl maintainer on this subject. What I got back was that die is not guaranteed to return a specific value other than 0 for success and non-zero for failure. There are platforms where the return might known and has meaning but that is not portable. According to the current official perl documentation:

"die raises an exception. Inside an eval the exception is stuffed into $@ and the eval is terminated with the undefined value. If the exception is outside of all enclosing evals, then the uncaught exception is printed to STDERR and perl exits with an exit code indicating failure. If you need to exit the process with a specific exit code, see exit."

So assuming that a signal occurred because the value is between 129 and 192 is not correct in the case of perl. Could we do something like test_expect_perl_die that does not perform the signal check at line 980 in test-lib-functions.sh so just checks 0 vs. non-zero, which would be semantically correct no matter what the platform? Alternatively, and possibly better, the die could be caught and then exit() called in git-send-email, as in:

eval { die "Something bad happened" };
exit(255) if $@;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 18:54       ` Randall S. Becker
@ 2021-08-16 21:55         ` Jeff King
  2021-08-16 21:59           ` Jeff King
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2021-08-16 21:55 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: Max Kirillov, 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 02:54:14PM -0400, Randall S. Becker wrote:

> >That 60 seconds is the timeout from t5562/invoke-with-content-length.
> >
> >So one, are you sure it's hanging forever, and not just for 60 seconds?
> 
> Absolutely sure. 48 hours because I forgot to check.
> 
> >And two, it is quite obvious there's some racing here. I'm not sure if this is indicative of a problem in the test suite, or in http-backend
> >itself (in which case it could be affecting real users).
> 
> How can I help track this down?

Here's what I found out so far. For my 60-second lag case, the test
_does_ complete as expected; it just takes a long time. So I think what
happens is this:

  - the invoke-with-content-length script sets up a SIGCLD handler

  - then it kicks off http-backend and writes to it

  - then it sleeps for 60 seconds, assuming that SIGCLD will interrupt
    the sleep

  - after the sleep finishes (whether by 60 seconds or because it was
    interrupted by the signal), we check a flag to see if our SIGCLD
    handler was called. If not, then we complain.

This usually completes instantaneously-ish, because the signal
interrupts our sleep. But very occasionally the child process dies
_before_ we hit the sleep, so we don't realize it.

So ideally we'd have some way of atomically checking our flag and then
sleeping only if it's not set. But I don't think that exists. The
closest we can come is using a series of smaller sleeps and checks. And
indeed, digging in the archive shows that Max already proposed such a
patch:

  https://lore.kernel.org/git/20190218205028.32486-1-max@max630.net/

It looks like it feel through the cracks, though. Maybe now is a good
time to resurrect it.

However, you are in that thread, too, and it didn't help your situation.
So I think your race is somehow different. It looks like there was some
weirdness around close() for you, though generally we _shouldn't_ be
hitting that close() at all, because we'd have gotten SIGCLD and set the
$exited flag in the interim.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 21:55         ` Jeff King
@ 2021-08-16 21:59           ` Jeff King
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2021-08-16 21:59 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: Max Kirillov, 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 05:55:01PM -0400, Jeff King wrote:

> So ideally we'd have some way of atomically checking our flag and then
> sleeping only if it's not set. But I don't think that exists. The
> closest we can come is using a series of smaller sleeps and checks. And
> indeed, digging in the archive shows that Max already proposed such a
> patch:
> 
>   https://lore.kernel.org/git/20190218205028.32486-1-max@max630.net/
> 
> It looks like it feel through the cracks, though. Maybe now is a good
> time to resurrect it.

And here it is for convenience. I think this is worth doing, as it
avoids occasional 60-second hangs in the test suite (and I confirmed
that with the "fail if it takes more than 5 seconds" hack I showed
earlier).

-- >8 --
From: Max Kirillov <max@max630.net>
Subject: [PATCH] t5562: chunked sleep to avoid lost SIGCHILD

If was found during stress-test run that a test may hang by 60 seconds.
It supposedly happens because SIGCHILD was received before sleep has
started.

Fix by looping by smaller chunks, checking $exited after each of them.
Then lost SIGCHILD would not cause longer delay than 1 second.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Max Kirillov <max@max630.net>
Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5562/invoke-with-content-length.pl | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
index 0943474af2..257e280e3b 100644
--- a/t/t5562/invoke-with-content-length.pl
+++ b/t/t5562/invoke-with-content-length.pl
@@ -29,7 +29,12 @@
 }
 print $out $body_data or die "Cannot write data: $!";
 
-sleep 60; # is interrupted by SIGCHLD
+my $counter = 0;
+while (not $exited and $counter < 60) {
+        sleep 1;
+        $counter = $counter + 1;
+}
+
 if (!$exited) {
         close($out);
         die "Command did not exit after reading whole body";
-- 
2.33.0.rc2.497.g375df73092


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 21:54           ` Randall S. Becker
@ 2021-08-16 22:22             ` Jeff King
  2021-08-16 22:29               ` Jeff King
  2021-08-17 14:29               ` Randall S. Becker
  0 siblings, 2 replies; 17+ messages in thread
From: Jeff King @ 2021-08-16 22:22 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 05:54:44PM -0400, Randall S. Becker wrote:

> Running git-send-email reports completion 162. The code variable is
> optimized out but looks like it also is 162 when returning. The
> WIFEXITED(status) code did not appear to execute, although I think
> that also was optimized out. finish_command ret is 162. So perl looks
> like it is completing with a bad completion code. This percolates up
> to git, which also reports the same value.

OK, at least that absolves git.c. :)

> I went to the perl maintainer on this subject. What I got back was
> that die is not guaranteed to return a specific value other than 0 for
> success and non-zero for failure. There are platforms where the return
> might known and has meaning but that is not portable. According to the
> current official perl documentation:
> 
> "die raises an exception. Inside an eval the exception is stuffed into
> $@ and the eval is terminated with the undefined value. If the
> exception is outside of all enclosing evals, then the uncaught
> exception is printed to STDERR and perl exits with an exit code
> indicating failure. If you need to exit the process with a specific
> exit code, see exit."

Ouch. I mean, sure, if you need a specific code, I get that die is not a
good tool. But getting arbitrary values seems kind of weird and
unfriendly. The perldoc for die does say it gives you $! (errno), or $?
(the last child exit value) if appropriate. So it's not completely
arbitrary, but I think your errno value may just be unlucky.

> So assuming that a signal occurred because the value is between 129
> and 192 is not correct in the case of perl. Could we do something like
> test_expect_perl_die that does not perform the signal check at line
> 980 in test-lib-functions.sh so just checks 0 vs. non-zero, which
> would be semantically correct no matter what the platform?
> Alternatively, and possibly better, the die could be caught and then
> exit() called in git-send-email, as in:
> 
> eval { die "Something bad happened" };
> exit(255) if $@;

Yeah, I think we are better to get a consistent exit code from perl.
There are a few options here:

 - wrapping in an eval works, as you showed above. It's a little awkward
   to wrap the whole script, though.

 - there's $SIG{__DIE__}, but the manpage warns against using it. You
   can use it something like this:

     sub catch_top {
       CORE::die @_ if $^S; # in an eval; use regular die
       CORE::die @_ if !defined $^S; in perl's parser
       print STDERR "@_\n";
       exit 255; # or whatever we want
     }
     $SIG{__DIE__} = \&catch_top;

 - you can hook die() like this:

     BEGIN { *CORE::GLOBAL::die = \&my_die; }

   but I expect would still need to check that you're not in an eval, as
   above.

  - The SIG{__DIE__} docs mention using an END{} block, but I'm not sure
    how you determine if we hit a die or not (at that point, $@ won't
    actually be set).

I've used the catch_top() thing before and it does work (it's just ugly
that you have to deal with $^S).

I guess yet another alternative is that we could avoid using perl's
die() in favor of our own custom-named function. It seems like that may
confuse folks who come later, though.

-Peff

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 22:22             ` Jeff King
@ 2021-08-16 22:29               ` Jeff King
  2021-08-17 14:30                 ` Randall S. Becker
  2021-08-17 14:29               ` Randall S. Becker
  1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2021-08-16 22:29 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: 'Junio C Hamano', git

On Mon, Aug 16, 2021 at 06:22:59PM -0400, Jeff King wrote:

> Yeah, I think we are better to get a consistent exit code from perl.
> There are a few options here:

So concretely, the patch below works for me (my tests are not failing,
but with some instrumenting, I see that the case in question is exiting
with code 25 before this patch, and 255 after. So it really is just that
my errno values are lower than yours).

It's ugly, and I kind of wonder if we'd want to do it for every script
to get consistent exit codes. But it does work.

diff --git a/git-send-email.perl b/git-send-email.perl
index e65d969d0b..c82336c2e8 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -35,8 +35,15 @@ sub readline {
 	my $self = shift;
 	die "Cannot use readline on FakeTerm: $$self";
 }
+
 package main;
 
+$SIG{__DIE__} = sub {
+	CORE::die @_ if $^S; # in an eval; use regular die
+	CORE::die @_ if !defined $^S; # in perl's parser
+	print STDERR "fatal: @_\n";
+	exit 255;
+};
 
 sub usage {
 	print <<EOT;

-Peff

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 22:22             ` Jeff King
  2021-08-16 22:29               ` Jeff King
@ 2021-08-17 14:29               ` Randall S. Becker
  1 sibling, 0 replies; 17+ messages in thread
From: Randall S. Becker @ 2021-08-17 14:29 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 16, 2021 6:23 PM, Jeff King wrote:
>On Mon, Aug 16, 2021 at 05:54:44PM -0400, Randall S. Becker wrote:
>
>> Running git-send-email reports completion 162. The code variable is
>> optimized out but looks like it also is 162 when returning. The
>> WIFEXITED(status) code did not appear to execute, although I think
>> that also was optimized out. finish_command ret is 162. So perl looks
>> like it is completing with a bad completion code. This percolates up
>> to git, which also reports the same value.
>
>OK, at least that absolves git.c. :)

Too right 😊

>> I went to the perl maintainer on this subject. What I got back was
>> that die is not guaranteed to return a specific value other than 0 for
>> success and non-zero for failure. There are platforms where the return
>> might known and has meaning but that is not portable. According to the
>> current official perl documentation:
>>
>> "die raises an exception. Inside an eval the exception is stuffed into
>> $@ and the eval is terminated with the undefined value. If the
>> exception is outside of all enclosing evals, then the uncaught
>> exception is printed to STDERR and perl exits with an exit code
>> indicating failure. If you need to exit the process with a specific
>> exit code, see exit."
>
>Ouch. I mean, sure, if you need a specific code, I get that die is not a good tool. But getting arbitrary values seems kind of weird and
>unfriendly. The perldoc for die does say it gives you $! (errno), or $?
>(the last child exit value) if appropriate. So it's not completely arbitrary, but I think your errno value may just be unlucky.

Very unlucky.
>
>> So assuming that a signal occurred because the value is between 129
>> and 192 is not correct in the case of perl. Could we do something like
>> test_expect_perl_die that does not perform the signal check at line
>> 980 in test-lib-functions.sh so just checks 0 vs. non-zero, which
>> would be semantically correct no matter what the platform?
>> Alternatively, and possibly better, the die could be caught and then
>> exit() called in git-send-email, as in:
>>
>> eval { die "Something bad happened" };
>> exit(255) if $@;
>
>Yeah, I think we are better to get a consistent exit code from perl.
>There are a few options here:
>
> - wrapping in an eval works, as you showed above. It's a little awkward
>   to wrap the whole script, though.
>
> - there's $SIG{__DIE__}, but the manpage warns against using it. You
>   can use it something like this:
>
>     sub catch_top {
>       CORE::die @_ if $^S; # in an eval; use regular die
>       CORE::die @_ if !defined $^S; in perl's parser
>       print STDERR "@_\n";
>       exit 255; # or whatever we want
>     }
>     $SIG{__DIE__} = \&catch_top;
>
> - you can hook die() like this:
>
>     BEGIN { *CORE::GLOBAL::die = \&my_die; }
>
>   but I expect would still need to check that you're not in an eval, as
>   above.
>
>  - The SIG{__DIE__} docs mention using an END{} block, but I'm not sure
>    how you determine if we hit a die or not (at that point, $@ won't
>    actually be set).
>
>I've used the catch_top() thing before and it does work (it's just ugly that you have to deal with $^S).
>
>I guess yet another alternative is that we could avoid using perl's
>die() in favor of our own custom-named function. It seems like that may confuse folks who come later, though.

I have tried to arrange a more consistent error code from the perl team. There are too many platform mods, and getting this particular change done won't happen - I have already been told something like "Sure, but in an RFE. We'll evaluate it.", which will mean 18 months minimum on x86 if approved (unlikely) and won't happen on the ia64 platform - ever. I think this is ultimately resulting from what is on the stack when the exception is thrown inside perl that is different on the platform. It did change - it was actually 169 two years ago with the same effect. Changing it will mean a custom mod to base perl - not likely going to happen.

What it we left the perl code as is and went with the following, since this seems to happen only above 128:

# Similar to test_must_fail, but just considers pass/fail on completion codes
test_perl_must_die () {
        case "$1" in
        ok=*)
                _test_ok=${1#ok=}
                shift
                ;;
        *)
                _test_ok=
                ;;
        esac
        if ! test_must_fail_acceptable "$@"
        then
                echo >&7 " test_perl_must_die: only 'git' is allowed: $*"
                return 1
        fi
        "$@" 2>&7
        exit_code=$?
        if test $exit_code -eq 0 && ! list_contains "$_test_ok" success
        then
                echo >&4 " test_perl_must_die: command succeeded: $*"
                return 1
        elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe
        then
                return 0
        elif test $exit_code -eq 127
        then
                echo >&4 " test_perl_must_die: command not found: $*"
                return 1
        elif test $exit_code -eq 126
        then
                echo >&4 " test_perl_must_die: valgrind error: $*"
                return 1
        fi
        return 0
} 7>&2 2>&4

Then change t9001 to refer to this function instead of test_must_fail.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
  2021-08-16 22:29               ` Jeff King
@ 2021-08-17 14:30                 ` Randall S. Becker
  0 siblings, 0 replies; 17+ messages in thread
From: Randall S. Becker @ 2021-08-17 14:30 UTC (permalink / raw)
  To: 'Jeff King'; +Cc: 'Junio C Hamano', git

On August 16, 2021 6:30 PM, Jeff King wrote:
>To: Randall S. Becker <rsbecker@nexbridge.com>
>Cc: 'Junio C Hamano' <gitster@pobox.com>; git@vger.kernel.org
>Subject: Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)
>
>On Mon, Aug 16, 2021 at 06:22:59PM -0400, Jeff King wrote:
>
>> Yeah, I think we are better to get a consistent exit code from perl.
>> There are a few options here:
>
>So concretely, the patch below works for me (my tests are not failing, but with some instrumenting, I see that the case in question is
>exiting with code 25 before this patch, and 255 after. So it really is just that my errno values are lower than yours).
>
>It's ugly, and I kind of wonder if we'd want to do it for every script to get consistent exit codes. But it does work.
>
>diff --git a/git-send-email.perl b/git-send-email.perl index e65d969d0b..c82336c2e8 100755
>--- a/git-send-email.perl
>+++ b/git-send-email.perl
>@@ -35,8 +35,15 @@ sub readline {
> 	my $self = shift;
> 	die "Cannot use readline on FakeTerm: $$self";  }
>+
> package main;
>
>+$SIG{__DIE__} = sub {
>+	CORE::die @_ if $^S; # in an eval; use regular die
>+	CORE::die @_ if !defined $^S; # in perl's parser
>+	print STDERR "fatal: @_\n";
>+	exit 255;
>+};
>
> sub usage {
> 	print <<EOT;

Not as ugly as my suggestion (previous email).
-Randall


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-08-17 14:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-13 14:22 [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report) Randall S. Becker
2021-08-13 15:00 ` Junio C Hamano
2021-08-13 16:06 ` Jeff King
2021-08-16 18:08   ` Randall S. Becker
2021-08-16 18:35     ` Jeff King
2021-08-16 18:54       ` Randall S. Becker
2021-08-16 21:55         ` Jeff King
2021-08-16 21:59           ` Jeff King
2021-08-16 18:31   ` Randall S. Becker
2021-08-16 18:36     ` Jeff King
2021-08-16 20:43       ` Randall S. Becker
2021-08-16 21:02         ` Jeff King
2021-08-16 21:54           ` Randall S. Becker
2021-08-16 22:22             ` Jeff King
2021-08-16 22:29               ` Jeff King
2021-08-17 14:30                 ` Randall S. Becker
2021-08-17 14:29               ` Randall S. Becker

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).