git@vger.kernel.org mailing list mirror (one of many)
* RE: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
@ 2019-02-14 15:04 Randall S. Becker
  2019-02-14 19:56 ` Junio C Hamano
  2019-02-14 21:36 ` Johannes Schindelin
  0 siblings, 2 replies; 13+ messages in thread
From: Randall S. Becker @ 2019-02-14 15:04 UTC (permalink / raw)
  To: 'Junio C Hamano', git; +Cc: szeder.dev, 'Max Kirillov'

On February 13, 2019 22:33, Junio C Hamano wrote:
> A release candidate Git v2.21.0-rc1 is now available for testing at the usual
> places.  It is comprised of 464 non-merge commits since v2.20.0, contributed
> by 60 people, 14 of which are new faces.

We are currently running through a full regression of v2.21.0-rc1 on NonStop. It will take about 30 hours, but preliminary results, relative to the breakages found in rc0, are:

t1308 is fixed.
t1404 is still broken (explainable) - the test scrapes strerror output, which does not match the error text reported on NonStop for EEXIST
t5318 is fixed.
t5403 is fixed.
t5562 still hangs (blocking) - this breaks our CI pipeline since the test hangs and we have no explanation of whether the hang is in git or the tests.

Cheers,
Randall

-- Brief whoami:
 NonStop developer since approximately 211288444200000000
 UNIX developer since approximately 421664400
-- In my real life, I talk too much.






* Re: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-14 15:04 [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results) Randall S. Becker
@ 2019-02-14 19:56 ` Junio C Hamano
  2019-02-14 21:36 ` Johannes Schindelin
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2019-02-14 19:56 UTC (permalink / raw)
  To: Randall S. Becker; +Cc: git, szeder.dev, 'Max Kirillov'

"Randall S. Becker" <rsbecker@nexbridge.com> writes:

> On February 13, 2019 22:33, Junio C Hamano wrote:
>> A release candidate Git v2.21.0-rc1 is now available for testing at the usual
>> places.  It is comprised of 464 non-merge commits since v2.20.0, contributed
>> by 60 people, 14 of which are new faces.
>
> We are currently running through a full regression of v2.21.0-rc1
> on NonStop. It will take about 30 hours, but preliminary results,
> relative to the breakages found in rc0, are:
>
> t1308 is fixed.

Nice.

> t1404 is still broken (explainable) - the test scrapes strerror
> output, which does not match the error text reported on NonStop
> for EEXIST

IIRC, the consensus was to loosen by not matching for the error
message?  Let me take a look later today.

> t5318 is fixed.
> t5403 is fixed.

Good.

> t5562 still hangs (blocking) - this breaks our CI pipeline since
> the test hangs and we have no explanation of whether the hang is
> in git or the tests.



* RE: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-14 15:04 [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results) Randall S. Becker
  2019-02-14 19:56 ` Junio C Hamano
@ 2019-02-14 21:36 ` Johannes Schindelin
  2019-02-14 22:25   ` Randall S. Becker
  2019-02-15 13:02   ` SZEDER Gábor
  1 sibling, 2 replies; 13+ messages in thread
From: Johannes Schindelin @ 2019-02-14 21:36 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: 'Junio C Hamano', git, szeder.dev, 'Max Kirillov'

Hi Randall,

On Thu, 14 Feb 2019, Randall S. Becker wrote:

> t5562 still hangs (blocking) - this breaks our CI pipeline since the
> test hangs and we have no explanation of whether the hang is in git or
> the tests.

I have "good" news: it now also hangs on Ubuntu 16.04 in Azure Pipelines'
Linux agents.

There is a silver lining with that good news, though: I found a
workaround, and it might work for you, too:

	https://github.com/gitgitgadget/git/pull/126

(I also submitted this to the Git mailing list, as I really wanted to tag
Git for Windows' v2.21.0-rc1.windows.1 only with a passing build, and I do
not want to keep that patch to the Windows port only.)

Ciao,
Johannes


* RE: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-14 21:36 ` Johannes Schindelin
@ 2019-02-14 22:25   ` Randall S. Becker
  2019-02-15 13:02   ` SZEDER Gábor
  1 sibling, 0 replies; 13+ messages in thread
From: Randall S. Becker @ 2019-02-14 22:25 UTC (permalink / raw)
  To: 'Johannes Schindelin'
  Cc: 'Junio C Hamano', git, szeder.dev, 'Max Kirillov'

On February 14, 2019 16:37, Johannes Schindelin wrote:
> On Thu, 14 Feb 2019, Randall S. Becker wrote:
> 
> > t5562 still hangs (blocking) - this breaks our CI pipeline since the
> > test hangs and we have no explanation of whether the hang is in git or
> > the tests.
> 
> I have "good" news: it now also hangs on Ubuntu 16.04 in Azure Pipelines'
> Linux agents.
> 
> There is a silver lining with that good news, though: I found a
> workaround, and it might work for you, too:
> 
> 	https://github.com/gitgitgadget/git/pull/126
> 
> (I also submitted this to the Git mailing list, as I really wanted to tag
> Git for Windows' v2.21.0-rc1.windows.1 only with a passing build, and I do
> not want to keep that patch to the Windows port only.)

Thanks for trying. It was a good try, but it did not fix the hang. See my
other response for the stack trace. I tried debugging once it hung, but the
code never returns from the operating system, so I can't get inside. It is
stuck in waitpid on a process that exists; otherwise we would get an error
(EINTR, ECHILD and EFAULT are possible returns). One thing to consider is
that we do not have kernel threads, so if the code assumes them, that is a
problem.

Regards,
Randall



* Re: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-14 21:36 ` Johannes Schindelin
  2019-02-14 22:25   ` Randall S. Becker
@ 2019-02-15 13:02   ` SZEDER Gábor
  2019-02-15 13:49     ` Randall S. Becker
                       ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: SZEDER Gábor @ 2019-02-15 13:02 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Randall S. Becker, 'Junio C Hamano', git,
	'Max Kirillov'

On Thu, Feb 14, 2019 at 10:36:42PM +0100, Johannes Schindelin wrote:
> On Thu, 14 Feb 2019, Randall S. Becker wrote:
> 
> > t5562 still hangs (blocking) - this breaks our CI pipeline since the
> > test hangs and we have no explanation of whether the hang is in git or
> > the tests.
> 
> I have "good" news: it now also hangs on Ubuntu 16.04 in Azure Pipelines'
> Linux agents.

I haven't yet seen that hang in the wild and couldn't reproduce it on
purpose, but there is definitely something fishy with t5562 even on
Linux and even without that perl generate_zero_bytes helper.

  $ git checkout cc95bc2025^
  Previous HEAD position was cc95bc2025 t5562: replace /dev/zero with a pipe from generate_zero_bytes
  HEAD is now at 24b451e77c t5318: replace use of /dev/zero with generate_zero_bytes
  $ make
  <snip>
  $ cd t
  # take note of the shell's PID
  $ echo $$
  15522
  $ ./t5562-http-backend-content-length.sh --stress |tee LOG
  OK    3.0
  OK    1.0
  OK    6.0
  OK    0.0
  <snap>

And then in another terminal run this:

  $ pstree -a -p 15522

or, to make it easier to notice what changed and what stayed the same:

  $ watch -d pstree -a -p 15522

Sooner or later the output will look like this:

  bash,15522
    └─t5562-http-back,21082 ./t5562-http-backend-content-length.sh --stress
        ├─t5562-http-back,21089 ./t5562-http-backend-content-length.sh --stress
        │   └─sh,24906 ./t5562-http-backend-content-length.sh --stress
        ├─t5562-http-back,21090 ./t5562-http-backend-content-length.sh --stress
        │   └─sh,26660 ./t5562-http-backend-content-length.sh --stress
        ├─t5562-http-back,21092 ./t5562-http-backend-content-length.sh --stress
        │   └─sh,4202 ./t5562-http-backend-content-length.sh --stress
        │       └─sh,5696 ./t5562-http-backend-content-length.sh --stress
        │           └─perl,5697 /home/szeder/src/git/t/t5562/invoke-with-content-length.pl push_body.gz.trunc git http-backend
        │               └─(git,5722)
        ├─t5562-http-back,21093 ./t5562-http-backend-content-length.sh --stress
        │   └─sh,25572 ./t5562-http-backend-content-length.sh --stress
  <snip>

It won't show most of the processes run in the tests, because they are
just too fast and short-lived.  However, occasionally it does show a
stuck git process, which is shown as <defunct> in regular 'ps aux'
output:

  szeder   5722  0.0  0.0      0     0 pts/16   Z+   13:36   0:00 [git] <defunct>

Note that this is not a "proper" hang, in the sense that this process
is not stuck forever, but only for about 1 minute, after which it
disappears, and the test continues and eventually finishes with
success.  I've looked into the logs of a couple of such stuck jobs,
and it seems that it varies in which test that git process happened to
get stuck.
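
If pstree or watch are not at hand, the defunct processes can also be
listed straight from /proc.  The following is only a rough sketch, not
something from git's test suite: list_zombies is a made-up helper name,
and it assumes a Linux-style /proc where /proc/PID/stat has the layout
"pid (comm) state ppid ...".

```shell
#!/bin/sh
# Hypothetical helper (not part of git's test suite): print "PID PPID"
# for every zombie process.  Linux-only sketch that reads /proc/PID/stat,
# whose layout is "pid (comm) state ppid ...".
list_zombies () {
	for stat in /proc/[0-9]*/stat
	do
		# Drop everything up to the last ")" so that a comm
		# containing spaces or parentheses cannot confuse us;
		# skip processes that vanished in the meantime.
		line=$(sed 's/^.*) //' "$stat" 2>/dev/null) || continue
		set -- $line
		test "$1" = Z || continue
		pid=${stat#/proc/}
		echo "${pid%/stat} $2"
	done
}
```

Running it in a loop in the second terminal (for example
'while sleep 1; do list_zombies; done') should print a "PID PPID" pair
each time a defunct process is around; for the tree above that would be
the git process and its perl parent.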


 


* RE: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-15 13:02   ` SZEDER Gábor
@ 2019-02-15 13:49     ` Randall S. Becker
  2019-02-15 20:37     ` Max Kirillov
  2019-02-18 20:50     ` [PATCH] t5562: chunked sleep to avoid lost SIGCHILD Max Kirillov
  2 siblings, 0 replies; 13+ messages in thread
From: Randall S. Becker @ 2019-02-15 13:49 UTC (permalink / raw)
  To: 'SZEDER Gábor', 'Johannes Schindelin'
  Cc: 'Junio C Hamano', git, 'Max Kirillov'

On February 15, 2019 8:02, SZEDER Gábor wrote:
> To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> Cc: Randall S. Becker <rsbecker@nexbridge.com>; 'Junio C Hamano'
> <gitster@pobox.com>; git@vger.kernel.org; 'Max Kirillov'
> <max@max630.net>
> Subject: Re: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
> 
> On Thu, Feb 14, 2019 at 10:36:42PM +0100, Johannes Schindelin wrote:
> > On Thu, 14 Feb 2019, Randall S. Becker wrote:
> >
> > > t5562 still hangs (blocking) - this breaks our CI pipeline since the
> > > test hangs and we have no explanation of whether the hang is in git
> > > or the tests.
> >
> > I have "good" news: it now also hangs on Ubuntu 16.04 in Azure Pipelines'
> > Linux agents.
> 
> I haven't yet seen that hang in the wild and couldn't reproduce it on purpose,
> but there is definitely something fishy with t5562 even on Linux and even
> without that perl generate_zero_bytes helper.
> 
>   $ git checkout cc95bc2025^
>   Previous HEAD position was cc95bc2025 t5562: replace /dev/zero with a pipe from generate_zero_bytes
>   HEAD is now at 24b451e77c t5318: replace use of /dev/zero with generate_zero_bytes
>   $ make
>   <snip>
>   $ cd t
>   # take note of the shell's PID
>   $ echo $$
>   15522
>   $ ./t5562-http-backend-content-length.sh --stress |tee LOG
>   OK    3.0
>   OK    1.0
>   OK    6.0
>   OK    0.0
>   <snap>
> 
> And then in another terminal run this:
> 
>   $ pstree -a -p 15522
> 
> or, to make it easier to notice what changed and what stayed the same:
> 
>   $ watch -d pstree -a -p 15522
> 
> Sooner or later the output will look like this:
> 
>   bash,15522
>     └─t5562-http-back,21082 ./t5562-http-backend-content-length.sh --stress
>         ├─t5562-http-back,21089 ./t5562-http-backend-content-length.sh --stress
>         │   └─sh,24906 ./t5562-http-backend-content-length.sh --stress
>         ├─t5562-http-back,21090 ./t5562-http-backend-content-length.sh --stress
>         │   └─sh,26660 ./t5562-http-backend-content-length.sh --stress
>         ├─t5562-http-back,21092 ./t5562-http-backend-content-length.sh --stress
>         │   └─sh,4202 ./t5562-http-backend-content-length.sh --stress
>         │       └─sh,5696 ./t5562-http-backend-content-length.sh --stress
>         │           └─perl,5697 /home/szeder/src/git/t/t5562/invoke-with-content-length.pl push_body.gz.trunc git http-backend
>         │               └─(git,5722)
>         ├─t5562-http-back,21093 ./t5562-http-backend-content-length.sh --stress
>         │   └─sh,25572 ./t5562-http-backend-content-length.sh --stress
>   <snip>
> 
> It won't show most of the processes run in the tests, because they are just
> too fast and short-lived.  However, occasionally it does show a stuck git
> process, which is shown as <defunct> in regular 'ps aux'
> output:
> 
>   szeder   5722  0.0  0.0      0     0 pts/16   Z+   13:36   0:00 [git] <defunct>
> 
> Note that this is not a "proper" hang, in the sense that this process is not
> stuck forever, but only for about 1 minute, after which it disappears, and the
> test continues and eventually finishes with success.  I've looked into the logs
> of a couple of such stuck jobs, and it seems that it varies in which test that git
> process happened to get stuck.

We see something similar. The 60 seconds comes from the sleep in the support script in the t/t5562 directory. If a SIGCHLD is received, the sleep is interrupted and perl terminates (no hang). If the sleep is not interrupted, NonStop hangs in the close() after coming out of sleep, because perl still has output to send somewhere.

Being hung in the close call is perplexing, considering that a close on NonStop in any other product is immediate and rather harsh. However, perl's semantics for close() are: "Closing a pipe also waits for the process executing on the pipe to complete" (from the Perl spec). That seems to apply on NonStop, because git (5722) is reading but not receiving any data and not terminating, based on your tree above. In other words, perl closing the pipe will not cause git (5722) to terminate, because perl is waiting on git (5722) to terminate before completing the close. The only case in which it would not hang is if git (5722) terminates on its own, so that the sleep is interrupted, without going back for more data to read. I am making a semi-educated guess. From my experience with the NS perl team, they are going to point at that spec and say that perl is exhibiting the correct behaviour and that the hang is expected.

Another weird observation is that the test generates up to three hangs (subtests 6, 8 and 13) at worst, and one (subtest 13) at best, depending on some unknown factor that might be system load. This hints at a race condition. Sadly, we don't have the cool watch or pstree utilities shown above on the platform. What we do have is something called ptrace, which can look at the stack, the I/O conditions of all open files, whether there are outstanding I/Os (and how many) on each FD, and memory use.



* Re: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-15 13:02   ` SZEDER Gábor
  2019-02-15 13:49     ` Randall S. Becker
@ 2019-02-15 20:37     ` Max Kirillov
  2019-02-15 21:13       ` Randall S. Becker
  2019-02-18 20:50     ` [PATCH] t5562: chunked sleep to avoid lost SIGCHILD Max Kirillov
  2 siblings, 1 reply; 13+ messages in thread
From: Max Kirillov @ 2019-02-15 20:37 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Johannes Schindelin, Randall S. Becker, 'Junio C Hamano',
	git, 'Max Kirillov'

On Fri, Feb 15, 2019 at 02:02:13PM +0100, SZEDER Gábor wrote:
> I haven't yet seen that hang in the wild and couldn't reproduce it on
> purpose, but there is definitely something fishy with t5562 even on
> Linux and even without that perl generate_zero_bytes helper.
> 
> It won't show most of the processes run in the tests, because they are
> just too fast and short-lived.  However, occasionally it does show a
> stuck git process, which is shown as <defunct> in regular 'ps aux'
> output:
> 
>   szeder   5722  0.0  0.0      0     0 pts/16   Z+   13:36   0:00 [git] <defunct>
> 
> Note that this is not a "proper" hang, in the sense that this process
> is not stuck forever, but only for about 1 minute

This is probably because SIGCHLD arrives before "sleep" starts. I believe this
is unrelated to the hang issue. The hang issue looks like something is wrong
with cleanup_children(), or maybe in the child which it tries to kill and wait
for, not in the tests.

As for this zombie issue, it could be fixed with, for example, a busier wait
like the following. It is somewhat more likely that SIGCHLD arrives before the
first sleep, because there is a bit more to do before it. But the penalty is
only 1 second now, and as it still happens rarely there seems to be no visible
degradation.

--- 8< -----------
diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
index 0943474af2..257e280e3b 100644
--- a/t/t5562/invoke-with-content-length.pl
+++ b/t/t5562/invoke-with-content-length.pl
@@ -29,7 +29,12 @@
 }
 print $out $body_data or die "Cannot write data: $!";
 
-sleep 60; # is interrupted by SIGCHLD
+my $counter = 0;
+while (not $exited and $counter < 60) {
+        sleep 1;
+        $counter = $counter + 1;
+}
+
 if (!$exited) {
         close($out);
         die "Command did not exit after reading whole body";
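
For comparison, the same chunked-wait idea can be sketched in shell.
This is only an illustration under assumptions, not part of the patch:
wait_for_exit is a made-up helper, and it probes the process with
"kill -0" instead of relying on a SIGCHLD handler as the Perl script
does.

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds for process $1 to go away,
# checking once per second instead of sleeping for the whole period.
# Returns 0 if the process is gone in time, 1 if it is still around.
wait_for_exit () {
	pid=$1 limit=$2 counter=0
	# "kill -0" sends no signal; it only reports whether the
	# process can still be signalled.
	while kill -0 "$pid" 2>/dev/null && test "$counter" -lt "$limit"
	do
		sleep 1
		counter=$((counter + 1))
	done
	! kill -0 "$pid" 2>/dev/null
}
```

Because the process is checked again after every one-second step,
noticing its exit late costs at most about a second, which is the same
bound the Perl loop above aims for.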


* RE: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-15 20:37     ` Max Kirillov
@ 2019-02-15 21:13       ` Randall S. Becker
  2019-02-16  8:26         ` Max Kirillov
  0 siblings, 1 reply; 13+ messages in thread
From: Randall S. Becker @ 2019-02-15 21:13 UTC (permalink / raw)
  To: 'Max Kirillov', 'SZEDER Gábor'
  Cc: 'Johannes Schindelin', 'Junio C Hamano', git

On February 15, 2019 15:37, Max Kirillov wrote:
> On Fri, Feb 15, 2019 at 02:02:13PM +0100, SZEDER Gábor wrote:
> > I haven't yet seen that hang in the wild and couldn't reproduce it on
> > purpose, but there is definitely something fishy with t5562 even on
> > Linux and even without that perl generate_zero_bytes helper.
> >
> > It won't show most of the processes run in the tests, because they are
> > just too fast and short-lived.  However, occasionally it does show a
> > stuck git process, which is shown as <defunct> in regular 'ps aux'
> > output:
> >
> >   szeder   5722  0.0  0.0      0     0 pts/16   Z+   13:36   0:00 [git] <defunct>
> >
> > Note that this is not a "proper" hang, in the sense that this process
> > is not stuck forever, but only for about 1 minute
> 
> This is probably because SIGCHLD arrives before "sleep" starts. I believe
> this is unrelated to the hang issue. The hang issue looks like something is
> wrong with cleanup_children(), or maybe in the child which it tries to kill
> and wait for, not in the tests.
> 
> As for this zombie issue, it could be fixed with, for example, a busier
> wait like the following. It is somewhat more likely that SIGCHLD arrives
> before the first sleep, because there is a bit more to do before it. But
> the penalty is only 1 second now, and as it still happens rarely there
> seems to be no visible degradation.
> 
> --- 8< -----------
> diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
> index 0943474af2..257e280e3b 100644
> --- a/t/t5562/invoke-with-content-length.pl
> +++ b/t/t5562/invoke-with-content-length.pl
> @@ -29,7 +29,12 @@
>  }
>  print $out $body_data or die "Cannot write data: $!";
> 
> -sleep 60; # is interrupted by SIGCHLD
> +my $counter = 0;
> +while (not $exited and $counter < 60) {
> +        sleep 1;
> +        $counter = $counter + 1;
> +}
> +
>  if (!$exited) {
>          close($out);
>          die "Command did not exit after reading whole body";

From the trace I found in perl, we have gone past sleep and are hung at 
          close($out);

Commenting out the close() does nothing because perl still hangs on an implied close resulting from the exception thrown by die(). See my other post on adding GIT_TRACE and the changes resulting from that.

Sadly, the fix does not change the results. In fact, it makes the hang far more likely. Subtests 6, 7 and 8 fail here, at close():
  waitpid + 0x130 (SLr)
  $n_EnterPriv + 0x280 (Milli)
  Perl_wait4pid + 0x130 (UCr)
  Perl_my_pclose + 0x4C0 (UCr)
  Perl_io_close + 0x180 (UCr)
  Perl_do_close + 0x620 (UCr)
  Perl_pp_close + 0xA70 (UCr)
  Perl_runops_standard + 0xF0 (UCr)
  S_run_body + 0x870 (UCr)
  perl_run + 0x2D0 (UCr)
  main + 0x3D0 (UCr)





* Re: [ANNOUNCE] Git v2.21.0-rc1 (NonStop Results)
  2019-02-15 21:13       ` Randall S. Becker
@ 2019-02-16  8:26         ` Max Kirillov
  0 siblings, 0 replies; 13+ messages in thread
From: Max Kirillov @ 2019-02-16  8:26 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: 'Max Kirillov', 'SZEDER Gábor',
	'Johannes Schindelin', 'Junio C Hamano', git

On Fri, Feb 15, 2019 at 04:13:15PM -0500, Randall S. Becker wrote:
> Sadly, the fix does not change the results. In fact, it
> makes the hang far more likely. Subtests 6, 7 and 8 fail
> here, at close()

Correct, I did not expect it to help, it was for the other
issue.

As for the hang issue, from your other message it seems to
me that perl is waiting correctly; there really are child
processes which do not exit.

What you could try is
https://public-inbox.org/git/20181124093719.10705-1-max@max630.net/
(I'm not sure it still applies without conflicts); this
would remove the dependency between tests. If it helps, that
would be very valuable information.


* [PATCH] t5562: chunked sleep to avoid lost SIGCHILD
  2019-02-15 13:02   ` SZEDER Gábor
  2019-02-15 13:49     ` Randall S. Becker
  2019-02-15 20:37     ` Max Kirillov
@ 2019-02-18 20:50     ` Max Kirillov
  2019-02-18 20:54       ` Randall S. Becker
  2019-02-19 18:38       ` Junio C Hamano
  2 siblings, 2 replies; 13+ messages in thread
From: Max Kirillov @ 2019-02-18 20:50 UTC (permalink / raw)
  To: SZEDER Gábor, git
  Cc: Max Kirillov, Johannes Schindelin, Randall S. Becker,
	'Junio C Hamano'

It was found during a stress-test run that a test may hang for 60 seconds.
It supposedly happens because SIGCHLD was received before sleep has
started.

Fix by looping in smaller chunks, checking $exited after each of them.
Then a lost SIGCHLD would not cause a delay longer than 1 second.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Max Kirillov <max@max630.net>
---
Submitting as proper patch. Note: I believe it does not relate to other issues
discussed in this thread.
 t/t5562/invoke-with-content-length.pl | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
index 0943474af2..257e280e3b 100644
--- a/t/t5562/invoke-with-content-length.pl
+++ b/t/t5562/invoke-with-content-length.pl
@@ -29,7 +29,12 @@
 }
 print $out $body_data or die "Cannot write data: $!";
 
-sleep 60; # is interrupted by SIGCHLD
+my $counter = 0;
+while (not $exited and $counter < 60) {
+        sleep 1;
+        $counter = $counter + 1;
+}
+
 if (!$exited) {
         close($out);
         die "Command did not exit after reading whole body";
-- 
2.19.0.1202.g68e1e8f04e



* RE: [PATCH] t5562: chunked sleep to avoid lost SIGCHILD
  2019-02-18 20:50     ` [PATCH] t5562: chunked sleep to avoid lost SIGCHILD Max Kirillov
@ 2019-02-18 20:54       ` Randall S. Becker
  2019-02-18 20:59         ` Max Kirillov
  2019-02-19 18:38       ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Randall S. Becker @ 2019-02-18 20:54 UTC (permalink / raw)
  To: 'Max Kirillov', 'SZEDER Gábor', git
  Cc: 'Johannes Schindelin', 'Junio C Hamano'

On February 18, 2019 15:50, Max Kirillov wrote:
> To: SZEDER Gábor <szeder.dev@gmail.com>; git@vger.kernel.org
> Cc: Max Kirillov <max@max630.net>; Johannes Schindelin
> <Johannes.Schindelin@gmx.de>; Randall S. Becker
> <rsbecker@nexbridge.com>; 'Junio C Hamano' <gitster@pobox.com>
> Subject: [PATCH] t5562: chunked sleep to avoid lost SIGCHILD
> 
> It was found during a stress-test run that a test may hang for 60 seconds.
> It supposedly happens because SIGCHLD was received before sleep has
> started.
> 
> Fix by looping in smaller chunks, checking $exited after each of them.
> Then a lost SIGCHLD would not cause a delay longer than 1 second.
> 
> Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Max Kirillov <max@max630.net>
> ---
> Submitting as proper patch. Note: I believe it does not relate to other issues
> discussed in this thread.
>  t/t5562/invoke-with-content-length.pl | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
> index 0943474af2..257e280e3b 100644
> --- a/t/t5562/invoke-with-content-length.pl
> +++ b/t/t5562/invoke-with-content-length.pl
> @@ -29,7 +29,12 @@
>  }
>  print $out $body_data or die "Cannot write data: $!";
> 
> -sleep 60; # is interrupted by SIGCHLD
> +my $counter = 0;
> +while (not $exited and $counter < 60) {
> +        sleep 1;
> +        $counter = $counter + 1;
> +}
> +
>  if (!$exited) {
>          close($out);
>          die "Command did not exit after reading whole body";

I tried this fix and it made no difference to the hang on NonStop. I do not think this fixes the root cause as sleep was never an issue and SIGCHLD was not missed in any test I conducted. Maybe on another platform it is required.



* Re: [PATCH] t5562: chunked sleep to avoid lost SIGCHILD
  2019-02-18 20:54       ` Randall S. Becker
@ 2019-02-18 20:59         ` Max Kirillov
  0 siblings, 0 replies; 13+ messages in thread
From: Max Kirillov @ 2019-02-18 20:59 UTC (permalink / raw)
  To: Randall S. Becker
  Cc: 'Max Kirillov', 'SZEDER Gábor', git,
	'Johannes Schindelin', 'Junio C Hamano'

On Mon, Feb 18, 2019 at 03:54:27PM -0500, Randall S. Becker wrote:
> On February 18, 2019 15:50, Max Kirillov wrote:
> > To: SZEDER Gábor <szeder.dev@gmail.com>; git@vger.kernel.org
> > Cc: Max Kirillov <max@max630.net>; Johannes Schindelin
> > <Johannes.Schindelin@gmx.de>; Randall S. Becker
> > <rsbecker@nexbridge.com>; 'Junio C Hamano' <gitster@pobox.com>
> > Subject: [PATCH] t5562: chunked sleep to avoid lost SIGCHILD
> > 
> > It was found during a stress-test run that a test may hang for 60 seconds.
> > It supposedly happens because SIGCHLD was received before sleep has
> > started.
> > 
> > Fix by looping in smaller chunks, checking $exited after each of them.
> > Then a lost SIGCHLD would not cause a delay longer than 1 second.
> > 
> > Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
> > Signed-off-by: Max Kirillov <max@max630.net>
> > ---
> > Submitting as proper patch. Note: I believe it does not relate to other issues
> > discussed in this thread.
> >  t/t5562/invoke-with-content-length.pl | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
> > index 0943474af2..257e280e3b 100644
> > --- a/t/t5562/invoke-with-content-length.pl
> > +++ b/t/t5562/invoke-with-content-length.pl
> > @@ -29,7 +29,12 @@
> >  }
> >  print $out $body_data or die "Cannot write data: $!";
> > 
> > -sleep 60; # is interrupted by SIGCHLD
> > +my $counter = 0;
> > +while (not $exited and $counter < 60) {
> > +        sleep 1;
> > +        $counter = $counter + 1;
> > +}
> > +
> >  if (!$exited) {
> >          close($out);
> >          die "Command did not exit after reading whole body";
> 
> I tried this fix and it made no difference to the hang on
> NonStop. I do not think this fixes the root cause as sleep
> was never an issue and SIGCHLD was not missed in any test
> I conducted. Maybe on another platform it is required.

Correct, as I said it should not be related.


* Re: [PATCH] t5562: chunked sleep to avoid lost SIGCHILD
  2019-02-18 20:50     ` [PATCH] t5562: chunked sleep to avoid lost SIGCHILD Max Kirillov
  2019-02-18 20:54       ` Randall S. Becker
@ 2019-02-19 18:38       ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2019-02-19 18:38 UTC (permalink / raw)
  To: Max Kirillov
  Cc: SZEDER Gábor, git, Johannes Schindelin, Randall S. Becker

Max Kirillov <max@max630.net> writes:

> It was found during a stress-test run that a test may hang for 60 seconds.
> It supposedly happens because SIGCHLD was received before sleep has
> started.
>
> Fix by looping in smaller chunks, checking $exited after each of them.
> Then a lost SIGCHLD would not cause a delay longer than 1 second.
>
> Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Max Kirillov <max@max630.net>
> ---
> Submitting as proper patch. Note: I believe it does not relate to other issues
> discussed in this thread.
>  t/t5562/invoke-with-content-length.pl | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/t/t5562/invoke-with-content-length.pl b/t/t5562/invoke-with-content-length.pl
> index 0943474af2..257e280e3b 100644
> --- a/t/t5562/invoke-with-content-length.pl
> +++ b/t/t5562/invoke-with-content-length.pl
> @@ -29,7 +29,12 @@
>  }
>  print $out $body_data or die "Cannot write data: $!";
>  
> -sleep 60; # is interrupted by SIGCHLD

Ah, of course.  If SIGCHLD interrupts the sleep, the handler sets
$exited and we won't go back to sleep.  But if the signal comes before
the sleep starts, we spend the full 60 seconds here before we check
$exited.

Makes sense.

> +my $counter = 0;
> +while (not $exited and $counter < 60) {
> +        sleep 1;
> +        $counter = $counter + 1;
> +}
> +
>  if (!$exited) {
>          close($out);
>          die "Command did not exit after reading whole body";


