git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* t5570-git-daemon fails with SIGPIPE on OSX
@ 2018-08-06 15:11 SZEDER Gábor
  2018-08-06 15:31 ` SZEDER Gábor
  2018-08-14 22:32 ` Jeff King
  0 siblings, 2 replies; 21+ messages in thread
From: SZEDER Gábor @ 2018-08-06 15:11 UTC (permalink / raw)
  To: Git mailing list; +Cc: Jeff King, Clemens Buchacher

Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
[1].  Since then OSX build jobs fail rather frequently because of a
SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
a real bug in Git affecting other platforms as well, but these tests
are too lax to catch it.

What it boils down to is this sequence:

  - The test first prepares a repository containing a corrupt pack,
    ready to be server via 'git daemon'.

  - Then the test runs 'test_must_fail git fetch ....', which connects
    to 'git daemon', which forks 'git upload-pack', which then
    advertises refs (only HEAD) and capabilities.  So far so good.

  - 'git fetch' eventually calls fetch-pack.c:find_common().  The
    first half of this function assembles a request consisting of a
    want and a flush pkt-line, and sends it via a send_request() call.

    At this point the scheduling becomes important: let's suppose that
    fetch is slow and upload-pack is fast.

  - 'git upload-pack' receives the request, parses the want line,
    notices the corrupt pack, responds with an 'ERR upload-pack: not
    our ref' pkt-line, and die()s right away.

  - 'git fetch' finally approaches the end of the function, where it
    attempts to send a done pkt-line via another send_request() call
    through the now closing TCP socket.

  - What happens now seems to depend on the platform:

    - On Linux, both on my machine and on Travis CI, it shows textbook
      example behaviour: write() returns with error and sets errno to
      ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
      die()s with 'fatal: write error: Connection reset by peer', and
      doesn't show the error send by 'git upload-pack'; how could it,
      it doesn't even get as far to receive upload-pack's ERR
      pkt-line.

      The test only checks that 'git fetch' fails, but it doesn't
      check whether it failed with the right error message, so the
      test still succeeds.  Had it checked the error message as well,
      we most likely had noticed this issue already, it doesn't happen
      all that rarely.

    - On the new OSX images with XCode 9.4 on Travis CI the write()
      triggers SIGPIPE right away, and 'test_must_fail' notices it and
      fails the test.  I couldn't see any sign of an ECONNRESET or any
      other error that we could act upon to avoid the SIGPIPE.

    - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
      ECONNRESET, but sending the request actually succeeds even
      though there is no process on the other end of the socket
      anymore.  'git fetch' then simply continues execution, reads and
      parses the ERR pkt-line, and then dies()s with 'fatal: remote
      error: upload-pack: not our ref'.  So, on the face of it, it
      shows the desired behaviour, but I have no idea how that write()
      could succeed instead of returning error.

I don't know what happens on a real Mac as I don't have access to one;
I figured out all the above by enabling packet tracing, adding a
couple of well placed tracing printf() and sleep() calls, running a
bunch of builds on Travis CI, and looking through their logs.  But
without access to a debugger and netstat and what not I can't really
go any further.  So I would now happily pass the baton to those who
have a Mac and know a thing or two about its porting issues to first
check whether OSX on a real Mac shows the same behaviour as it does in
Travis CI's virtualized(?) environment.  And then they can pass the
baton to those who know all the intricacies of the pack protocol and
its implementation to decide what to do with this issue.

For a mostly reliable reproduction recipe you might want to fetch this
branch:

  https://github.com/szeder/git t5570-git-daemon-sigpipe

and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'


Have fun! ;)


1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce

2 - On git.git's master:
    https://travis-ci.org/git/git/jobs/411517552#L2717

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2018-08-06 15:11 t5570-git-daemon fails with SIGPIPE on OSX SZEDER Gábor
@ 2018-08-06 15:31 ` SZEDER Gábor
  2019-02-08  8:32   ` Johannes Schindelin
  2018-08-14 22:32 ` Jeff King
  1 sibling, 1 reply; 21+ messages in thread
From: SZEDER Gábor @ 2018-08-06 15:31 UTC (permalink / raw)
  To: Git mailing list; +Cc: Jeff King, Clemens Buchacher

[Resending with Clemens' last used email address.
Clemens, please consider sending a patch to update our .mailmap file.]


On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor <szeder.dev@gmail.com> wrote:
>
> Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
> [1].  Since then OSX build jobs fail rather frequently because of a
> SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
> corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
> a real bug in Git affecting other platforms as well, but these tests
> are too lax to catch it.
>
> What it boils down to is this sequence:
>
>   - The test first prepares a repository containing a corrupt pack,
>     ready to be server via 'git daemon'.
>
>   - Then the test runs 'test_must_fail git fetch ....', which connects
>     to 'git daemon', which forks 'git upload-pack', which then
>     advertises refs (only HEAD) and capabilities.  So far so good.
>
>   - 'git fetch' eventually calls fetch-pack.c:find_common().  The
>     first half of this function assembles a request consisting of a
>     want and a flush pkt-line, and sends it via a send_request() call.
>
>     At this point the scheduling becomes important: let's suppose that
>     fetch is slow and upload-pack is fast.
>
>   - 'git upload-pack' receives the request, parses the want line,
>     notices the corrupt pack, responds with an 'ERR upload-pack: not
>     our ref' pkt-line, and die()s right away.
>
>   - 'git fetch' finally approaches the end of the function, where it
>     attempts to send a done pkt-line via another send_request() call
>     through the now closing TCP socket.
>
>   - What happens now seems to depend on the platform:
>
>     - On Linux, both on my machine and on Travis CI, it shows textbook
>       example behaviour: write() returns with error and sets errno to
>       ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
>       die()s with 'fatal: write error: Connection reset by peer', and
>       doesn't show the error send by 'git upload-pack'; how could it,
>       it doesn't even get as far to receive upload-pack's ERR
>       pkt-line.
>
>       The test only checks that 'git fetch' fails, but it doesn't
>       check whether it failed with the right error message, so the
>       test still succeeds.  Had it checked the error message as well,
>       we most likely had noticed this issue already, it doesn't happen
>       all that rarely.
>
>     - On the new OSX images with XCode 9.4 on Travis CI the write()
>       triggers SIGPIPE right away, and 'test_must_fail' notices it and
>       fails the test.  I couldn't see any sign of an ECONNRESET or any
>       other error that we could act upon to avoid the SIGPIPE.
>
>     - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
>       ECONNRESET, but sending the request actually succeeds even
>       though there is no process on the other end of the socket
>       anymore.  'git fetch' then simply continues execution, reads and
>       parses the ERR pkt-line, and then dies()s with 'fatal: remote
>       error: upload-pack: not our ref'.  So, on the face of it, it
>       shows the desired behaviour, but I have no idea how that write()
>       could succeed instead of returning error.
>
> I don't know what happens on a real Mac as I don't have access to one;
> I figured out all the above by enabling packet tracing, adding a
> couple of well placed tracing printf() and sleep() calls, running a
> bunch of builds on Travis CI, and looking through their logs.  But
> without access to a debugger and netstat and what not I can't really
> go any further.  So I would now happily pass the baton to those who
> have a Mac and know a thing or two about its porting issues to first
> check whether OSX on a real Mac shows the same behaviour as it does in
> Travis CI's virtualized(?) environment.  And then they can pass the
> baton to those who know all the intricacies of the pack protocol and
> its implementation to decide what to do with this issue.
>
> For a mostly reliable reproduction recipe you might want to fetch this
> branch:
>
>   https://github.com/szeder/git t5570-git-daemon-sigpipe
>
> and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'
>
>
> Have fun! ;)
>
>
> 1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce
>
> 2 - On git.git's master:
>     https://travis-ci.org/git/git/jobs/411517552#L2717

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2018-08-06 15:11 t5570-git-daemon fails with SIGPIPE on OSX SZEDER Gábor
  2018-08-06 15:31 ` SZEDER Gábor
@ 2018-08-14 22:32 ` Jeff King
  2018-08-14 22:37   ` Jeff King
  2019-02-08  9:02   ` Johannes Schindelin
  1 sibling, 2 replies; 21+ messages in thread
From: Jeff King @ 2018-08-14 22:32 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git mailing list, Clemens Buchacher

On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote:

>   - 'git upload-pack' receives the request, parses the want line,
>     notices the corrupt pack, responds with an 'ERR upload-pack: not
>     our ref' pkt-line, and die()s right away.
> 
>   - 'git fetch' finally approaches the end of the function, where it
>     attempts to send a done pkt-line via another send_request() call
>     through the now closing TCP socket.
> 
>   - What happens now seems to depend on the platform:
> 
>     - On Linux, both on my machine and on Travis CI, it shows textbook
>       example behaviour: write() returns with error and sets errno to
>       ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
>       die()s with 'fatal: write error: Connection reset by peer', and
>       doesn't show the error send by 'git upload-pack'; how could it,
>       it doesn't even get as far to receive upload-pack's ERR
>       pkt-line.
> 
>       The test only checks that 'git fetch' fails, but it doesn't
>       check whether it failed with the right error message, so the
>       test still succeeds.  Had it checked the error message as well,
>       we most likely had noticed this issue already, it doesn't happen
>       all that rarely.

Hmm. Traditionally we did not send ERR as part of upload-pack at all. It
was the message you got from git-daemon if it couldn't start the
requested sub-process. It was only later in bdb31eada7 (upload-pack:
report "not our ref" to client, 2017-02-23) that we started sending
them. So I think that is why it does not check the error message: it is
not expecting that case at all (and it is not actually interesting here,
as the real problem is that the remote side is corrupt, but it sadly
does not say anything so useful).

I think that's somewhat tangential, though. The root of the issue is
this:

>     - On the new OSX images with XCode 9.4 on Travis CI the write()
>       triggers SIGPIPE right away, and 'test_must_fail' notices it and
>       fails the test.  I couldn't see any sign of an ECONNRESET or any
>       other error that we could act upon to avoid the SIGPIPE.

Right, as soon as we get SIGPIPE we can't offer any useful message,
because we're dead. I would argue that fetch should simply turn off
SIGPIPE entirely, and rely on getting EPIPE from write(). But since
we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE
death!

So we'd probably also want to teach it to use a real write_in_full(),
and then output a more useful message in this case. write_or_die()
really does produce bad messages regardless, because it doesn't know
what it's writing to.

That would give us a baby step in the right direction, because at least
we'd always be doing a controlled die() then. And then the next step
would be to show the remote error message (even though it's not actually
useful in this case, in theory upload-pack could generate something
better). And that would mean turning the die() on write into an attempt
to drain any ERR messages before either dying or returning an error up
the stack.

I suspect the (largely untested) patch below would make your test
problems go away. Or instead, we could simply add sigpipe=ok to the
test_must_fail invocation, but I agree with you that the current
behavior on OS X is not ideal (the user sees no error message).

-Peff

diff --git a/fetch-pack.c b/fetch-pack.c
index 5714bcbddd..3e80604562 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args,
 	if (args->stateless_rpc) {
 		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
 		packet_flush(fd);
-	} else
-		write_or_die(fd, buf->buf, buf->len);
+	} else {
+		if (write_in_full(fd, buf->buf, buf->len) < 0)
+			die_errno("unable to write to remote");
+	}
 }
 
 static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
@@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
 
 	/* Send request */
 	packet_buf_flush(&req_buf);
-	write_or_die(fd_out, req_buf.buf, req_buf.len);
+	if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
+		die_errno("unable to write request to remote");
 
 	strbuf_release(&req_buf);
 	return ret;
diff --git a/pkt-line.c b/pkt-line.c
index a593c08aad..450d0801b1 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write)
 void packet_flush(int fd)
 {
 	packet_trace("0000", 4, 1);
-	write_or_die(fd, "0000", 4);
+	if (write_in_full(fd, "0000", 4) < 0)
+		die_errno("unable to write flush packet");
 }
 
 void packet_delim(int fd)
 {
 	packet_trace("0001", 4, 1);
-	write_or_die(fd, "0001", 4);
+	if (write_in_full(fd, "0000", 4) < 0)
+		die_errno("unable to write delim packet");
 }
 
 int packet_flush_gently(int fd)

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2018-08-14 22:32 ` Jeff King
@ 2018-08-14 22:37   ` Jeff King
  2019-02-08  9:02   ` Johannes Schindelin
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff King @ 2018-08-14 22:37 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git mailing list, Clemens Buchacher

On Tue, Aug 14, 2018 at 06:32:47PM -0400, Jeff King wrote:

> I suspect the (largely untested) patch below would make your test
> problems go away. Or instead, we could simply add sigpipe=ok to the
> test_must_fail invocation, but I agree with you that the current
> behavior on OS X is not ideal (the user sees no error message).

Whoops, that patch of course misses adding "sigchain_push(SIGPIPE,
SIG_IGN)" during the fetch-pack operation. I'll leave that as an
exercise to the reader. ;)

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2018-08-06 15:31 ` SZEDER Gábor
@ 2019-02-08  8:32   ` Johannes Schindelin
  2019-02-08 12:54     ` SZEDER Gábor
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2019-02-08  8:32 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git mailing list, Jeff King, Clemens Buchacher

[-- Attachment #1: Type: text/plain, Size: 8157 bytes --]

Team,

On Mon, 6 Aug 2018, SZEDER Gábor wrote:

> [Resending with Clemens' last used email address.
> Clemens, please consider sending a patch to update our .mailmap file.]
> 
> 
> On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor <szeder.dev@gmail.com> wrote:
> >
> > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
> > [1].  Since then OSX build jobs fail rather frequently because of a
> > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
> > corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
> > a real bug in Git affecting other platforms as well, but these tests
> > are too lax to catch it.

I am seeing this very frequently now, as it feels like failing in the
Azure Pipeline about half of the time.

The symptom is this:

-- snip --
++ cp -R '/Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo/repo_pack.git' '/Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo/repo_bad2.git'
++ cd '/Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo/repo_bad2.git'
+++ ls objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
++ p=objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
++ chmod u+w objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
++ printf %0256d 0
++ dd of=objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx bs=256 count=1 seek=1 conv=notrunc
1+0 records in
1+0 records out
256 bytes transferred in 0.000018 secs (14316558 bytes/sec)
++ mkdir repo_bad2.git
++ cd repo_bad2.git
++ git --bare init
Initialized empty Git repository in /Users/vsts/agent/2.146.0/work/1/s/t/trash directory.t5570-git-daemon/repo_bad2.git/
++ test_must_fail git --bare fetch git://127.0.0.1:5570/repo_bad2.git
++ case "$1" in
++ _test_ok=
++ git --bare fetch git://127.0.0.1:5570/repo_bad2.git
[13879] Connection from 127.0.0.1:60504
[13879] Extended attribute "host": 127.0.0.1:5570
[13879] Extended attribute "protocol": version=0
[13879] Request upload-pack for '/repo_bad2.git'
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: refs/heads/master does not point to a valid object!
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: refs/heads/other does not point to a valid object!
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] error: non-monotonic index ./objects/pack/pack-b2175ab283c0abe358895777ca01c36e88a92e35.idx
[13879] fatal: git upload-pack: not our ref b5c2e4226db03597a64815fd226648510b22ba40
[11833] [13879] Disconnected (with error)
++ exit_code=141
++ test 141 -eq 0
++ test_match_signal 13 141
++ test 141 = 141
++ return 0
++ list_contains '' sigpipe
++ case ",$1," in
++ return 1
++ test 141 -gt 129
++ test 141 -le 192
++ echo 'test_must_fail: died by signal 13: git --bare fetch git://127.0.0.1:5570/repo_bad2.git'
test_must_fail: died by signal 13: git --bare fetch git://127.0.0.1:5570/repo_bad2.git
++ return 1
error: last command exited with $?=1
not ok 10 - fetch notices corrupt idx
#	
#		cp -R "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_pack.git "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
#		(cd "$GIT_DAEMON_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
#		 p=$(ls objects/pack/pack-*.idx) &&
#		 chmod u+w $p &&
#		 printf %0256d 0 | dd of=$p bs=256 count=1 seek=1 conv=notrunc
#		) &&
#		mkdir repo_bad2.git &&
#		(cd repo_bad2.git &&
#		 git --bare init &&
#		 test_must_fail git --bare fetch "$GIT_DAEMON_URL/repo_bad2.git" &&
#		 test 0 = $(ls objects/pack | wc -l)
#		)
#	
-- snap --

Any ideas how to fix this test, anyone?

Ciao,
Dscho

> >
> > What it boils down to is this sequence:
> >
> >   - The test first prepares a repository containing a corrupt pack,
> >     ready to be server via 'git daemon'.
> >
> >   - Then the test runs 'test_must_fail git fetch ....', which connects
> >     to 'git daemon', which forks 'git upload-pack', which then
> >     advertises refs (only HEAD) and capabilities.  So far so good.
> >
> >   - 'git fetch' eventually calls fetch-pack.c:find_common().  The
> >     first half of this function assembles a request consisting of a
> >     want and a flush pkt-line, and sends it via a send_request() call.
> >
> >     At this point the scheduling becomes important: let's suppose that
> >     fetch is slow and upload-pack is fast.
> >
> >   - 'git upload-pack' receives the request, parses the want line,
> >     notices the corrupt pack, responds with an 'ERR upload-pack: not
> >     our ref' pkt-line, and die()s right away.
> >
> >   - 'git fetch' finally approaches the end of the function, where it
> >     attempts to send a done pkt-line via another send_request() call
> >     through the now closing TCP socket.
> >
> >   - What happens now seems to depend on the platform:
> >
> >     - On Linux, both on my machine and on Travis CI, it shows textbook
> >       example behaviour: write() returns with error and sets errno to
> >       ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> >       die()s with 'fatal: write error: Connection reset by peer', and
> >       doesn't show the error send by 'git upload-pack'; how could it,
> >       it doesn't even get as far to receive upload-pack's ERR
> >       pkt-line.
> >
> >       The test only checks that 'git fetch' fails, but it doesn't
> >       check whether it failed with the right error message, so the
> >       test still succeeds.  Had it checked the error message as well,
> >       we most likely had noticed this issue already, it doesn't happen
> >       all that rarely.
> >
> >     - On the new OSX images with XCode 9.4 on Travis CI the write()
> >       triggers SIGPIPE right away, and 'test_must_fail' notices it and
> >       fails the test.  I couldn't see any sign of an ECONNRESET or any
> >       other error that we could act upon to avoid the SIGPIPE.
> >
> >     - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
> >       ECONNRESET, but sending the request actually succeeds even
> >       though there is no process on the other end of the socket
> >       anymore.  'git fetch' then simply continues execution, reads and
> >       parses the ERR pkt-line, and then dies()s with 'fatal: remote
> >       error: upload-pack: not our ref'.  So, on the face of it, it
> >       shows the desired behaviour, but I have no idea how that write()
> >       could succeed instead of returning error.
> >
> > I don't know what happens on a real Mac as I don't have access to one;
> > I figured out all the above by enabling packet tracing, adding a
> > couple of well placed tracing printf() and sleep() calls, running a
> > bunch of builds on Travis CI, and looking through their logs.  But
> > without access to a debugger and netstat and what not I can't really
> > go any further.  So I would now happily pass the baton to those who
> > have a Mac and know a thing or two about its porting issues to first
> > check whether OSX on a real Mac shows the same behaviour as it does in
> > Travis CI's virtualized(?) environment.  And then they can pass the
> > baton to those who know all the intricacies of the pack protocol and
> > its implementation to decide what to do with this issue.
> >
> > For a mostly reliable reproduction recipe you might want to fetch this
> > branch:
> >
> >   https://github.com/szeder/git t5570-git-daemon-sigpipe
> >
> > and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'
> >
> >
> > Have fun! ;)
> >
> >
> > 1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce
> >
> > 2 - On git.git's master:
> >     https://travis-ci.org/git/git/jobs/411517552#L2717
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2018-08-14 22:32 ` Jeff King
  2018-08-14 22:37   ` Jeff King
@ 2019-02-08  9:02   ` Johannes Schindelin
  2019-02-08  9:28     ` Johannes Schindelin
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2019-02-08  9:02 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Git mailing list, Clemens Buchacher

[-- Attachment #1: Type: text/plain, Size: 5665 bytes --]

Hi Peff,

I just had a look at the patch you provided below (for some reason, my
previous search on public-inbox only turned up Gábor's mail to which you
responded).

Admittedly, I do not really understand all aspects of it, but it applies,
still, and I kicked off a stress test here:

	https://dev.azure.com/git/git/_build/results?buildId=338

It seems that your patch fixes that t5570 flakiness on macOS, and more
importantly, addresses an important issue on macOS.

Will play a bit more with it and keep you posted.

Ciao,
Dscho

On Tue, 14 Aug 2018, Jeff King wrote:

> On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote:
> 
> >   - 'git upload-pack' receives the request, parses the want line,
> >     notices the corrupt pack, responds with an 'ERR upload-pack: not
> >     our ref' pkt-line, and die()s right away.
> > 
> >   - 'git fetch' finally approaches the end of the function, where it
> >     attempts to send a done pkt-line via another send_request() call
> >     through the now closing TCP socket.
> > 
> >   - What happens now seems to depend on the platform:
> > 
> >     - On Linux, both on my machine and on Travis CI, it shows textbook
> >       example behaviour: write() returns with error and sets errno to
> >       ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> >       die()s with 'fatal: write error: Connection reset by peer', and
> >       doesn't show the error send by 'git upload-pack'; how could it,
> >       it doesn't even get as far to receive upload-pack's ERR
> >       pkt-line.
> > 
> >       The test only checks that 'git fetch' fails, but it doesn't
> >       check whether it failed with the right error message, so the
> >       test still succeeds.  Had it checked the error message as well,
> >       we most likely had noticed this issue already, it doesn't happen
> >       all that rarely.
> 
> Hmm. Traditionally we did not send ERR as part of upload-pack at all. It
> was the message you got from git-daemon if it couldn't start the
> requested sub-process. It was only later in bdb31eada7 (upload-pack:
> report "not our ref" to client, 2017-02-23) that we started sending
> them. So I think that is why it does not check the error message: it is
> not expecting that case at all (and it is not actually interesting here,
> as the real problem is that the remote side is corrupt, but it sadly
> does not say anything so useful).
> 
> I think that's somewhat tangential, though. The root of the issue is
> this:
> 
> >     - On the new OSX images with XCode 9.4 on Travis CI the write()
> >       triggers SIGPIPE right away, and 'test_must_fail' notices it and
> >       fails the test.  I couldn't see any sign of an ECONNRESET or any
> >       other error that we could act upon to avoid the SIGPIPE.
> 
> Right, as soon as we get SIGPIPE we can't offer any useful message,
> because we're dead. I would argue that fetch should simply turn off
> SIGPIPE entirely, and rely on getting EPIPE from write(). But since
> we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE
> death!
> 
> So we'd probably also want to teach it to use a real write_in_full(),
> and then output a more useful message in this case. write_or_die()
> really does produce bad messages regardless, because it doesn't know
> what it's writing to.
> 
> That would give us a baby step in the right direction, because at least
> we'd always be doing a controlled die() then. And then the next step
> would be to show the remote error message (even though it's not actually
> useful in this case, in theory upload-pack could generate something
> better). And that would mean turning the die() on write into an attempt
> to drain any ERR messages before either dying or returning an error up
> the stack.
> 
> I suspect the (largely untested) patch below would make your test
> problems go away. Or instead, we could simply add sigpipe=ok to the
> test_must_fail invocation, but I agree with you that the current
> behavior on OS X is not ideal (the user sees no error message).
> 
> -Peff
> 
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 5714bcbddd..3e80604562 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args,
>  	if (args->stateless_rpc) {
>  		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
>  		packet_flush(fd);
> -	} else
> -		write_or_die(fd, buf->buf, buf->len);
> +	} else {
> +		if (write_in_full(fd, buf->buf, buf->len) < 0)
> +			die_errno("unable to write to remote");
> +	}
>  }
>  
>  static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
> @@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
>  
>  	/* Send request */
>  	packet_buf_flush(&req_buf);
> -	write_or_die(fd_out, req_buf.buf, req_buf.len);
> +	if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
> +		die_errno("unable to write request to remote");
>  
>  	strbuf_release(&req_buf);
>  	return ret;
> diff --git a/pkt-line.c b/pkt-line.c
> index a593c08aad..450d0801b1 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write)
>  void packet_flush(int fd)
>  {
>  	packet_trace("0000", 4, 1);
> -	write_or_die(fd, "0000", 4);
> +	if (write_in_full(fd, "0000", 4) < 0)
> +		die_errno("unable to write flush packet");
>  }
>  
>  void packet_delim(int fd)
>  {
>  	packet_trace("0001", 4, 1);
> -	write_or_die(fd, "0001", 4);
> +	if (write_in_full(fd, "0000", 4) < 0)
> +		die_errno("unable to write delim packet");
>  }
>  
>  int packet_flush_gently(int fd)
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-02-08  9:02   ` Johannes Schindelin
@ 2019-02-08  9:28     ` Johannes Schindelin
  2019-02-08 19:54       ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2019-02-08  9:28 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Git mailing list, Clemens Buchacher

[-- Attachment #1: Type: text/plain, Size: 6837 bytes --]

Hi Peff,

On Fri, 8 Feb 2019, Johannes Schindelin wrote:

> I just had a look at the patch you provided below (for some reason, my
> previous search on public-inbox only turned up Gábor's mail to which you
> responded).
> 
> Admittedly, I do not really understand all aspects of it, but it applies,
> still, and I kicked off a stress test here:
> 
> 	https://dev.azure.com/git/git/_build/results?buildId=338
> 
> It seems that your patch fixes that t5570 flakiness on macOS, and more
> importantly, addresses an important issue on macOS.
> 
> Will play a bit more with it and keep you posted.

Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes
the script *succeed* when it fails?

I say that because I wanted to make sure that your patch fixes things and
reverted your change and started another build, which succeeded. So I
started another build, then another build, and they all succeeded. Only
then it dawned on me that I had not looked at the *logs*. And they all
still report the same issue, even with your patch:

https://dev.azure.com/git/git/_build/results?buildId=338&view=logs&jobId=51041795-01c5-57f3-5561-107b6b9e51a6&taskId=fadc714a-a906-5cf2-cc7a-335e443ad2f8&lineStart=1402&lineEnd=1505&colStart=1&colEnd=32

(You will have to scroll all the way down, or press Ctrl+End, to see that
"fetch notices corrupt pack" is failing.)

So I am afraid that your patch does not fix the issue nor does it work
around it.

Ciao,
Dscho

> On Tue, 14 Aug 2018, Jeff King wrote:
> 
> > On Mon, Aug 06, 2018 at 05:11:13PM +0200, SZEDER Gábor wrote:
> > 
> > >   - 'git upload-pack' receives the request, parses the want line,
> > >     notices the corrupt pack, responds with an 'ERR upload-pack: not
> > >     our ref' pkt-line, and die()s right away.
> > > 
> > >   - 'git fetch' finally approaches the end of the function, where it
> > >     attempts to send a done pkt-line via another send_request() call
> > >     through the now closing TCP socket.
> > > 
> > >   - What happens now seems to depend on the platform:
> > > 
> > >     - On Linux, both on my machine and on Travis CI, it shows textbook
> > >       example behaviour: write() returns with error and sets errno to
> > >       ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> > >       die()s with 'fatal: write error: Connection reset by peer', and
> > >       doesn't show the error send by 'git upload-pack'; how could it,
> > >       it doesn't even get as far to receive upload-pack's ERR
> > >       pkt-line.
> > > 
> > >       The test only checks that 'git fetch' fails, but it doesn't
> > >       check whether it failed with the right error message, so the
> > >       test still succeeds.  Had it checked the error message as well,
> > >       we most likely had noticed this issue already, it doesn't happen
> > >       all that rarely.
> > 
> > Hmm. Traditionally we did not send ERR as part of upload-pack at all. It
> > was the message you got from git-daemon if it couldn't start the
> > requested sub-process. It was only later in bdb31eada7 (upload-pack:
> > report "not our ref" to client, 2017-02-23) that we started sending
> > them. So I think that is why it does not check the error message: it is
> > not expecting that case at all (and it is not actually interesting here,
> > as the real problem is that the remote side is corrupt, but it sadly
> > does not say anything so useful).
> > 
> > I think that's somewhat tangential, though. The root of the issue is
> > this:
> > 
> > >     - On the new OSX images with XCode 9.4 on Travis CI the write()
> > >       triggers SIGPIPE right away, and 'test_must_fail' notices it and
> > >       fails the test.  I couldn't see any sign of an ECONNRESET or any
> > >       other error that we could act upon to avoid the SIGPIPE.
> > 
> > Right, as soon as we get SIGPIPE we can't offer any useful message,
> > because we're dead. I would argue that fetch should simply turn off
> > SIGPIPE entirely, and rely on getting EPIPE from write(). But since
> > we're in write_or_die(), it actually turns EPIPE back into a SIGPIPE
> > death!
> > 
> > So we'd probably also want to teach it to use a real write_in_full(),
> > and then output a more useful message in this case. write_or_die()
> > really does produce bad messages regardless, because it doesn't know
> > what it's writing to.
> > 
> > That would give us a baby step in the right direction, because at least
> > we'd always be doing a controlled die() then. And then the next step
> > would be to show the remote error message (even though it's not actually
> > useful in this case, in theory upload-pack could generate something
> > better). And that would mean turning the die() on write into an attempt
> > to drain any ERR messages before either dying or returning an error up
> > the stack.
> > 
> > I suspect the (largely untested) patch below would make your test
> > problems go away. Or instead, we could simply add sigpipe=ok to the
> > test_must_fail invocation, but I agree with you that the current
> > behavior on OS X is not ideal (the user sees no error message).
> > 
> > -Peff
> > 
> > diff --git a/fetch-pack.c b/fetch-pack.c
> > index 5714bcbddd..3e80604562 100644
> > --- a/fetch-pack.c
> > +++ b/fetch-pack.c
> > @@ -188,8 +188,10 @@ static void send_request(struct fetch_pack_args *args,
> >  	if (args->stateless_rpc) {
> >  		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
> >  		packet_flush(fd);
> > -	} else
> > -		write_or_die(fd, buf->buf, buf->len);
> > +	} else {
> > +		if (write_in_full(fd, buf->buf, buf->len) < 0)
> > +			die_errno("unable to write to remote");
> > +	}
> >  }
> >  
> >  static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
> > @@ -1167,7 +1169,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
> >  
> >  	/* Send request */
> >  	packet_buf_flush(&req_buf);
> > -	write_or_die(fd_out, req_buf.buf, req_buf.len);
> > +	if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
> > +		die_errno("unable to write request to remote");
> >  
> >  	strbuf_release(&req_buf);
> >  	return ret;
> > diff --git a/pkt-line.c b/pkt-line.c
> > index a593c08aad..450d0801b1 100644
> > --- a/pkt-line.c
> > +++ b/pkt-line.c
> > @@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write)
> >  void packet_flush(int fd)
> >  {
> >  	packet_trace("0000", 4, 1);
> > -	write_or_die(fd, "0000", 4);
> > +	if (write_in_full(fd, "0000", 4) < 0)
> > +		die_errno("unable to write flush packet");
> >  }
> >  
> >  void packet_delim(int fd)
> >  {
> >  	packet_trace("0001", 4, 1);
> > -	write_or_die(fd, "0001", 4);
> > +	if (write_in_full(fd, "0000", 4) < 0)
> > +		die_errno("unable to write delim packet");
> >  }
> >  
> >  int packet_flush_gently(int fd)
> > 
> > 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-02-08  8:32   ` Johannes Schindelin
@ 2019-02-08 12:54     ` SZEDER Gábor
  0 siblings, 0 replies; 21+ messages in thread
From: SZEDER Gábor @ 2019-02-08 12:54 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Git mailing list, Jeff King, Clemens Buchacher

On Fri, Feb 08, 2019 at 09:32:32AM +0100, Johannes Schindelin wrote:
> Team,
> 
> On Mon, 6 Aug 2018, SZEDER Gábor wrote:
> 
> > [Resending with Clemens' last used email address.
> > Clemens, please consider sending a patch to update our .mailmap file.]
> > 
> > 
> > On Mon, Aug 6, 2018 at 5:11 PM SZEDER Gábor <szeder.dev@gmail.com> wrote:
> > >
> > > Travis CI changed its default OSX image to use XCode 9.4 on 2018-07-31
> > > [1].  Since then OSX build jobs fail rather frequently because of a
> > > SIGPIPE in the tests 'fetch notices corrupt pack' or 'fetch notices
> > > corrupt idx' in 't5570-git-daemon.sh' [2].  I think this is a symptom
> > > a real bug in Git affecting other platforms as well, but these tests
> > > are too lax to catch it.
> 
> I am seeing this very frequently now, as it feels like failing in the
> Azure Pipeline about half of the time.

I was wondering whether it's only an issue on Travis CI, and was
waiting whether you'll complain about it :)  Evidently it is not, but
I still would like to see a report about macOS running directly on
someone's Mac hardware, i.e. without all the CI/cloud/whatnot magic.

> Any ideas how to fix this test, anyone?

I'm afraid that this is not merely an issue with the test, but a
platform issue that we should work around somehow.  I would have
suggested to follow up on what Peff suggested, but I see that you
already did, and it didn't work out...

In the meantime, for lack of a better option, I started to skip the
two failure-prone tests in the OSX build jobs in my automated CI
builds with:

 -- >8 --

Subject: [PATCH] travis-ci: skip flaky tests in 't5570-git-daemon.sh' in OSX
 build jobs

See: https://public-inbox.org/git/CAM0VKj=MCS+cmOgzf_XyPeb+qZrFmuMH52-PV_NDMZA9X+rRoA@mail.gmail.com/

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 ci/lib-travisci.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ci/lib-travisci.sh b/ci/lib-travisci.sh
index 69dff4d1ec..1cd2c1db7d 100755
--- a/ci/lib-travisci.sh
+++ b/ci/lib-travisci.sh
@@ -121,7 +121,9 @@ osx-clang|osx-gcc)
 	# t9810 occasionally fails on Travis CI OS X
 	# t9816 occasionally fails with "TAP out of sequence errors" on
 	# Travis CI OS X
-	export GIT_SKIP_TESTS="t9810 t9816"
+	# In 't5570-git-daemon.sh', the tests 'fetch notices corrupt pack'
+	# and 'fetch notices corrupt idx' fail rather frequently.
+	export GIT_SKIP_TESTS="t9810 t9816 t5570.9 t5570.10"
 	;;
 GIT_TEST_GETTEXT_POISON)
 	export GIT_TEST_GETTEXT_POISON=YesPlease
-- 
2.20.1.940.g8404bb2d1a

 -- >8 --


> > > What it boils down to is this sequence:
> > >
> > >   - The test first prepares a repository containing a corrupt pack,
> > >     ready to be server via 'git daemon'.
> > >
> > >   - Then the test runs 'test_must_fail git fetch ....', which connects
> > >     to 'git daemon', which forks 'git upload-pack', which then
> > >     advertises refs (only HEAD) and capabilities.  So far so good.
> > >
> > >   - 'git fetch' eventually calls fetch-pack.c:find_common().  The
> > >     first half of this function assembles a request consisting of a
> > >     want and a flush pkt-line, and sends it via a send_request() call.
> > >
> > >     At this point the scheduling becomes important: let's suppose that
> > >     fetch is slow and upload-pack is fast.
> > >
> > >   - 'git upload-pack' receives the request, parses the want line,
> > >     notices the corrupt pack, responds with an 'ERR upload-pack: not
> > >     our ref' pkt-line, and die()s right away.
> > >
> > >   - 'git fetch' finally approaches the end of the function, where it
> > >     attempts to send a done pkt-line via another send_request() call
> > >     through the now closing TCP socket.
> > >
> > >   - What happens now seems to depend on the platform:
> > >
> > >     - On Linux, both on my machine and on Travis CI, it shows textbook
> > >       example behaviour: write() returns with error and sets errno to
> > >       ECONNRESET.  Since it happens in write_or_die(), 'git fetch'
> > >       die()s with 'fatal: write error: Connection reset by peer', and
> > >       doesn't show the error send by 'git upload-pack'; how could it,
> > >       it doesn't even get as far to receive upload-pack's ERR
> > >       pkt-line.
> > >
> > >       The test only checks that 'git fetch' fails, but it doesn't
> > >       check whether it failed with the right error message, so the
> > >       test still succeeds.  Had it checked the error message as well,
> > >       we most likely had noticed this issue already, it doesn't happen
> > >       all that rarely.
> > >
> > >     - On the new OSX images with XCode 9.4 on Travis CI the write()
> > >       triggers SIGPIPE right away, and 'test_must_fail' notices it and
> > >       fails the test.  I couldn't see any sign of an ECONNRESET or any
> > >       other error that we could act upon to avoid the SIGPIPE.
> > >
> > >     - On OSX with XCode 9.2 on Travis CI there is neither SIGPIPE, nor
> > >       ECONNRESET, but sending the request actually succeeds even
> > >       though there is no process on the other end of the socket
> > >       anymore.  'git fetch' then simply continues execution, reads and
> > >       parses the ERR pkt-line, and then dies()s with 'fatal: remote
> > >       error: upload-pack: not our ref'.  So, on the face of it, it
> > >       shows the desired behaviour, but I have no idea how that write()
> > >       could succeed instead of returning error.
> > >
> > > I don't know what happens on a real Mac as I don't have access to one;
> > > I figured out all the above by enabling packet tracing, adding a
> > > couple of well placed tracing printf() and sleep() calls, running a
> > > bunch of builds on Travis CI, and looking through their logs.  But
> > > without access to a debugger and netstat and what not I can't really
> > > go any further.  So I would now happily pass the baton to those who
> > > have a Mac and know a thing or two about its porting issues to first
> > > check whether OSX on a real Mac shows the same behaviour as it does in
> > > Travis CI's virtualized(?) environment.  And then they can pass the
> > > baton to those who know all the intricacies of the pack protocol and
> > > its implementation to decide what to do with this issue.
> > >
> > > For a mostly reliable reproduction recipe you might want to fetch this
> > > branch:
> > >
> > >   https://github.com/szeder/git t5570-git-daemon-sigpipe
> > >
> > > and then run 'make && cd t && ./t5570-git-daemon.sh -v -x'
> > >
> > >
> > > Have fun! ;)
> > >
> > >
> > > 1 - https://blog.travis-ci.com/2018-07-19-xcode9-4-default-announce
> > >
> > > 2 - On git.git's master:
> > >     https://travis-ci.org/git/git/jobs/411517552#L2717
> > 
> > 


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-02-08  9:28     ` Johannes Schindelin
@ 2019-02-08 19:54       ` Jeff King
  2019-03-01 15:02         ` Johannes Schindelin
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-02-08 19:54 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: SZEDER Gábor, Git mailing list, Clemens Buchacher

On Fri, Feb 08, 2019 at 10:28:12AM +0100, Johannes Schindelin wrote:

> Hi Peff,
> 
> On Fri, 8 Feb 2019, Johannes Schindelin wrote:
> 
> > I just had a look at the patch you provided below (for some reason, my
> > previous search on public-inbox only turned up Gábor's mail to which you
> > responded).
> > 
> > Admittedly, I do not really understand all aspects of it, but it applies,
> > still, and I kicked off a stress test here:
> > 
> > 	https://dev.azure.com/git/git/_build/results?buildId=338
> > 
> > It seems that your patch fixes that t5570 flakiness on macOS, and more
> > importantly, addresses an important issue on macOS.
> > 
> > Will play a bit more with it and keep you posted.
> 
> Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes
> the script *succeed* when it fails?
> [...]
> So I am afraid that your patch does not fix the issue nor does it work
> around it.

I think that patch does the write_or_die conversion to handle EPIPE, but
it would still need to turn off SIGPIPE for the whole process.

So you'd also need to stick a:

  sigchain_push(SIGPIPE, SIG_IGN);

somewhere near the start of cmd_fetch(). (There may be a less
coarse-grained place to put it, but at this point I think we're just
trying to find out whether this approach even solves the problem).

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-02-08 19:54       ` Jeff King
@ 2019-03-01 15:02         ` Johannes Schindelin
  2019-03-01 19:00           ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2019-03-01 15:02 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Git mailing list, Clemens Buchacher

[-- Attachment #1: Type: text/plain, Size: 2074 bytes --]

Hi Peff,

On Fri, 8 Feb 2019, Jeff King wrote:

> On Fri, Feb 08, 2019 at 10:28:12AM +0100, Johannes Schindelin wrote:
> 
> > On Fri, 8 Feb 2019, Johannes Schindelin wrote:
> > 
> > > I just had a look at the patch you provided below (for some reason, my
> > > previous search on public-inbox only turned up Gábor's mail to which you
> > > responded).
> > > 
> > > Admittedly, I do not really understand all aspects of it, but it applies,
> > > still, and I kicked off a stress test here:
> > > 
> > > 	https://dev.azure.com/git/git/_build/results?buildId=338
> > > 
> > > It seems that your patch fixes that t5570 flakiness on macOS, and more
> > > importantly, addresses an important issue on macOS.
> > > 
> > > Will play a bit more with it and keep you posted.
> > 
> > Alas, I was fooled. *Fooled*, I say. Apparently the --stress option makes
> > the script *succeed* when it fails?
> > [...]
> > So I am afraid that your patch does not fix the issue nor does it work
> > around it.
> 
> I think that patch does the write_or_die conversion to handle EPIPE, but
> it would still need to turn off SIGPIPE for the whole process.
> 
> So you'd also need to stick a:
> 
>   sigchain_push(SIGPIPE, SIG_IGN);
> 
> somewhere near the start of cmd_fetch(). (There may be a less
> coarse-grained place to put it, but at this point I think we're just
> trying to find out whether this approach even solves the problem).

This fixes it, it seems. I let the job run with `--stress=50` and even
after half an hour, it did not fail:

attempts ://git.visualstudio.com/git/_build/results?buildId=354

(I had to cancel it, I thought that `--stress=50` would stop trying after
50 runs, but I was obviously incorrect in that assumption...)

Reverting the patch made it fail within a hundred runs:

https://git.visualstudio.com/git/_build/results?buildId=356

So. Good, we have a diff that works. But you mentioned that you'd like to
put it somewhere else? I am a bit unfamiliar with the code paths of
`cmd_fetch()`, so I would be hard pressed to find a better spot. Any hint?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-03-01 15:02         ` Johannes Schindelin
@ 2019-03-01 19:00           ` Jeff King
  2019-03-02 21:21             ` Johannes Schindelin
  2019-03-03  1:21             ` t5570-git-daemon fails with SIGPIPE on OSX Junio C Hamano
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2019-03-01 19:00 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: SZEDER Gábor, Git mailing list, Clemens Buchacher

On Fri, Mar 01, 2019 at 04:02:22PM +0100, Johannes Schindelin wrote:

> > I think that patch does the write_or_die conversion to handle EPIPE, but
> > it would still need to turn off SIGPIPE for the whole process.
> > 
> > So you'd also need to stick a:
> > 
> >   sigchain_push(SIGPIPE, SIG_IGN);
> > 
> > somewhere near the start of cmd_fetch(). (There may be a less
> > coarse-grained place to put it, but at this point I think we're just
> > trying to find out whether this approach even solves the problem).
> 
> This fixes it, it seems. I let the job run with `--stress=50` and even
> after half an hour, it did not fail:
> 
> attempts ://git.visualstudio.com/git/_build/results?buildId=354

Cool, I'm glad it works!

> (I had to cancel it, I thought that `--stress=50` would stop trying after
> 50 runs, but I was obviously incorrect in that assumption...)

No, that will do 50 simultaneous jobs. You want --stress-limit=50.

It might make more sense to have --stress-jobs, and make --stress=X set
the limit, as I think it's much more useful to set the limit than it is
to set the number of jobs. At least that's my experience.

> So. Good, we have a diff that works. But you mentioned that you'd like to
> put it somewhere else? I am a bit unfamiliar with the code paths of
> `cmd_fetch()`, so I would be hard pressed to find a better spot. Any hint?

Well, I'm of two minds there. If we want to make the minimum change,
then we'd just want to disable SIGPIPE while we're actually conversing
with the other side. So I guess that would be somewhere in do_fetch(),
before we start talking to the other side, and restoring it after we've
finished our half of the conversation (i.e., after we've done all the
want/have bits and we're receiving the pack). But that's actually pretty
awkward to do, because most of those details are handled under the hood
by the transport code.

So probably something like this is the least-terrible place to put it:

diff --git a/builtin/fetch.c b/builtin/fetch.c
index b620fd54b4..4ba63d5ac6 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, const char **argv, int pru
 
 	sigchain_push_common(unlock_pack_on_signal);
 	atexit(unlock_pack);
+	sigchain_push(SIGPIPE, SIG_IGN);
 	exit_code = do_fetch(gtransport, &rs);
+	sigchain_pop(SIGPIPE);
 	refspec_clear(&rs);
 	transport_disconnect(gtransport);
 	gtransport = NULL;

That said, I actually think it's kind of pointless for git-fetch to use
SIGPIPE at all. The purpose of SIGPIPE is that you can write a naive
program that spews output, and you'll get informed (forcefully) by the
OS if the process consuming your output stops listening. That makes
sense for programs like "git log", whose primary purpose is generating
output.

But for git-fetch, our primary purpose is receiving data and writing it
to disk. We should be checking all of our write()s already, and SIGPIPE
is just confusing. The only "big" output we generate is the status table
at the end. And even if that is going to a pipe that closes, I don't
think we'd want to fail the whole command (we'd want to finalize any
writes for what we just fetched, clean up after ourselves, etc).

So I'd actually be fine with just declaring that certain commands (like
fetch) just ignore SIGPIPE entirely.

-Peff

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-03-01 19:00           ` Jeff King
@ 2019-03-02 21:21             ` Johannes Schindelin
  2019-03-03 16:54               ` Jeff King
  2019-03-03  1:21             ` t5570-git-daemon fails with SIGPIPE on OSX Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2019-03-02 21:21 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Git mailing list, Clemens Buchacher

Hi Peff,

On Fri, 1 Mar 2019, Jeff King wrote:

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index b620fd54b4..4ba63d5ac6 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, const char **argv, int pru
>  
>  	sigchain_push_common(unlock_pack_on_signal);
>  	atexit(unlock_pack);
> +	sigchain_push(SIGPIPE, SIG_IGN);
>  	exit_code = do_fetch(gtransport, &rs);
> +	sigchain_pop(SIGPIPE);
>  	refspec_clear(&rs);
>  	transport_disconnect(gtransport);
>  	gtransport = NULL;

That indeed does the job:

	https://git.visualstudio.com/git/_build/results?buildId=358

> That said, I actually think it's kind of pointless for git-fetch to use
> SIGPIPE at all. The purpose of SIGPIPE is that you can write a naive
> program that spews output, and you'll get informed (forcefully) by the
> OS if the process consuming your output stops listening. That makes
> sense for programs like "git log", whose primary purpose is generating
> output.
> 
> But for git-fetch, our primary purpose is receiving data and writing it
> to disk. We should be checking all of our write()s already, and SIGPIPE
> is just confusing. The only "big" output we generate is the status table
> at the end. And even if that is going to a pipe that closes, I don't
> think we'd want to fail the whole command (we'd want to finalize any
> writes for what we just fetched, clean up after ourselves, etc).
> 
> So I'd actually be fine with just declaring that certain commands (like
> fetch) just ignore SIGPIPE entirely.

That's a bigger change than I'm comfortable with, so I'd like to go with
tge diff you gave above.

Do you want to turn these two patches into a proper patch series?
Otherwise I can take care of it, probably this Monday or Tuesday.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-03-01 19:00           ` Jeff King
  2019-03-02 21:21             ` Johannes Schindelin
@ 2019-03-03  1:21             ` Junio C Hamano
  2019-03-03 14:56               ` Johannes Schindelin
  1 sibling, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2019-03-03  1:21 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, SZEDER Gábor, Git mailing list,
	Clemens Buchacher

Jeff King <peff@peff.net> writes:

> But for git-fetch, our primary purpose is receiving data and writing it
> to disk. We should be checking all of our write()s already, and SIGPIPE
> is just confusing.

Yup, sounds like a very sensible argument.

> So I'd actually be fine with just declaring that certain commands (like
> fetch) just ignore SIGPIPE entirely.

Me too (to me, "actually be fine with" would mean "that's a larger
hammer alternative, which is OK, while the base solution is fine,
too").

Thanks.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-03-03  1:21             ` t5570-git-daemon fails with SIGPIPE on OSX Junio C Hamano
@ 2019-03-03 14:56               ` Johannes Schindelin
  0 siblings, 0 replies; 21+ messages in thread
From: Johannes Schindelin @ 2019-03-03 14:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, SZEDER Gábor, Git mailing list, Clemens Buchacher

Hi Junio,

On Sun, 3 Mar 2019, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > But for git-fetch, our primary purpose is receiving data and writing it
> > to disk. We should be checking all of our write()s already, and SIGPIPE
> > is just confusing.
> 
> Yup, sounds like a very sensible argument.
> 
> > So I'd actually be fine with just declaring that certain commands (like
> > fetch) just ignore SIGPIPE entirely.
> 
> Me too (to me, "actually be fine with" would mean "that's a larger
> hammer alternative, which is OK, while the base solution is fine,
> too").

So do I understand that you'd like something like this?

-- snipsnap --
t a/git.c b/git.c
index 2f604a41ea..c4122cc699 100644
--- a/git.c
+++ b/git.c
@@ -4,6 +4,7 @@
 #include "help.h"
 #include "run-command.h"
 #include "alias.h"
+#include "sigchain.h"
 
 #define RUN_SETUP		(1<<0)
 #define RUN_SETUP_GENTLY	(1<<1)
@@ -16,6 +17,7 @@
 #define SUPPORT_SUPER_PREFIX	(1<<4)
 #define DELAY_PAGER_CONFIG	(1<<5)
 #define NO_PARSEOPT		(1<<6) /* parse-options is not used */
+#define IGNORE_SIGPIPE		(1<<7)
 
 struct cmd_struct {
 	const char *cmd;
@@ -388,6 +390,8 @@ static int run_builtin(struct cmd_struct *p, int argc, const char **argv)
 	prefix = NULL;
 	help = argc == 2 && !strcmp(argv[1], "-h");
 	if (!help) {
+		if (p->option & IGNORE_SIGPIPE)
+			sigchain_push(SIGPIPE, SIG_IGN);
 		if (p->option & RUN_SETUP)
 			prefix = setup_git_directory();
 		else if (p->option & RUN_SETUP_GENTLY) {
@@ -477,7 +481,7 @@ static struct cmd_struct commands[] = {
 	{ "diff-tree", cmd_diff_tree, RUN_SETUP | NO_PARSEOPT },
 	{ "difftool", cmd_difftool, RUN_SETUP | NEED_WORK_TREE },
 	{ "fast-export", cmd_fast_export, RUN_SETUP },
-	{ "fetch", cmd_fetch, RUN_SETUP },
+	{ "fetch", cmd_fetch, RUN_SETUP | IGNORE_SIGPIPE },
 	{ "fetch-pack", cmd_fetch_pack, RUN_SETUP | NO_PARSEOPT },
 	{ "fmt-merge-msg", cmd_fmt_merge_msg, RUN_SETUP },
 	{ "for-each-ref", cmd_for_each_ref, RUN_SETUP },


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: t5570-git-daemon fails with SIGPIPE on OSX
  2019-03-02 21:21             ` Johannes Schindelin
@ 2019-03-03 16:54               ` Jeff King
  2019-03-03 16:55                 ` [PATCH 1/2] fetch: avoid calling write_or_die() Jeff King
  2019-03-03 16:58                 ` [PATCH 2/2] fetch: ignore SIGPIPE during network operation Jeff King
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2019-03-03 16:54 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, SZEDER Gábor, Git mailing list,
	Clemens Buchacher

On Sat, Mar 02, 2019 at 10:21:00PM +0100, Johannes Schindelin wrote:

> Do you want to turn these two patches into a proper patch series?
> Otherwise I can take care of it, probably this Monday or Tuesday.

Here they are. The old email sending the first one actually had a typo
(it sent "0000" instead of "0001" for some packets; oops), which is
fixed here.

I'm actually not sure of the placement for the second one; see the
comments there.

  [1/2]: fetch: avoid calling write_or_die()
  [2/2]: fetch: ignore SIGPIPE during network operation

 builtin/fetch.c | 2 ++
 fetch-pack.c    | 9 ++++++---
 pkt-line.c      | 6 ++++--
 3 files changed, 12 insertions(+), 5 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] fetch: avoid calling write_or_die()
  2019-03-03 16:54               ` Jeff King
@ 2019-03-03 16:55                 ` Jeff King
  2019-03-04 13:42                   ` Duy Nguyen
  2019-03-03 16:58                 ` [PATCH 2/2] fetch: ignore SIGPIPE during network operation Jeff King
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-03-03 16:55 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, SZEDER Gábor, Git mailing list,
	Clemens Buchacher

The write_or_die() function has one quirk that a caller might not
expect: when it sees EPIPE from the write() call, it translates that
into a death by SIGPIPE. This doesn't change the overall behavior (the
program exits either way), but it does potentially confuse test scripts
looking for a non-signal exit code.

Let's switch away from using write_or_die() in a few code paths, which
will give us more consistent exit codes. It also gives us the
opportunity to write more descriptive error messages, since we have
context that write_or_die() does not.

Note that this won't do much by itself, since we'd typically be killed
by SIGPIPE before write_or_die() even gets a chance to do its thing.
That will be addressed in the next patch.

Signed-off-by: Jeff King <peff@peff.net>
---
 fetch-pack.c | 9 ++++++---
 pkt-line.c   | 6 ++++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 812be15d7e..dca249e9d7 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -191,8 +191,10 @@ static void send_request(struct fetch_pack_args *args,
 	if (args->stateless_rpc) {
 		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
 		packet_flush(fd);
-	} else
-		write_or_die(fd, buf->buf, buf->len);
+	} else {
+		if (write_in_full(fd, buf->buf, buf->len) < 0)
+			die_errno("unable to write to remote");
+	}
 }
 
 static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
@@ -1163,7 +1165,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
 
 	/* Send request */
 	packet_buf_flush(&req_buf);
-	write_or_die(fd_out, req_buf.buf, req_buf.len);
+	if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
+		die_errno("unable to write request to remote");
 
 	strbuf_release(&req_buf);
 	return ret;
diff --git a/pkt-line.c b/pkt-line.c
index d4b71d3e82..093b2f3976 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write)
 void packet_flush(int fd)
 {
 	packet_trace("0000", 4, 1);
-	write_or_die(fd, "0000", 4);
+	if (write_in_full(fd, "0000", 4) < 0)
+		die_errno("unable to write flush packet");
 }
 
 void packet_delim(int fd)
 {
 	packet_trace("0001", 4, 1);
-	write_or_die(fd, "0001", 4);
+	if (write_in_full(fd, "0001", 4) < 0)
+		die_errno("unable to write delim packet");
 }
 
 int packet_flush_gently(int fd)
-- 
2.21.0.675.g01c085a870


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] fetch: ignore SIGPIPE during network operation
  2019-03-03 16:54               ` Jeff King
  2019-03-03 16:55                 ` [PATCH 1/2] fetch: avoid calling write_or_die() Jeff King
@ 2019-03-03 16:58                 ` Jeff King
  2019-03-04  1:11                   ` Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Jeff King @ 2019-03-03 16:58 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, SZEDER Gábor, Git mailing list,
	Clemens Buchacher

The default SIGPIPE behavior can be useful for a command that generates
a lot of output: if the receiver of our output goes away, we'll be
notified asynchronously to stop generating it (typically by killing the
program).

But for a command like fetch, which is primarily concerned with
receiving data and writing it to disk, an unexpected SIGPIPE can be
awkward. We're already checking the return value of all of our write()
calls, and dying due to the signal takes away our chance to gracefully
handle the error.

On Linux, we wouldn't generally see SIGPIPE at all during fetch. If the
other side of the network connection hangs up, we'll see ECONNRESET. But
on OS X, we get a SIGPIPE, and the process is killed. This causes t5570
to racily fail, as we sometimes die by signal (instead of the expected
die() call) when the server side hangs up.

Let's ignore SIGPIPE during the network portion of the fetch, which will
cause our write() to return EPIPE, giving us consistent behavior across
platforms.

This fixes the test flakiness, but note that it stops short of fixing
the larger problem. The server side hit a fatal error, sent us an "ERR"
packet, and then hung up. We notice the failure because we're trying to
write to a closed socket. But by dying immediately, we never actually
read the ERR packet and report its content to the user. This is a (racy)
problem on all platforms. So this patch lays the groundwork from which
that problem might be fixed consistently, but it doesn't actually fix
it.

Note the placement of the SIGPIPE handling. The absolute minimal change
would be to ignore SIGPIPE only when we're writing. But twiddling the
signal handler for each write call is inefficient and maintenance
burden. On the opposite end of the spectrum, we could simply declare
that fetch does not need SIGPIPE handling, since it doesn't generate a
lot of output, and we could just ignore it at the start of cmd_fetch().

This patch takes a middle ground. It ignores SIGPIPE during the network
operation (which is admittedly most of the program, since the actual
network operations are all done under the hood by the transport code).
So it's still pretty coarse.

Signed-off-by: Jeff King <peff@peff.net>
---
I had a hard time trying to write the last two paragraphs. I think the
placement here really is not significantly different than just ignoring
SIGPIPE for the entirety of cmd_fetch(). I kind of like your suggestion
elsewhere in the thread to just teach the git wrapper an IGNORE_SIGPIPE
flag.

 builtin/fetch.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index b620fd54b4..4ba63d5ac6 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1556,7 +1556,9 @@ static int fetch_one(struct remote *remote, int argc, const char **argv, int pru
 
 	sigchain_push_common(unlock_pack_on_signal);
 	atexit(unlock_pack);
+	sigchain_push(SIGPIPE, SIG_IGN);
 	exit_code = do_fetch(gtransport, &rs);
+	sigchain_pop(SIGPIPE);
 	refspec_clear(&rs);
 	transport_disconnect(gtransport);
 	gtransport = NULL;
-- 
2.21.0.675.g01c085a870

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fetch: ignore SIGPIPE during network operation
  2019-03-03 16:58                 ` [PATCH 2/2] fetch: ignore SIGPIPE during network operation Jeff King
@ 2019-03-04  1:11                   ` Junio C Hamano
  2019-03-05  4:11                     ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2019-03-04  1:11 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, SZEDER Gábor, Git mailing list,
	Clemens Buchacher

Jeff King <peff@peff.net> writes:

> ... But by dying immediately, we never actually
> read the ERR packet and report its content to the user. This is a (racy)
> problem on all platforms.

Yeah, I do not think of a good solution for it (nor I am not
convinced that it is worth fixing, to be honest.  The cable may get
cut before we have a chance to see the ERR packet, or the other side
may have died before producing one.

The fix obviously looks good.  Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fetch: avoid calling write_or_die()
  2019-03-03 16:55                 ` [PATCH 1/2] fetch: avoid calling write_or_die() Jeff King
@ 2019-03-04 13:42                   ` Duy Nguyen
  2019-03-05  4:11                     ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2019-03-04 13:42 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Junio C Hamano, SZEDER Gábor,
	Git mailing list, Clemens Buchacher

On Sun, Mar 3, 2019 at 11:55 PM Jeff King <peff@peff.net> wrote:
>
> The write_or_die() function has one quirk that a caller might not
> expect: when it sees EPIPE from the write() call, it translates that
> into a death by SIGPIPE. This doesn't change the overall behavior (the
> program exits either way), but it does potentially confuse test scripts
> looking for a non-signal exit code.
>
> Let's switch away from using write_or_die() in a few code paths, which
> will give us more consistent exit codes. It also gives us the
> opportunity to write more descriptive error messages, since we have
> context that write_or_die() does not.
>
> Note that this won't do much by itself, since we'd typically be killed
> by SIGPIPE before write_or_die() even gets a chance to do its thing.
> That will be addressed in the next patch.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  fetch-pack.c | 9 ++++++---
>  pkt-line.c   | 6 ++++--
>  2 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 812be15d7e..dca249e9d7 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -191,8 +191,10 @@ static void send_request(struct fetch_pack_args *args,
>         if (args->stateless_rpc) {
>                 send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
>                 packet_flush(fd);
> -       } else
> -               write_or_die(fd, buf->buf, buf->len);
> +       } else {
> +               if (write_in_full(fd, buf->buf, buf->len) < 0)
> +                       die_errno("unable to write to remote");

maybe _() these strings.

> +       }
>  }
>
>  static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
> @@ -1163,7 +1165,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
>
>         /* Send request */
>         packet_buf_flush(&req_buf);
> -       write_or_die(fd_out, req_buf.buf, req_buf.len);
> +       if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
> +               die_errno("unable to write request to remote");
>
>         strbuf_release(&req_buf);
>         return ret;
> diff --git a/pkt-line.c b/pkt-line.c
> index d4b71d3e82..093b2f3976 100644
> --- a/pkt-line.c
> +++ b/pkt-line.c
> @@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write)
>  void packet_flush(int fd)
>  {
>         packet_trace("0000", 4, 1);
> -       write_or_die(fd, "0000", 4);
> +       if (write_in_full(fd, "0000", 4) < 0)
> +               die_errno("unable to write flush packet");
>  }
>
>  void packet_delim(int fd)
>  {
>         packet_trace("0001", 4, 1);
> -       write_or_die(fd, "0001", 4);
> +       if (write_in_full(fd, "0001", 4) < 0)
> +               die_errno("unable to write delim packet");
>  }
>
>  int packet_flush_gently(int fd)
> --
> 2.21.0.675.g01c085a870
>


-- 
Duy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fetch: ignore SIGPIPE during network operation
  2019-03-04  1:11                   ` Junio C Hamano
@ 2019-03-05  4:11                     ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2019-03-05  4:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, SZEDER Gábor, Git mailing list,
	Clemens Buchacher, Duy Nguyen

On Mon, Mar 04, 2019 at 10:11:46AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > ... But by dying immediately, we never actually
> > read the ERR packet and report its content to the user. This is a (racy)
> > problem on all platforms.
> 
> Yeah, I do not think of a good solution for it (nor I am not
> convinced that it is worth fixing, to be honest.  The cable may get
> cut before we have a chance to see the ERR packet, or the other side
> may have died before producing one.

We definitely can never cover this 100%. But I wonder if we could put a
little more effort into "best effort". Specifically, I was thinking that
on seeing the write error, we might do something like:

  void NORETURN last_ditch_proto_read(const char *msg)
  {
	char *line;

	/*
	 * we had a write error; see if the server sent us anything
	 * useful to report
	 */
	if (packet_read_line_gently(fd, NULL, &line) &&
	    skip_prefix(line, "ERR ", &line)) {
	        die("remote error: %s", line);
	}

	/* otherwise, just report the write error */
	die("%s", msg);
  }

The tricky thing is that the writer does not always know the correct fd
to read more packets from (since the write errors may be deep in generic
code). I suspect we could rig up some kind of hacky global "if this
descriptor variable is non-negative, then do a last ditch read from it".

I do wonder if the race is better or worse when doing local fetches in
the test suite. On a real network with actual transit times, I suspect
we'd do better, because our writes would still be fast (we dump bytes to
the OS buffer) and we'd spend a higher percentage of our time waiting to
read back from the other side (which is what we want, because then we
see the ERR they wrote to us). That's just a guess, though.

> The fix obviously looks good.  Thanks.

Yeah, I don't think any of the above discussion needs to block the fix
here.

Here's a re-roll of the first patch, though, marked for translation as
requested by Duy.

-- >8 --
Subject: [PATCH] fetch: avoid calling write_or_die()

The write_or_die() function has one quirk that a caller might not
expect: when it sees EPIPE from the write() call, it translates that
into a death by SIGPIPE. This doesn't change the overall behavior (the
program exits either way), but it does potentially confuse test scripts
looking for a non-signal exit code.

Let's switch away from using write_or_die() in a few code paths, which
will give us more consistent exit codes. It also gives us the
opportunity to write more descriptive error messages, since we have
context that write_or_die() does not.

Note that this won't do much by itself, since we'd typically be killed
by SIGPIPE before write_or_die() even gets a chance to do its thing.
That will be addressed in the next patch.

Signed-off-by: Jeff King <peff@peff.net>
---
 fetch-pack.c | 9 ++++++---
 pkt-line.c   | 6 ++++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 812be15d7e..e69993b2eb 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -191,8 +191,10 @@ static void send_request(struct fetch_pack_args *args,
 	if (args->stateless_rpc) {
 		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
 		packet_flush(fd);
-	} else
-		write_or_die(fd, buf->buf, buf->len);
+	} else {
+		if (write_in_full(fd, buf->buf, buf->len) < 0)
+			die_errno(_("unable to write to remote"));
+	}
 }
 
 static void insert_one_alternate_object(struct fetch_negotiator *negotiator,
@@ -1163,7 +1165,8 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
 
 	/* Send request */
 	packet_buf_flush(&req_buf);
-	write_or_die(fd_out, req_buf.buf, req_buf.len);
+	if (write_in_full(fd_out, req_buf.buf, req_buf.len) < 0)
+		die_errno(_("unable to write request to remote"));
 
 	strbuf_release(&req_buf);
 	return ret;
diff --git a/pkt-line.c b/pkt-line.c
index d4b71d3e82..6bd496a9bb 100644
--- a/pkt-line.c
+++ b/pkt-line.c
@@ -88,13 +88,15 @@ static void packet_trace(const char *buf, unsigned int len, int write)
 void packet_flush(int fd)
 {
 	packet_trace("0000", 4, 1);
-	write_or_die(fd, "0000", 4);
+	if (write_in_full(fd, "0000", 4) < 0)
+		die_errno(_("unable to write flush packet"));
 }
 
 void packet_delim(int fd)
 {
 	packet_trace("0001", 4, 1);
-	write_or_die(fd, "0001", 4);
+	if (write_in_full(fd, "0001", 4) < 0)
+		die_errno(_("unable to write delim packet"));
 }
 
 int packet_flush_gently(int fd)
-- 
2.21.0.684.gc9dc8b89c9


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fetch: avoid calling write_or_die()
  2019-03-04 13:42                   ` Duy Nguyen
@ 2019-03-05  4:11                     ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2019-03-05  4:11 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Johannes Schindelin, Junio C Hamano, SZEDER Gábor,
	Git mailing list, Clemens Buchacher

On Mon, Mar 04, 2019 at 08:42:40PM +0700, Duy Nguyen wrote:

> > -       } else
> > -               write_or_die(fd, buf->buf, buf->len);
> > +       } else {
> > +               if (write_in_full(fd, buf->buf, buf->len) < 0)
> > +                       die_errno("unable to write to remote");
> 
> maybe _() these strings.

Good thinking. I just sent an update in another part of the thread.

-Peff

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-03-05  4:12 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-06 15:11 t5570-git-daemon fails with SIGPIPE on OSX SZEDER Gábor
2018-08-06 15:31 ` SZEDER Gábor
2019-02-08  8:32   ` Johannes Schindelin
2019-02-08 12:54     ` SZEDER Gábor
2018-08-14 22:32 ` Jeff King
2018-08-14 22:37   ` Jeff King
2019-02-08  9:02   ` Johannes Schindelin
2019-02-08  9:28     ` Johannes Schindelin
2019-02-08 19:54       ` Jeff King
2019-03-01 15:02         ` Johannes Schindelin
2019-03-01 19:00           ` Jeff King
2019-03-02 21:21             ` Johannes Schindelin
2019-03-03 16:54               ` Jeff King
2019-03-03 16:55                 ` [PATCH 1/2] fetch: avoid calling write_or_die() Jeff King
2019-03-04 13:42                   ` Duy Nguyen
2019-03-05  4:11                     ` Jeff King
2019-03-03 16:58                 ` [PATCH 2/2] fetch: ignore SIGPIPE during network operation Jeff King
2019-03-04  1:11                   ` Junio C Hamano
2019-03-05  4:11                     ` Jeff King
2019-03-03  1:21             ` t5570-git-daemon fails with SIGPIPE on OSX Junio C Hamano
2019-03-03 14:56               ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).