git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Is t5516 somehow flakey only on macOS?
@ 2021-01-09  0:37 Junio C Hamano
  2021-01-09  9:11 ` Eric Sunshine
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2021-01-09  0:37 UTC (permalink / raw)
  To: git

I've seen this breakage only on "regular (osx-clang, clang,
macos-latest)" so far, but that does not necessarily mean clang on
macOS is miscompiling or anything like that.

 https://github.com/git/git/runs/1671504998?check_suite_focus=true

What is frustrating is that the same CI/PR #1087 fully passes when
told to "Re-run jobs".

 https://github.com/git/git/runs/1671626677?check_suite_focus=true

They are testing exactly the same commit 820e568 that has a tree
identical to that of ed0fd57b97 in seen in today's pushout.

Any interest in helping to find out what's going on from folks in
the macOS land?

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09  0:37 Is t5516 somehow flakey only on macOS? Junio C Hamano
@ 2021-01-09  9:11 ` Eric Sunshine
  2021-01-09 10:34   ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Sunshine @ 2021-01-09  9:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List

On Fri, Jan 8, 2021 at 7:39 PM Junio C Hamano <gitster@pobox.com> wrote:
> I've seen this breakage only on "regular (osx-clang, clang,
> macos-latest)" so far, but that does not necessarily mean clang on
> macOS is miscompiling or anything like that.
>
>  https://github.com/git/git/runs/1671504998?check_suite_focus=true
>
> Any interest in helping to find out what's going on from folks in
> the macOS land?

Something seems to be dying prematurely. The code in question from
test #86 "deny fetch unreachable SHA1, allowtipsha1inwant=false" is:

    test_must_fail env GIT_TEST_PROTOCOL_VERSION=0 \
        git fetch ../testrepo/.git $SHA1_3 2>err &&
    test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err

In a successful run, the content of `err` is:

    fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
    fatal: remote error: upload-pack: not our ref
64ea4c133d59fa98e86a771eda009872d6ab2886

However, in the unsuccessful run from CI, it is:

    fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
    fatal: unable to write to remote: Broken pipe

Perhaps this means something to someone familiar with this area of the code.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09  9:11 ` Eric Sunshine
@ 2021-01-09 10:34   ` Jeff King
  2021-01-09 10:48     ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2021-01-09 10:34 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: SZEDER Gábor, Junio C Hamano, Git List

On Sat, Jan 09, 2021 at 04:11:09AM -0500, Eric Sunshine wrote:

> In a successful run, the content of `err` is:
> 
>     fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>     fatal: remote error: upload-pack: not our ref
> 64ea4c133d59fa98e86a771eda009872d6ab2886
> 
> However, in the unsuccessful run from CI, it is:
> 
>     fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>     fatal: unable to write to remote: Broken pipe
> 
> Perhaps this means something to someone familiar with this area of the code.

This is the same issue discussed in this sub-thread:

  https://lore.kernel.org/git/20190828153944.GA29715@sigill.intra.peff.net/

It's not macOS-specific, but for whatever reason it's much easier to
trigger there. There's a reliable reproduction with a sleep() in:

  https://lore.kernel.org/git/20190829220630.GH8571@szeder.dev/

The solution is something like the "after a write() error, pump the read
channel to see if we got an ERR packet" patch Gábor showed in the reply
to that message.

But an interesting thing to note: this particular race only happens with
protocol v0 (and the test is forcing v0, because v2 would not produce a
"not our ref" error in the first place). In general I'd suspect that v2
is a bit less susceptible to these types of races, because it has more
of a request/response flow. But it's still possible (e.g., if the server
side ever aborts before reading the whole request).

For this _particular_ test, since we know that it is testing a v0-only
behavior, we might want to just loosen the test. This goes against the
point of adding it in 014ade7484 (upload-pack: send ERR packet for
non-tip objects, 2019-04-13), but it's the best we can do for now.
Something like this:

diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index 01004ff680..8726d80250 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1260,7 +1260,20 @@ do
 			git cat-file commit $SHA1_2 &&
 			test_must_fail env GIT_TEST_PROTOCOL_VERSION=0 \
 				git fetch ../testrepo/.git $SHA1_3 2>err &&
-			test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err
+			# ideally we would insist this be on a "remote error:"
+			# line to make sure an ERR packet was sent. But
+			# checking that is racy, because the client may see the
+			# pipe close while they are still writing their "done"
+			# line. So check only that it appears in stderr at all,
+			# which will generally pick up the copy that
+			# upload-pack sends to stderr. This is a hack, because
+			# in the real world the upload-pack stderr would often
+			# not be shown to the client at all (e.g., over git://
+			# or over an ssh channel to a non-vanilla server). So
+			# this is papering over a real racy bug that users
+			# might see in the wild, but this is the best we can do
+			# for now to make our tests robust.
+			test_i18ngrep "not our ref.*$SHA1_3\$" err
 		)
 	'
 done

-Peff

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09 10:34   ` Jeff King
@ 2021-01-09 10:48     ` Jeff King
  2021-01-09 10:57       ` Eric Sunshine
                         ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Jeff King @ 2021-01-09 10:48 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: SZEDER Gábor, Junio C Hamano, Git List

On Sat, Jan 09, 2021 at 05:34:05AM -0500, Jeff King wrote:

> For this _particular_ test, since we know that it is testing a v0-only
> behavior, we might want to just loosen the test. This goes against the
> point of adding it in 014ade7484 (upload-pack: send ERR packet for
> non-tip objects, 2019-04-13), but it's the best we can do for now.
> Something like this:

Since this issue has been languishing for a while now with several
"something like this" patches, I've packaged it up into something more
palatable. I think we should just apply this and move on. We may still
run into other similar races, but I don't think this one is worth
spending more mental effort on.

-- >8 --
Subject: [PATCH] t5516: loosen "not our ref" error check

Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
2019-04-13) added a test that greps the output of a failed fetch to make
sure that upload-pack sent us the ERR packet we expected. But checking
this is racy; despite the argument in that commit, the client may still
be sending a "done" line when the server exits, causing it to die() on a
failed write() and never see the ERR packet at all.

This fails quite rarely on Linux, but more often on macOS. However, it
can be triggered reliably with:

	diff --git a/fetch-pack.c b/fetch-pack.c
	index 876f90c759..cf40de9092 100644
	--- a/fetch-pack.c
	+++ b/fetch-pack.c
	@@ -489,6 +489,7 @@ static int find_common(struct fetch_negotiator *negotiator,
	 done:
	 	trace2_region_leave("fetch-pack", "negotiation_v0_v1", the_repository);
	 	if (!got_ready || !no_done) {
	+		sleep(1);
	 		packet_buf_write(&req_buf, "done\n");
	 		send_request(args, fd[1], &req_buf);
	 	}

This is a real user-visible race that it would be nice to fix, but it's
tricky to do so: the client would have to speculatively try to read an
ERR packet after hitting a write() error. And at least for this error,
it's specific to v0 (since v2 does not enforce reachability at all).

So let's loosen to test to avoid annoying racy failures. If we
eventually do the read-after-failed-write thing, we can tighten it. And
if not, v0 will grow increasingly obsolete as servers support v2, so the
utility of this test will decrease over time anyway.

Note that we can still check stderr to make sure upload-pack bailed for
the reason we expected. It writes a similar message to stderr, and
because the server side is just another process connected by pipes,
we'll reliably see it. This would not be the case for git://, or for
ssh servers that do not relay stderr (e.g., GitHub's custom endpoint
does not).

Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5516-fetch-push.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index 01004ff680..3ed121d0ce 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1260,7 +1260,9 @@ do
 			git cat-file commit $SHA1_2 &&
 			test_must_fail env GIT_TEST_PROTOCOL_VERSION=0 \
 				git fetch ../testrepo/.git $SHA1_3 2>err &&
-			test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err
+			# ideally we would insist this be on a "remote error:"
+			# line, but it is racy; see the commit message
+			test_i18ngrep "not our ref.*$SHA1_3\$" err
 		)
 	'
 done
-- 
2.30.0.453.g66a128ec8e


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09 10:48     ` Jeff King
@ 2021-01-09 10:57       ` Eric Sunshine
  2021-01-09 17:33       ` SZEDER Gábor
  2021-01-09 22:29       ` Junio C Hamano
  2 siblings, 0 replies; 9+ messages in thread
From: Eric Sunshine @ 2021-01-09 10:57 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Junio C Hamano, Git List

On Sat, Jan 9, 2021 at 5:48 AM Jeff King <peff@peff.net> wrote:
> Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13) added a test that greps the output of a failed fetch to make
> sure that upload-pack sent us the ERR packet we expected. But checking
> this is racy; despite the argument in that commit, the client may still
> be sending a "done" line when the server exits, causing it to die() on a
> failed write() and never see the ERR packet at all.
> [...]
> So let's loosen to test to avoid annoying racy failures. If we

s/to test to/the test to/

> eventually do the read-after-failed-write thing, we can tighten it. And
> if not, v0 will grow increasingly obsolete as servers support v2, so the
> utility of this test will decrease over time anyway.
>
> Note that we can still check stderr to make sure upload-pack bailed for
> the reason we expected. It writes a similar message to stderr, and
> because the server side is just another process connected by pipes,
> we'll reliably see it. This would not be the case for git://, or for
> ssh servers that do not relay stderr (e.g., GitHub's custom endpoint
> does not).
>
> Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Jeff King <peff@peff.net>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09 10:48     ` Jeff King
  2021-01-09 10:57       ` Eric Sunshine
@ 2021-01-09 17:33       ` SZEDER Gábor
  2021-01-09 22:42         ` Junio C Hamano
  2021-01-09 22:29       ` Junio C Hamano
  2 siblings, 1 reply; 9+ messages in thread
From: SZEDER Gábor @ 2021-01-09 17:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, Junio C Hamano, Git List

On Sat, Jan 09, 2021 at 05:48:10AM -0500, Jeff King wrote:
> On Sat, Jan 09, 2021 at 05:34:05AM -0500, Jeff King wrote:
> 
> > For this _particular_ test, since we know that it is testing a v0-only
> > behavior, we might want to just loosen the test. This goes against the
> > point of adding it in 014ade7484 (upload-pack: send ERR packet for
> > non-tip objects, 2019-04-13), but it's the best we can do for now.
> > Something like this:
> 
> Since this issue has been languishing for a while now with several
> "something like this" patches, I've packaged it up into something more
> palatable. I think we should just apply this and move on. We may still
> run into other similar races, but I don't think this one is worth
> spending more mental effort on.
> 
> -- >8 --
> Subject: [PATCH] t5516: loosen "not our ref" error check
> 
> Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13) added a test that greps the output of a failed fetch to make
> sure that upload-pack sent us the ERR packet we expected. But checking
> this is racy; despite the argument in that commit, the client may still
> be sending a "done" line when the server exits, causing it to die() on a

Nit: I think using the word "after" would make the problematic
sequence of events a tad clearer, i.e. "... after the server has
exited, ...".

> failed write() and never see the ERR packet at all.
> 
> This fails quite rarely on Linux, but more often on macOS. However, it
> can be triggered reliably with:
> 
> 	diff --git a/fetch-pack.c b/fetch-pack.c
> 	index 876f90c759..cf40de9092 100644
> 	--- a/fetch-pack.c
> 	+++ b/fetch-pack.c
> 	@@ -489,6 +489,7 @@ static int find_common(struct fetch_negotiator *negotiator,
> 	 done:
> 	 	trace2_region_leave("fetch-pack", "negotiation_v0_v1", the_repository);
> 	 	if (!got_ready || !no_done) {
> 	+		sleep(1);
> 	 		packet_buf_write(&req_buf, "done\n");
> 	 		send_request(args, fd[1], &req_buf);
> 	 	}

FWIW (not much?), I've run the test suite with that sleep(1) in place,
and there were no other test failures.

> This is a real user-visible race that it would be nice to fix, but it's
> tricky to do so: the client would have to speculatively try to read an
> ERR packet after hitting a write() error. And at least for this error,
> it's specific to v0 (since v2 does not enforce reachability at all).
> 
> So let's loosen to test to avoid annoying racy failures. If we
> eventually do the read-after-failed-write thing, we can tighten it. And
> if not, v0 will grow increasingly obsolete as servers support v2, so the
> utility of this test will decrease over time anyway.

Makes sense.  Back then when I investigated this issue the default
protocol was still v0; now that we default to v2 I agree its better to
work around the issue in the test instead of "fixing" the root cause
with that "trying to read ERR packet on error" hack.

Good, a year-and-a-half old entry checked off from my todo list :)
Thanks.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09 10:48     ` Jeff King
  2021-01-09 10:57       ` Eric Sunshine
  2021-01-09 17:33       ` SZEDER Gábor
@ 2021-01-09 22:29       ` Junio C Hamano
  2 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2021-01-09 22:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Sunshine, SZEDER Gábor, Git List

Jeff King <peff@peff.net> writes:

> Since this issue has been languishing for a while now with several
> "something like this" patches, I've packaged it up into something more
> palatable. I think we should just apply this and move on. We may still
> run into other similar races, but I don't think this one is worth
> spending more mental effort on.

Thanks, will queue.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09 17:33       ` SZEDER Gábor
@ 2021-01-09 22:42         ` Junio C Hamano
  2021-01-10  3:23           ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2021-01-09 22:42 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Jeff King, Eric Sunshine, Git List

SZEDER Gábor <szeder.dev@gmail.com> writes:

>> Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
>> 2019-04-13) added a test that greps the output of a failed fetch to make
>> sure that upload-pack sent us the ERR packet we expected. But checking
>> this is racy; despite the argument in that commit, the client may still
>> be sending a "done" line when the server exits, causing it to die() on a
>
> Nit: I think using the word "after" would make the problematic
> sequence of events a tad clearer, i.e. "... after the server has
> exited, ...".

Thanks everybody for helping with extra clarity.  I do find it
easier to follow with "after the server has exited" to picture
the sequence of events in my mind.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Is t5516 somehow flakey only on macOS?
  2021-01-09 22:42         ` Junio C Hamano
@ 2021-01-10  3:23           ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2021-01-10  3:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: SZEDER Gábor, Eric Sunshine, Git List

On Sat, Jan 09, 2021 at 02:42:56PM -0800, Junio C Hamano wrote:

> SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
> >> Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> >> 2019-04-13) added a test that greps the output of a failed fetch to make
> >> sure that upload-pack sent us the ERR packet we expected. But checking
> >> this is racy; despite the argument in that commit, the client may still
> >> be sending a "done" line when the server exits, causing it to die() on a
> >
> > Nit: I think using the word "after" would make the problematic
> > sequence of events a tad clearer, i.e. "... after the server has
> > exited, ...".
> 
> Thanks everybody for helping with extra clarity.  I do find it
> easier to follow with "after the server has exited" to picture
> the sequence of events in my mind.

Agreed. Here's the same patch with the commit message updated to use
"after", plus Eric's typofix.

-- >8 --
Subject: [PATCH] t5516: loosen "not our ref" error check

Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
2019-04-13) added a test that greps the output of a failed fetch to make
sure that upload-pack sent us the ERR packet we expected. But checking
this is racy; despite the argument in that commit, the client may still
be sending a "done" line after the server exits, causing it to die() on
a failed write() and never see the ERR packet at all.

This fails quite rarely on Linux, but more often on macOS. However, it
can be triggered reliably with:

	diff --git a/fetch-pack.c b/fetch-pack.c
	index 876f90c759..cf40de9092 100644
	--- a/fetch-pack.c
	+++ b/fetch-pack.c
	@@ -489,6 +489,7 @@ static int find_common(struct fetch_negotiator *negotiator,
	 done:
	 	trace2_region_leave("fetch-pack", "negotiation_v0_v1", the_repository);
	 	if (!got_ready || !no_done) {
	+		sleep(1);
	 		packet_buf_write(&req_buf, "done\n");
	 		send_request(args, fd[1], &req_buf);
	 	}

This is a real user-visible race that it would be nice to fix, but it's
tricky to do so: the client would have to speculatively try to read an
ERR packet after hitting a write() error. And at least for this error,
it's specific to v0 (since v2 does not enforce reachability at all).

So let's loosen the test to avoid annoying racy failures. If we
eventually do the read-after-failed-write thing, we can tighten it. And
if not, v0 will grow increasingly obsolete as servers support v2, so the
utility of this test will decrease over time anyway.

Note that we can still check stderr to make sure upload-pack bailed for
the reason we expected. It writes a similar message to stderr, and
because the server side is just another process connected by pipes,
we'll reliably see it. This would not be the case for git://, or for
ssh servers that do not relay stderr (e.g., GitHub's custom endpoint
does not).

Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5516-fetch-push.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index 01004ff680..3ed121d0ce 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1260,7 +1260,9 @@ do
 			git cat-file commit $SHA1_2 &&
 			test_must_fail env GIT_TEST_PROTOCOL_VERSION=0 \
 				git fetch ../testrepo/.git $SHA1_3 2>err &&
-			test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err
+			# ideally we would insist this be on a "remote error:"
+			# line, but it is racy; see the commit message
+			test_i18ngrep "not our ref.*$SHA1_3\$" err
 		)
 	'
 done
-- 
2.30.0.455.g8a2dc96235


^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-01-10  3:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-09  0:37 Is t5516 somehow flakey only on macOS? Junio C Hamano
2021-01-09  9:11 ` Eric Sunshine
2021-01-09 10:34   ` Jeff King
2021-01-09 10:48     ` Jeff King
2021-01-09 10:57       ` Eric Sunshine
2021-01-09 17:33       ` SZEDER Gábor
2021-01-09 22:42         ` Junio C Hamano
2021-01-10  3:23           ` Jeff King
2021-01-09 22:29       ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).