git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* O_NONBLOCK: should I blame git or ssh?
@ 2019-09-07 23:50 Douglas Graham
  2019-09-08 10:28 ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Douglas Graham @ 2019-09-07 23:50 UTC (permalink / raw)
  To: git@vger.kernel.org

We have a parallel build that occasionally fails with the error message
"make: write error".  Make prints that error message as it is exiting when
it detects that it has seen errors while writing to stdout.  The error it
is enountering is an EAGAIN error, which implies that something has made
its stdout non-blocking.  As far as I've been able to tell so far, this is
occurring while make is running the command "git fetch --quiet --tags".
Once that command finishes, stdout goes back to being blocking but since
this is a parallel build, make is doing other work while this git command
is running, and may attempt to write to stdout during that time.

By stracing this git command, I can see it running subcommand

ssh -p 29418 user@gerrit.domain "git-upload-pack '/repo'"

and I can see that ssh command doing this:

39828 dup(0)                            = 5
39828 dup(1)                            = 6
39828 dup(2)                            = 7
39828 ioctl(5, TCGETS, 0x7ffea2880800)  = -1 ENOTTY (Inappropriate ioctl for device)
39828 fcntl(5, F_GETFL)                 = 0 (flags O_RDONLY)
39828 fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
39828 ioctl(6, TCGETS, 0x7ffea2880800)  = -1 ENOTTY (Inappropriate ioctl for device)
39828 fcntl(6, F_GETFL)                 = 0x1 (flags O_WRONLY)
39828 fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
39828 ioctl(7, TCGETS, {B38400 opost isig icanon echo ...}) = 0
39828 fcntl(5, F_SETFD, FD_CLOEXEC)     = 0
39828 fcntl(6, F_SETFD, FD_CLOEXEC)     = 0
39828 fcntl(7, F_SETFD, FD_CLOEXEC)     = 0
...
39828 ioctl(0, TCGETS, 0x7ffea28806e0)  = -1 ENOTTY (Inappropriate ioctl for device)
39828 fcntl(0, F_GETFL)                 = 0x800 (flags O_RDONLY|O_NONBLOCK)
39828 fcntl(0, F_SETFL, O_RDONLY)       = 0
39828 ioctl(1, TCGETS, 0x7ffea28806e0)  = -1 ENOTTY (Inappropriate ioctl for device)
39828 fcntl(1, F_GETFL)                 = 0x801 (flags O_WRONLY|O_NONBLOCK)
39828 fcntl(1, F_SETFL, O_WRONLY)       = 0
39828 ioctl(2, TCGETS, {B38400 opost isig icanon echo ...}) = 0

So ssh has dup'd descriptors 0, 1, and 2, and then turned on the O_NONBLOCK flag on
the copies of stdin and stdout.  You can see afterwards that ssh reads the flags on
descriptors 0 and 1, and both have O_NONBLOCK set.  It then clears that bit.  It set O_NONBLOCK
near the beginning of its runs an cleared it near the end.

Should this be considered a git bug or an ssh bug or something else?

I thought I had finally figured out exactly what is happening but while writing this,
now I'm not sure why my workaround appears to be working.  My workaround is to pipe
make's stdout into a simple program that reads make's output and writes it to where
make uses to write to, except it does a select() on descriptor 1 before writing, and
it makes sure to handle short counts.  But now I'm thinking that if it's the ssh started
indirectly by make that is messing with O_NONBLOCK, presumably it would be messing
with O_NONBLOCK on the write side of the pipe that make writes to, so make should
still be encountering EAGAIN errors. And yet my workaround does seem to work.

Thanks for any light you can shed on this.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: O_NONBLOCK: should I blame git or ssh?
  2019-09-07 23:50 O_NONBLOCK: should I blame git or ssh? Douglas Graham
@ 2019-09-08 10:28 ` Jeff King
  2019-09-08 14:18   ` Douglas Graham
  2019-09-09 15:52   ` Douglas Graham
  0 siblings, 2 replies; 5+ messages in thread
From: Jeff King @ 2019-09-08 10:28 UTC (permalink / raw)
  To: Douglas Graham; +Cc: git@vger.kernel.org

On Sat, Sep 07, 2019 at 11:50:41PM +0000, Douglas Graham wrote:

> We have a parallel build that occasionally fails with the error message
> "make: write error".  Make prints that error message as it is exiting when
> it detects that it has seen errors while writing to stdout.  The error it
> is enountering is an EAGAIN error, which implies that something has made
> its stdout non-blocking.  As far as I've been able to tell so far, this is
> occurring while make is running the command "git fetch --quiet --tags".
> Once that command finishes, stdout goes back to being blocking but since
> this is a parallel build, make is doing other work while this git command
> is running, and may attempt to write to stdout during that time.

As I understand it, the mechanism you describe here for your symptoms is
plausible, because O_NONBLOCK is not attached to a particular
descriptor, but a POSIX "file description". So a sub-process setting
flags on its stdout can impact the parent process.

But...

> By stracing this git command, I can see it running subcommand
> 
> ssh -p 29418 user@gerrit.domain "git-upload-pack '/repo'"

I don't think this can directly be the culprit, because that ssh's
stdout will be hooked to a pipe talking to Git, not to the original
stdout of "git fetch". It should not have even received a descriptor
that is a copy of the original stdout (nor stdin), since those would
have been closed as part of the fork+exec.

The child ssh _does_ have access to the original stderr, which could
plausibly be a dup of the original stdout. But your strace shows ssh
setting the flag only for stdin/stdout.

> Should this be considered a git bug or an ssh bug or something else?

I'm not convinced that what you've found is the root cause of your
problem. But even if it is, I don't think you can fault Git. We never
touch the descriptor flags ourselves, and ssh is a black box to us.

> I thought I had finally figured out exactly what is happening but while writing this,
> now I'm not sure why my workaround appears to be working.  My workaround is to pipe
> make's stdout into a simple program that reads make's output and writes it to where
> make uses to write to, except it does a select() on descriptor 1 before writing, and
> it makes sure to handle short counts.  But now I'm thinking that if it's the ssh started
> indirectly by make that is messing with O_NONBLOCK, presumably it would be messing
> with O_NONBLOCK on the write side of the pipe that make writes to, so make should
> still be encountering EAGAIN errors. And yet my workaround does seem to work.

It's possible that it helps because your pipe-reading program is
(almost) always trying to read, and thus "make" on the writing end of
the pipe never sees EAGAIN. As long as your program reads the output
before the pipe buffer fills up, you'd be fine. Where did the output go
in your original? If to a regular file, I'd expect O_NONBLOCK to have no
effect anyway.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: O_NONBLOCK: should I blame git or ssh?
  2019-09-08 10:28 ` Jeff King
@ 2019-09-08 14:18   ` Douglas Graham
  2019-09-09 17:04     ` Jeff King
  2019-09-09 15:52   ` Douglas Graham
  1 sibling, 1 reply; 5+ messages in thread
From: Douglas Graham @ 2019-09-08 14:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git@vger.kernel.org

Thanks for giving this some thought.

> I don't think this can directly be the culprit, because that ssh's stdout will be
> hooked to a pipe talking to Git, not to the original stdout of "git fetch". It
> should not have even received a descriptor that is a copy of the original
> stdout (nor stdin), since those would have been closed as part of the
> fork+exec.
> 
> The child ssh _does_ have access to the original stderr, which could plausibly
> be a dup of the original stdout. But your strace shows ssh setting the flag
> only for stdin/stdout.

I wondered about that too.  I also wondered why we only have this problem
when doing builds with Jenkins.  The same error has never happened when doing
builds manually as far as I know.  However, stracing the build while it is
running under Jenkins is difficult, so my strace output is from a manual run.
It turns out that ssh only sets non-blocking mode on a descriptor if that
descriptor does not refer to a TTY.  The code in function ssh_session2_open()
looks like:

        if (stdin_null_flag) {
                in = open(_PATH_DEVNULL, O_RDONLY);
        } else {
                in = dup(STDIN_FILENO);
        }
        out = dup(STDOUT_FILENO);
        err = dup(STDERR_FILENO);

        /* enable nonblocking unless tty */
        if (!isatty(in))
                set_nonblock(in);
        if (!isatty(out))
                set_nonblock(out);
        if (!isatty(err))
                set_nonblock(err);

When I collected that strace output, I had stdout redirected to a pipe to my
workaround program, but I did not redirect stderr.  So ssh made stdout non-blocking,
but since stderr was still connected to my terminal, it didn't touch that. But when
this build is run under Jenkins, both stdout and stderr are connected to a pipe that
Jenkins creates to collect output from the build. I assume that when git runs ssh, it
does not redirect ssh's stderr to its own pipe, it only redirects stdout. So I think
ssh will be messing with both pipes when this build is run under Jenkins.

Now that I have a fairly good understanding of what's happening, I think I can work
around this occasional error by redirecting git's stderr to a file or something like
that, but it's taken us a long time to figure this out, so I wonder if a more permanent
fix shouldn't be implement, so others don't run into the same problem.  A google for
"make: write error" indicates that we're not the first to have this problem with
parallel builds, although in the other cases I've found, a specific version of the
Linux kernel was being blamed.  Maybe that was a different problem.

I guess git could workaround this by redirecting stderr, but the real problem is probably
with ssh, although it's not clear to me what it should do differently. It does some
somehow backwards to me that that it only makes a descriptor non-blocking if it doesn't
refer to a TTY, but it does the same thing in at least three different places so I guess
that's  not a mistake.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: O_NONBLOCK: should I blame git or ssh?
  2019-09-08 10:28 ` Jeff King
  2019-09-08 14:18   ` Douglas Graham
@ 2019-09-09 15:52   ` Douglas Graham
  1 sibling, 0 replies; 5+ messages in thread
From: Douglas Graham @ 2019-09-09 15:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git@vger.kernel.org

I think I've just figured out why my workaround works.  When ssh sets O_NONBLOCK
on stdout, it's setting it on the write side of the pipe that git created.  Nothing
else writes to that pipe, so this does not cause a problem.  Presumably, ssh itself
is able to deal with writing to a non-blocking "file description". When ssh sets
O_NONBLOCK on stderr, it's setting it on the pipe created by Jenkins.  But that's
the same pipe that my workaround program does the select() on, so my workaround
can compensate. When I run "make | workaround", the write side of *that* pipe is
never made non-blocking, and that's the one that make itself writes to.

Ok, I think all mysteries are solved, but I'm still not sure what the moral of
the story is. This seems like it could lead to some really difficult to debug
problems, but it's not clear to me what the right long-term fix should be. For
now, we'll just leave our workaround in place and think about tweaking our
makefile to redirect git's stderr.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: O_NONBLOCK: should I blame git or ssh?
  2019-09-08 14:18   ` Douglas Graham
@ 2019-09-09 17:04     ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2019-09-09 17:04 UTC (permalink / raw)
  To: Douglas Graham; +Cc: git@vger.kernel.org

On Sun, Sep 08, 2019 at 02:18:15PM +0000, Douglas Graham wrote:

> When I collected that strace output, I had stdout redirected to a pipe to my
> workaround program, but I did not redirect stderr.  So ssh made stdout non-blocking,
> but since stderr was still connected to my terminal, it didn't touch that. But when
> this build is run under Jenkins, both stdout and stderr are connected to a pipe that
> Jenkins creates to collect output from the build. I assume that when git runs ssh, it
> does not redirect ssh's stderr to its own pipe, it only redirects stdout. So I think
> ssh will be messing with both pipes when this build is run under Jenkins.

OK, that makes sense.

> Now that I have a fairly good understanding of what's happening, I think I can work
> around this occasional error by redirecting git's stderr to a file or something like
> that, but it's taken us a long time to figure this out, so I wonder if a more permanent
> fix shouldn't be implement, so others don't run into the same problem.  A google for
> "make: write error" indicates that we're not the first to have this problem with
> parallel builds, although in the other cases I've found, a specific version of the
> Linux kernel was being blamed.  Maybe that was a different problem.
> 
> I guess git could workaround this by redirecting stderr, but the real problem is probably
> with ssh, although it's not clear to me what it should do differently. It does some
> somehow backwards to me that that it only makes a descriptor non-blocking if it doesn't
> refer to a TTY, but it does the same thing in at least three different places so I guess
> that's  not a mistake.

Where would git redirect the stderr to? We definitely want to make sure
it goes to our original stderr, since it can have useful content for the
user to see. We could make a new pipe and then pump the output back to
our original stderr. But besides being complex, that fools the
downstream programs about whether stderr is a tty (I don't know whether
ssh cares, but certainly git itself uses that to decide on some elements
of the output, mostly progress meters).

So I think it would make more sense to talk to ssh folks about why this
momentary O_NONBLOCK setting happens, and if it can be avoided.

-Peff

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-09-09 17:04 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-07 23:50 O_NONBLOCK: should I blame git or ssh? Douglas Graham
2019-09-08 10:28 ` Jeff King
2019-09-08 14:18   ` Douglas Graham
2019-09-09 17:04     ` Jeff King
2019-09-09 15:52   ` Douglas Graham

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).