git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Josh Steadmon <steadmon@google.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH] run-command: don't spam trace2_child_exit()
Date: Tue, 7 Jun 2022 11:45:11 -0700	[thread overview]
Message-ID: <Yp+ctyKvKGR2aZ/s@google.com> (raw)
In-Reply-To: <ef011c3d-dd4b-f982-aa66-9167bf3e6c91@jeffhostetler.com>

On 2022.05.10 16:37, Jeff Hostetler wrote:
> 
> 
> On 5/5/22 3:58 PM, Josh Steadmon wrote:
> > On 2022.05.03 10:59, Jeff Hostetler wrote:
> > > 
> > > 
> > > On 4/28/22 5:46 PM, Junio C Hamano wrote:
> > > > Josh Steadmon <steadmon@google.com> writes:
> > > > 
> > > > > In rare cases, wait_or_whine() cannot determine a child process's exit
> > > > > status (and will return -1 in this case). This can cause Git to issue
> > > > > trace2 child_exit events despite the fact that the child is still
> > > > > running.
> > > 
> > > I'm curious what is causing the spurious return values.
> > > Could you instrument wait_or_whine() and see which of the
> > > if/else arms are causing the -1 to be returned?
> > > 
> > > That routine is rather complicated and looks like it has 3
> > > different ways that a -1 could be returned.
> > 
> > Unfortunately we haven't been able to reproduce the issue, so
> > instrumentation won't clarify the cause unless we randomly hit another
> > instance.
> > 
> > 
> > > > 
> > > > Rather, we do not even know if the child is still running when it
> > > > happens, right?  It is curious what "rare cases" makes the symptom
> > > > appear.  Do we know?
> > > > 
> > > > The patch looks OK from the "we do not know the child exited in this
> > > > case, so we shouldn't be reporting the child exit" point of view, of
> > > > course.  Having one event that started a child in the log and then
> > > > having millions of events that reports the exit of the (same) child
> > > > is way too broken.  With this change, we remove these phoney exit
> > > > events from the log.
> > > > 
> > > > Do we know, for such a child process that caused these millions
> > > > phoney exit events, we got a real exit event at the end?  Otherwise,
> > > > we'd still have a similar problem in the opposite direction, i.e. a
> > > > child has a start event recorded, many exit event discarded but the
> > > > log lacks the true exit event for the child, implying that the child
> > > > is still running because we failed to log its exit?
> > > > 
> > > > >    int finish_command_in_signal(struct child_process *cmd)
> > > > >    {
> > > > >    	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > > > > -	trace2_child_exit(cmd, ret);
> > > > > +	if (ret != -1)
> > > > > +		trace2_child_exit(cmd, ret);
> > > > >    	return ret;
> > > > >    }
> > > 
> > > Since this is only called from pager.c and is used to setup the
> > > pager, I have to wonder if you're only getting these spurious events
> > > for the pager process or for any of the other random child processes.
> > > 
> > > And whether they are received while the pager is alive and working
> > > properly, or when you're trying to quit the pager or when the pager
> > > is trying to signal eof.
> > 
> > It was for a pager run by `git diff`. The only other child process in
> > this instance was `git status`. I believe the events were generated
> > while the user was not interacting with the process; IIRC the pager was
> > left running in a terminal overnight and they discovered the next
> > morning that their disk was filled up by trace logs.
> > 
> 
> This worries me that you've found a race or infinite loop here
> and the only symptom is the child exit messages.  Suppressing
> them solves the disk-fill problem, but might mean that there's
> a lurking infinite loop or something.
> 
> Can you (privately) send me the part of the log before the
> infinite loop?  And maybe the parent and child processes too?
> I'd kinda like to understand what's happening here.
> 
> Jeff

Unfortunately we no longer have the relevant logs. This incident
happened in 2020 and the logs have expired. I can share some general
details though:

The session started on a Friday when the user ran `git diff`. Presumably
the user left the pager open after they left for the weekend. On the
following Saturday we started getting ~30M child_exit events per day
until their disk filled up early Monday morning.

The process was running on a VM, and I have reason to believe that the
user's SSH connection to the VM was broken at some point on the Saturday
that the spammy events started appearing. I don't know whether or not
the user was using any sort of session manager / terminal multiplexer
such as screen or tmux. I am curious whether or not the broken SSH
connection could have created some kind of signal loop that caused the
pager to continuously trigger our wait_for_pager_signal() handler.

I don't have any useful info beyond that, sorry. Thanks for taking a
look though!

  reply	other threads:[~2022-06-07 21:14 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-28 20:58 Josh Steadmon
2022-04-28 21:46 ` Junio C Hamano
2022-05-03 14:59   ` Jeff Hostetler
2022-05-05 19:58     ` Josh Steadmon
2022-05-10 20:37       ` Jeff Hostetler
2022-06-07 18:45         ` Josh Steadmon [this message]
2022-05-05 19:44   ` Josh Steadmon
2022-06-07 18:21 ` [PATCH v2] " Josh Steadmon
2022-06-07 22:09   ` Ævar Arnfjörð Bjarmason
2022-06-10 15:31   ` Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yp+ctyKvKGR2aZ/s@google.com \
    --to=steadmon@google.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).