git@vger.kernel.org mailing list mirror (one of many)
* clone hang prevention / timeout?
@ 2016-04-11 21:49 Jason Vas Dias
  2016-04-12  8:01 ` Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Jason Vas Dias @ 2016-04-11 21:49 UTC (permalink / raw)
  To: git

It appears GIT has no way of specifying a timeout for a clone operation -
if the server decides not to complete a get request, the clone can
hang forever -
is this correct ?
This appears to be what I am seeing, in a script that is attempting to do many
successive clone operations, eg. of
git://anongit.freedesktop.org/xorg/* , the script
occasionally hangs in a clone - I can see with netstat + strace that the TCP
connection is open and GIT is trying to read .
Is there any option I can specify to get the clone to timeout, or do I manually
have to strace the git process and send it a signal after a hang is detected?


* Re: clone hang prevention / timeout?
  2016-04-11 21:49 clone hang prevention / timeout? Jason Vas Dias
@ 2016-04-12  8:01 ` Eric Wong
  2016-04-13 22:24 ` Jeff King
  2016-04-13 22:29 ` Jeff King
  2 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2016-04-12  8:01 UTC (permalink / raw)
  To: Jason Vas Dias; +Cc: git

Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> It appears GIT has no way of specifying a timeout for a clone operation -
> if the server decides not to complete a get request, the clone can
> hang forever -
> is this correct ?

git uses SO_KEEPALIVE for all connections it makes, so the timeout is
whatever your kernel TCP keepalive knobs are set to.

By default, it's very long (around 2 hours), but you can change them
using the tcp_keepalive_* knobs in /proc/sys/net/ipv4/ under Linux.
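
To put a number on "very long", here is a rough sketch of how the three knobs combine. It assumes the usual Linux behaviour: the first probe goes out after tcp_keepalive_time seconds of idleness, then one probe every tcp_keepalive_intvl seconds, and the connection is dropped after tcp_keepalive_probes unanswered probes. The function name is made up for illustration.

```shell
#!/bin/sh
# Worst-case seconds before the kernel gives up on an idle, dead peer:
# idle time before the first probe, plus one interval per unanswered probe.
keepalive_worst_case() {
    # $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
    echo $(( $1 + $2 * $3 ))
}

# Report the live settings where the files are readable (Linux only).
dir=/proc/sys/net/ipv4
if [ -r "$dir/tcp_keepalive_time" ]; then
    t=$(cat "$dir/tcp_keepalive_time")
    i=$(cat "$dir/tcp_keepalive_intvl")
    p=$(cat "$dir/tcp_keepalive_probes")
    echo "current settings: about $(keepalive_worst_case "$t" "$i" "$p") seconds"
fi

# Stock Linux defaults are 7200, 75 and 9.
keepalive_worst_case 7200 75 9   # prints 7875
```

Lowering the values (as root, e.g. with sysctl -w net.ipv4.tcp_keepalive_time=...) affects every socket on the machine that enables SO_KEEPALIVE, not just git.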

I suppose we can do shorter timeouts (at least under Linux) via
setsockopt(.. TCP_KEEP*) knobs, or we can call poll() ourselves
to timeout connections.  However, git packing operations on the
server can take a long time; so it might be bad to timeout
manually unless we know the connection is really dead.

> This appears to be what I am seeing, in a script that is attempting to do many
> successive clone operations, eg. of
> git://anongit.freedesktop.org/xorg/* , the script
> occasionally hangs in a clone - I can see with netstat + strace that the TCP
> connection is open and GIT is trying to read .
> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

I added git:// support for SO_KEEPALIVE in commit e47a8583a202
("enable SO_KEEPALIVE for connected TCP sockets")
back in 2011 (v1.7.10),
and http:// support later in 2013 (v1.8.5) with
commit a15d069a1986 ("http: enable keepalive on TCP sockets")


* Re: clone hang prevention / timeout?
  2016-04-11 21:49 clone hang prevention / timeout? Jason Vas Dias
  2016-04-12  8:01 ` Eric Wong
@ 2016-04-13 22:24 ` Jeff King
  2016-04-13 22:29 ` Jeff King
  2 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2016-04-13 22:24 UTC (permalink / raw)
  To: Jason Vas Dias; +Cc: git

On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:

> It appears GIT has no way of specifying a timeout for a clone operation -
> if the server decides not to complete a get request, the clone can
> hang forever -
> is this correct ?

Yes. Git's protocol has no timeouts, though each side is generally
either writing or reading at any moment, and so an interrupted
connection should cause either EPIPE or EOF, ending the process. The
exceptions I have seen are:

 - protocol / implementation bugs that cause a true deadlock. At this
   point we've fixed all known cases, but that doesn't mean there aren't
   bugs lurking.

 - the network drops out in such a way that the OS doesn't realize the
   connection is gone, and the reading side is left waiting for input
   forever

I think the TCP keepalive stuff that Eric mentioned should address the
latter, though I don't know how well it works in practice. We used to
sometimes see processes hung for days on GitHub, but it's been a long
time. I don't recall if it was pre-v1.8.5 (which introduced
SO_KEEPALIVE), or if we made some other change (we have a load-balancing
layer in front that has more aggressive timeouts).

> This appears to be what I am seeing, in a script that is attempting to do many
> successive clone operations, eg. of
> git://anongit.freedesktop.org/xorg/* , the script
> occasionally hangs in a clone - I can see with netstat + strace that the TCP
> connection is open and GIT is trying to read .
> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

There are periods where a git client may have to wait for a while in
read() while the other side is quiet (e.g., when the other side is badly
packed and needs to do a lot of up-front CPU work to prepare the
packfile). Since v1.8.4.2, the server side of a clone should generate
application-level keepalive packets, so that the client never sees
silence for more than ~5 seconds. The freedesktop servers appear to be
on v2.1.4, so a long read() as you're seeing probably is a real hang.

Note that pushing has a similar problem (the client may wait a long time
while the server chews on the uploaded packfile before reporting
status). There are no keepalives in that direction, though I have a
series there that I need to polish and submit.

-Peff


* Re: clone hang prevention / timeout?
  2016-04-11 21:49 clone hang prevention / timeout? Jason Vas Dias
  2016-04-12  8:01 ` Eric Wong
  2016-04-13 22:24 ` Jeff King
@ 2016-04-13 22:29 ` Jeff King
  2016-04-14 18:32   ` Jason Vas Dias
  2 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2016-04-13 22:29 UTC (permalink / raw)
  To: Jason Vas Dias; +Cc: git

On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:

> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

Oh, one other thing you might consider is something like "timeout" from
GNU coreutils, which puts a hard cap on the length of time a process can
run.
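
A minimal sketch of that approach, assuming GNU coreutils "timeout" is installed. The wrapper name run_capped and the $url variable are hypothetical; the only behaviour relied on is that timeout exits with status 124 when it had to kill the command.

```shell
#!/bin/sh
# Run a command under a hard wall-clock cap and report when the cap was hit.
run_capped() {
    cap_limit=$1
    shift
    cap_status=0
    timeout "$cap_limit" "$@" || cap_status=$?
    if [ "$cap_status" -eq 124 ]; then
        # 124 is what GNU timeout returns after killing the command.
        echo "gave up after $cap_limit seconds: $*" >&2
    fi
    return "$cap_status"
}

# Hypothetical usage: allow each clone at most one hour.
# run_capped 3600 git clone "$url"
```

Any other exit status is passed through unchanged, so a script can tell a timed-out clone from one that failed for another reason.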

It's totally unaware of the state of the process, though, so if you
really do have a clone which takes an hour, it might very well kill it
at 99% complete. It has no mechanism for "gee, this process looks like
it hasn't done anything for 5 minutes".

I don't know offhand of a general tool for that.
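
For what it's worth, a crude version of such a watchdog can be sketched in shell. This is only an illustration under a strong assumption: that "doing something" can be approximated by the command's output growing. The helper name run_idle_capped is made up, and a quiet log does not prove a hang, so the idle limit needs to be generous.

```shell
#!/bin/sh
# run_idle_capped IDLE_SECS LOGFILE CMD [ARGS...]
# Hypothetical helper: runs CMD with stdout and stderr sent to LOGFILE and
# sends it SIGTERM once LOGFILE has not grown for IDLE_SECS seconds.
# Returns CMD's exit status, or 124 if the watchdog had to kill it.
run_idle_capped() {
    ric_limit=$1
    ric_log=$2
    shift 2
    : >"$ric_log"
    "$@" >>"$ric_log" 2>&1 &
    ric_pid=$!
    ric_last=0
    ric_idle=0
    ric_killed=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$ric_pid" 2>/dev/null; do
        sleep 1
        ric_size=$(( $(wc -c <"$ric_log") ))
        if [ "$ric_size" -ne "$ric_last" ]; then
            # The log grew, so treat that as progress and reset the counter.
            ric_last=$ric_size
            ric_idle=0
        else
            ric_idle=$((ric_idle + 1))
        fi
        if [ "$ric_idle" -ge "$ric_limit" ]; then
            kill -TERM "$ric_pid" 2>/dev/null
            ric_killed=1
            break
        fi
    done
    ric_status=0
    wait "$ric_pid" || ric_status=$?
    if [ "$ric_killed" -eq 1 ]; then
        return 124
    fi
    return "$ric_status"
}

# Hypothetical usage ($url and $dir are placeholders); --progress makes git
# write progress to stderr even though stderr is not a terminal here:
# run_idle_capped 300 clone.log git clone --progress "$url" "$dir"
```

The return value 124 mirrors what "timeout" uses, purely so that callers can treat both wrappers the same way.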

-Peff


* Re: clone hang prevention / timeout?
  2016-04-13 22:29 ` Jeff King
@ 2016-04-14 18:32   ` Jason Vas Dias
  2016-04-30  9:04     ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Jason Vas Dias @ 2016-04-14 18:32 UTC (permalink / raw)
  To: Jeff King, Eric Wong; +Cc: git

Thanks very much Eric & Jeff for your reply .

Personally, I would recommend setting SO_RCVTIMEO for GIT server
sockets to a fixed default (eg. 5mins), settable by a
'--receive-timeout' argument or configuration parameter.

The problem I was trying to overcome was cloning all the repositories under
https://anongit.freedesktop.org/xorg/* .

About 4 git clones would succeed in succession, but then typically the 5th
would hang in read() forever - I left one such hung 'git clone' for nearly an
hour and it had not progressed or timed out . I tried inserting a delay of
up to 30 seconds between clones, but this did not help.

Maybe freedesktop.org's GIT server is too overloaded and they have
to resort to disabling 1 out of 5 GIT successive clone operations from
same connection or something.

Here is my solution, in case anyone else needs it :

<quote><pre>
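       eips=()
       counts=()
       declare -i failed=0
       # Remove any stale PID file so the wait loop below cannot read an old PID.
       rm -f /tmp/git.pid
       # Record the subshell's PID, then exec git so that git keeps that PID.
       { echo "$BASHPID" >/tmp/git.pid
         GIT_TRACE=2 exec git clone "${proto}://${user}anongit.freedesktop.org/${repo}${name}"; } &
       while [ ! -s /tmp/git.pid ]; do sleep 1; done
       git_pid="$(cat /tmp/git.pid)"
       # Poll once a second for as long as the main git process exists.
       while [ -d "/proc/$git_pid" ]; do
           IFS=$'\n'
           # One "PID EIP" entry per child of the main git process.
           declare -a kids=($(ps --ppid "$git_pid" -o 'pid=,eip='))
           unset IFS
           declare -i n_kids=${#kids[@]} kid_n
           for ((kid_n=0; kid_n < n_kids; kid_n+=1)); do
             # Split the entry into PID and instruction pointer.
             declare -a ke=(${kids[kid_n]})
             kid=${ke[0]}
             eip=${ke[1]}
             if [ ! -v "eips[$kid]" ]; then
                # First sighting of this child: remember its EIP.
                eips[$kid]="$eip"
             elif [ "${eips[$kid]}" = "$eip" ]; then
                # EIP unchanged since the last poll: count one more idle poll.
                if [ -z "${counts[$kid]}" ]; then
                   counts[$kid]=1
                else
                   counts[$kid]=$((${counts[$kid]}+1))
                   if (( ${counts[$kid]} >= 30 )); then
                      echo "child process $kid of git main process $git_pid appears to be stuck - killing it."
                      kill -TERM "$kid"
                      failed=1
                   fi
                fi
             else
                # EIP moved: the child made progress, so reset its counter.
                eips[$kid]="$eip"
                counts[$kid]=''
             fi
          done
          sleep 1
       done
       wait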
</pre></quote>

This is part of a script that reads a list of the Xorg projects,
sets $repo to top level subdirectory, and $name to the project name,
and initiates the GIT clone .
It deems any GIT _CHILD_ process (eg. git-index-pack) that has not
changed its instruction pointer register (EIP) for 30 seconds to be
"hung" .
There is logic at the end to retry all the failed clones.
It does work, but is far from pretty .
It sure would be nice if GIT had a timeout mechanism !

Thanks & Regards,
Jason






On 13/04/2016, Jeff King <peff@peff.net> wrote:
> On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:
>
>> Is there any option I can specify to get the clone to timeout, or do I
>> manually
>> have to strace the git process and send it a signal after a hang is
>> detected?
>
> Oh, one other thing you might consider is something like "timeout" from
> GNU coreutils, which puts a hard cap on the length of time a process can
> run.
>
> It's totally unaware of the state of the process, though, so if you
> really do have a clone which takes an hour, it might very well kill it
> at 99% complete. It has no mechanism for "gee, this process looks like
> it hasn't done anything for 5 minutes".
>
> I don't know offhand of a general tool for that.
>
> -Peff
>


* Re: clone hang prevention / timeout?
  2016-04-14 18:32   ` Jason Vas Dias
@ 2016-04-30  9:04     ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2016-04-30  9:04 UTC (permalink / raw)
  To: Jason Vas Dias; +Cc: Jeff King, git

Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> Thanks very much Eric & Jeff for your reply .
> 
> Personally, I would recommend setting SO_RCVTIMEO for GIT server
> sockets to a fixed default (eg. 5mins), settable by a
> '--receive-timeout' argument or configuration parameter.

(apologies for the delay, I thought I replied earlier :x)

SO_RCVTIMEO only triggers EAGAIN, and AFAIK the git read/write
wrappers are used to transparently retry on EAGAIN...  So it's
not so simple as doing a single setsockopt.

> The problem I was trying to overcome was cloning all the repositories under
> https://anongit.freedesktop.org/xorg/* .
> 
> About 4 git clones would succeed in succession, but then typically the 5th
> would hang in read() forever - I left one such hung 'git clone' for nearly an
> hour and it had not progressed or timed out . I tried inserting a delay of
> up to 30 seconds between clones, but this did not help.

Are you in contact with any of the admins of that server to
help?  Is the problematic repo any larger or in any way
stranger than the others?

> Maybe freedesktop.org's GIT server is too overloaded and they have
> to resort to disabling 1 out of 5 GIT successive clone operations from
> same connection or something.

Anyways I've been thinking about overloaded git servers, lately.
Pack generation on big repos is painful, and having lots of slow
clients can tie up server memory.  So maybe an HTTP server
which can switch between dumb and smart operation depending on
load could be useful for the resource-constrained.

> Here is my solution, in case anyone else needs it :

It'd be nice to get an strace to know where in the clone process
it hangs to help the admin figure out how far things got.

And please don't top-post, it's a waste of resources.

