git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* reg. fatal: The remote end hung up unexpectedly on NFS
@ 2018-04-06 18:25 Satya Prakash GS
  2018-04-06 18:46 ` Ævar Arnfjörð Bjarmason
  2018-04-06 19:48 ` Jeff King
  0 siblings, 2 replies; 4+ messages in thread
From: Satya Prakash GS @ 2018-04-06 18:25 UTC (permalink / raw)
  To: git

Hi,

We have a distributed filesystem with NFS access. On the NFS mount, I
was doing a git-clone and if NFS server crashed and came back up while
the clone is going on, clone fails with the below message:

git clone https://satgs@github.com/fs/private-qa.git

remote: Counting objects: 139419, done.
remote: Compressing objects: 100% (504/504), done.
Receiving objects:   7% (9760/139419), 5.32 MiB | 5.27 MiB/s
error: RPC failed; result=18, HTTP code = 200 MiB | 96.00 KiB/s
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

On NFS server crash, it usually takes a minute or two for our
filesystem to failover to new NFS server. Initially I suspected it had
something to do with the filesystem, like attributes of the file
written by git weren't matching what it was expecting but the same
test fails on open source NFS server. While clone is going on, if I
stopped the open source NFS server for 2 minutes and restarted it, git
clone fails.

Another interesting thing is, if the restart happens within a few
seconds, git clone succeeds.

Sideband_demux fails while trying to read from the pipe. Read size
doesn't match what is expected. If there are 2 parts to git clone
which is fetching data and writing to local filesystem, is this error
happening while trying to fetch ? Since it succeeds if the restart is
done immediately, has this got something to do with the protocol
timeouts.

Please advise on how to debug this further.

Thanks,
Satya.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: reg. fatal: The remote end hung up unexpectedly on NFS
  2018-04-06 18:25 reg. fatal: The remote end hung up unexpectedly on NFS Satya Prakash GS
@ 2018-04-06 18:46 ` Ævar Arnfjörð Bjarmason
  2018-04-06 19:48 ` Jeff King
  1 sibling, 0 replies; 4+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-04-06 18:46 UTC (permalink / raw)
  To: Satya Prakash GS; +Cc: git


On Fri, Apr 06 2018, Satya Prakash wrote:

> Hi,
>
> We have a distributed filesystem with NFS access. On the NFS mount, I
> was doing a git-clone and if NFS server crashed and came back up while
> the clone is going on, clone fails with the below message:
>
> git clone https://satgs@github.com/fs/private-qa.git
>
> remote: Counting objects: 139419, done.
> remote: Compressing objects: 100% (504/504), done.
> Receiving objects:   7% (9760/139419), 5.32 MiB | 5.27 MiB/s
> error: RPC failed; result=18, HTTP code = 200 MiB | 96.00 KiB/s
> fatal: The remote end hung up unexpectedly
> fatal: early EOF
> fatal: index-pack failed
>[...]
> Please advise on how to debug this further.

I don't know what this could be, but have you tried turning on the
various GIT_TRACE env variables document in "man git"? Some of those
might hold a clue to what's going on here.

Have you tried to do other git-independent tests to see if in general
the NFS failover isn't resulting in data being written as applications
expect? E.g. doing a recursive wget of some data with known SHA1-sums,
or doing a stess test of some sort on a filesystem that uses checksums
(such as btrfs), which could be loop-mounted to a file stored on NFS.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: reg. fatal: The remote end hung up unexpectedly on NFS
  2018-04-06 18:25 reg. fatal: The remote end hung up unexpectedly on NFS Satya Prakash GS
  2018-04-06 18:46 ` Ævar Arnfjörð Bjarmason
@ 2018-04-06 19:48 ` Jeff King
  2018-04-08 19:31   ` Satya Prakash GS
  1 sibling, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-04-06 19:48 UTC (permalink / raw)
  To: Satya Prakash GS; +Cc: git

On Fri, Apr 06, 2018 at 11:55:51PM +0530, Satya Prakash GS wrote:

> We have a distributed filesystem with NFS access. On the NFS mount, I
> was doing a git-clone and if NFS server crashed and came back up while
> the clone is going on, clone fails with the below message:
> 
> git clone https://satgs@github.com/fs/private-qa.git
> 
> remote: Counting objects: 139419, done.
> remote: Compressing objects: 100% (504/504), done.
> Receiving objects:   7% (9760/139419), 5.32 MiB | 5.27 MiB/s
> error: RPC failed; result=18, HTTP code = 200 MiB | 96.00 KiB/s
> fatal: The remote end hung up unexpectedly
> fatal: early EOF
> fatal: index-pack failed

Curl's result=18 is CURLE_PARTIAL_FILE. Usually that means the other
side hung up partway through. But given the NFS symptoms you describe, I
wonder if fwrite() to the file simply returned an error, and curl gave
up.

> On NFS server crash, it usually takes a minute or two for our
> filesystem to failover to new NFS server. Initially I suspected it had
> something to do with the filesystem, like attributes of the file
> written by git weren't matching what it was expecting but the same
> test fails on open source NFS server. While clone is going on, if I
> stopped the open source NFS server for 2 minutes and restarted it, git
> clone fails.
> 
> Another interesting thing is, if the restart happens within a few
> seconds, git clone succeeds.
> 
> Sideband_demux fails while trying to read from the pipe. Read size
> doesn't match what is expected. If there are 2 parts to git clone
> which is fetching data and writing to local filesystem, is this error
> happening while trying to fetch ? Since it succeeds if the restart is
> done immediately, has this got something to do with the protocol
> timeouts.
> 
> Please advise on how to debug this further.

If you're on Linux, strace could show you the write error. Unfortunately
it's a little tricky because the http bits happen in a sub-process. But
try:

  cat >/in/your/$PATH/git-remote-strace <<\EOF
  #!/bin/sh
  protocol=$(echo "$2" | cut -d: -f1)
  exec strace -f -o /tmp/foo.out git-remote-$protocol "$@"
  EOF
  chmod +x /in/your/$PATH/git-remote-strace

  git clone strace::https://github.com/whatever

My guess is you may find a failed `write()` in there.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: reg. fatal: The remote end hung up unexpectedly on NFS
  2018-04-06 19:48 ` Jeff King
@ 2018-04-08 19:31   ` Satya Prakash GS
  0 siblings, 0 replies; 4+ messages in thread
From: Satya Prakash GS @ 2018-04-08 19:31 UTC (permalink / raw)
  To: Jeff King, avarab; +Cc: git

Avarab,

I enabled GIT_TRACE, packet, etc. I could not make much sense of the
output. I downloaded git source and started looking at the code.
Yeah, kernel compilation with failovers worked on our filesystem. We
tried xfs test suite with failovers, it worked. Since it's failing
with open source NFS, I am wondering if this even has anything to do
with filesystem.

Jeff,

Thank you for the clear instructions on strace.

Looks like 5 and 6 are two ends of a pipe.

read(5, "\27\3\3\5r", 5)          = 5
read(5, "S\10n\230i\257\27\2\333&\335jou~\3036}S\273\212\250\33\310\216b\253\3505\332\266X"...,
1394) = 1394
write(6, "\2530\253x\27\247iz\246\321*\v\33\306.\221I\36\4\304D\\\360\361\306I\213\304\245m\247\30"...,
1370) = 1370
<... read resumed>
"\2530\253x\27\247iz\246\321*\v\33\306.\221I\36\4\304D\\\360\361\306I\213\304\245m\247\30"...,
5071) = 1370
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890},  <unfinished ...>
read(0,  <unfinished ...>
<... rt_sigaction resumed> NULL, 8) = 0
poll([{fd=5, events=POLLIN}], 1, 1000) = 1 ([{fd=5, revents=POLLIN}])
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890}, NULL, 8) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1
([{fd=5, revents=POLLIN|POLLRDNORM}])
read(5, "\27\3\3\5r", 5)          = 5
read(5, "S\10n\230i\257\27\3]T\0{\232\253\377$\265\277\211o.\314T\315\257\335j\322\367\31o\262"...,
1394) = 1394

>>>>>>>>>>>>> Write preceding the last read has succeeded
write(6, "\237%\204W$\236\177\305@\210+\236\227.\316\226\250\t\256:\27?};\270^A\317\204\222\35]"...,
1370) = 1370
>>>>>>>>

<... read resumed>
"\237%\204W$\236\177\305@\210+\236\227.\316\226\250\t\256:\27?};\270^A\317\204\222\35]"...,
3701) = 1370
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890},  <unfinished ...>
read(0,  <unfinished ...>
<... rt_sigaction resumed> NULL, 8) = 0
poll([{fd=5, events=POLLIN}], 1, 1000) = 1 ([{fd=5, revents=POLLIN}])
rt_sigaction(SIGPIPE, NULL, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890}, 8) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890}, NULL, 8) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1
([{fd=5, revents=POLLIN|POLLRDNORM}])
read(5, "\27\3\3\5r", 5)          = 5

>>>>> Problematic reads
read(5, "S\10n\230i\257\27\4\220\243\375\3772\213?\267\356V\246r\226`\223\253^\310\314\207\222\22%\376"...,
1394) = 363
read(5, "", 1031)                 = 0
>>>>>>>>>>>>>>>

write(5, "\25\3\3\0\32\265(Nk5\330Y\4!\374S\204\377\304\0166j\rV\305;e3\347&w",
31) = 31
close(5)

rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {0x441ed0, [PIPE], SA_RESTORER|SA_RESTART,
0x7f038e8bb890}, NULL, 8) = 0
write(2, "error: RPC failed; result=18, HT"..., 46) = 46
brk(0x1597000)                    = 0x1597000
close(6 <unfinished ...>
<... read resumed> "", 2331)      = 0
<... close resumed> )             = 0
read(7, "0000", 4096)             = 4
write(2, "fatal: The remote end hung up un"..., 43 <unfinished ...>
read(7,  <unfinished ...>
<... write resumed> )             = 43
close(4 <unfinished ...>
<... read resumed> "", 4096)      = 0
<... close resumed> )

Last read from pipe (fd 5) was a partial read. Write has succeeded on
the pipe but read failed, that's strange. In this case, it doesn't
even look like the write was interleaved.

Thanks,
Satya.

On Sat, Apr 7, 2018 at 1:18 AM, Jeff King <peff@peff.net> wrote:
> On Fri, Apr 06, 2018 at 11:55:51PM +0530, Satya Prakash GS wrote:
>
>> We have a distributed filesystem with NFS access. On the NFS mount, I
>> was doing a git-clone and if NFS server crashed and came back up while
>> the clone is going on, clone fails with the below message:
>>
>> git clone https://satgs@github.com/fs/private-qa.git
>>
>> remote: Counting objects: 139419, done.
>> remote: Compressing objects: 100% (504/504), done.
>> Receiving objects:   7% (9760/139419), 5.32 MiB | 5.27 MiB/s
>> error: RPC failed; result=18, HTTP code = 200 MiB | 96.00 KiB/s
>> fatal: The remote end hung up unexpectedly
>> fatal: early EOF
>> fatal: index-pack failed
>
> Curl's result=18 is CURLE_PARTIAL_FILE. Usually that means the other
> side hung up partway through. But given the NFS symptoms you describe, I
> wonder if fwrite() to the file simply returned an error, and curl gave
> up.
>
>> On NFS server crash, it usually takes a minute or two for our
>> filesystem to failover to new NFS server. Initially I suspected it had
>> something to do with the filesystem, like attributes of the file
>> written by git weren't matching what it was expecting but the same
>> test fails on open source NFS server. While clone is going on, if I
>> stopped the open source NFS server for 2 minutes and restarted it, git
>> clone fails.
>>
>> Another interesting thing is, if the restart happens within a few
>> seconds, git clone succeeds.
>>
>> Sideband_demux fails while trying to read from the pipe. Read size
>> doesn't match what is expected. If there are 2 parts to git clone
>> which is fetching data and writing to local filesystem, is this error
>> happening while trying to fetch ? Since it succeeds if the restart is
>> done immediately, has this got something to do with the protocol
>> timeouts.
>>
>> Please advise on how to debug this further.
>
> If you're on Linux, strace could show you the write error. Unfortunately
> it's a little tricky because the http bits happen in a sub-process. But
> try:
>
>   cat >/in/your/$PATH/git-remote-strace <<\EOF
>   #!/bin/sh
>   protocol=$(echo "$2" | cut -d: -f1)
>   exec strace -f -o /tmp/foo.out git-remote-$protocol "$@"
>   EOF
>   chmod +x /in/your/$PATH/git-remote-strace
>
>   git clone strace::https://github.com/whatever
>
> My guess is you may find a failed `write()` in there.
>
> -Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-04-08 19:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-06 18:25 reg. fatal: The remote end hung up unexpectedly on NFS Satya Prakash GS
2018-04-06 18:46 ` Ævar Arnfjörð Bjarmason
2018-04-06 19:48 ` Jeff King
2018-04-08 19:31   ` Satya Prakash GS

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).