git@vger.kernel.org mailing list mirror (one of many)
* Interrupted system call
       [not found] <14b3d372-f3fe-c06c-dd56-1d9799a12632@yahoo.de>
@ 2020-07-01  9:43 ` R. Diez
  2020-07-01 14:22   ` Santiago Torres Arias
  2020-07-01 16:21   ` Jeff King
  0 siblings, 2 replies; 7+ messages in thread
From: R. Diez @ 2020-07-01  9:43 UTC
  To: git

Hi all:

First of all, many thanks for Git.

After a 3-month pause, I recently updated my Ubuntu 18.04.4. I am using a PPA to keep Git more up to date, so I now have "git version 2.27.0".

I am now getting errors like these:

fatal: failed to read object cf965547a433493caa80e84d7a2b78b32a26ee35: Interrupted system call

error: unable to mmap /home/rdiez/[blah blah]/SrcRepo.git/objects/2e/f96ffba4c0d60f36c8779758f82752be380689: Interrupted system call

I am using a mount point for a network share. Keep in mind that Git thinks it is working on a local directory, so there should be no sockets 
or non-blocking I/O involved.

The problem is probably caused by using SMB to connect to an outdated Windows server. It has been working for years, but at some point in 
time it is bound to fail. The Linux kernel itself seems to introduce bugs in the SMB/CIFS code every now and then.

Nevertheless, I am surprised to get such an "Interrupted system call" from Git. A long time ago I learnt that it is OK for many syscalls to 
get interrupted, so you have to loop around them. See here for more information:

http://250bpm.com/blog:12

As a result, users should never actually get an "Interrupted system call" error from any software, at least when no sockets or non-blocking 
I/O is involved.

How can I pin-point this problem? I would like to know where Git is encountering this error, so that I can troubleshoot it, and maybe report 
yet another bug to the Linux SMB/CIFS maintainer.

Thanks in advance,
   rdiez


* Re: Interrupted system call
  2020-07-01  9:43 ` Interrupted system call R. Diez
@ 2020-07-01 14:22   ` Santiago Torres Arias
  2020-07-01 16:21   ` Jeff King
  1 sibling, 0 replies; 7+ messages in thread
From: Santiago Torres Arias @ 2020-07-01 14:22 UTC
  To: R. Diez; +Cc: git


Hi, 

> Nevertheless, I am surprised to get such an "Interrupted system call" from
> Git. A long time ago I learnt that it is OK for many syscalls to get
> interrupted, so you have to loop around them. See here for more information:
> 
> http://250bpm.com/blog:12
> 
> As a result, users should never actually get an "Interrupted system call"
> error from any software, at least when no sockets or non-blocking I/O is
> involved.

I'm not sure if you can blame git right away (it could be an underlying
library), and I'm also not convinced that "interrupted system call" is
an error that should never exist for users (error handling is generally
very nuanced).

I'd advise using GIT_TRACE_FSMONITOR or just GIT_TRACE to figure out
which component was the last one active before things failed. You can
read about these in the man page, in the "other" subsection of the
"ENVIRONMENT VARIABLES" section.

I hope this helps!
-Santiago



* Re: Interrupted system call
  2020-07-01  9:43 ` Interrupted system call R. Diez
  2020-07-01 14:22   ` Santiago Torres Arias
@ 2020-07-01 16:21   ` Jeff King
  2020-07-02  7:07     ` R. Diez
  2020-07-12  8:41     ` R. Diez
  1 sibling, 2 replies; 7+ messages in thread
From: Jeff King @ 2020-07-01 16:21 UTC
  To: R. Diez; +Cc: git

On Wed, Jul 01, 2020 at 11:43:15AM +0200, R. Diez wrote:

> After a 3-month pause, I recently updated my Ubuntu 18.04.4. I am
> using a PPA to keep Git more up to date, so I have now "git version
> 2.27.0".
> 
> I am now getting this kind of errors:
> 
> fatal: failed to read object cf965547a433493caa80e84d7a2b78b32a26ee35: Interrupted system call
> 
> error: unable to mmap /home/rdiez/[blah blah]/SrcRepo.git/objects/2e/f96ffba4c0d60f36c8779758f82752be380689: Interrupted system call
> 
> I am using a mount point for a network share. Keep in mind that Git thinks
> it is working on a local directory, so there should be no sockets or
> non-blocking I/O involved.

Looking at the code, that message is slightly deceptive. It's reporting
a failure from map_loose_object_1(), which calls both open() and mmap(),
as well as fstat().  It would be interesting to know which syscall is
actually failing. Running the failure case under "strace" would be
interesting (likewise to see which signal is causing the interruption).

> The problem is probably caused by using SMB to connect to an outdated
> Windows server. It has been working for years, but at some point in time it
> is bound to fail. The Linux kernel itself seems to introduce bugs in the
> SMB/CIFS code every now and then.
> 
> Nevertheless, I am surprised to get such an "Interrupted system call" from
> Git. A long time ago I learnt that it is OK for many syscalls to get
> interrupted, so you have to loop around them. See here for more information:

We do check for signals and re-start read() and write() calls as
appropriate. We don't for open(), and nobody has ever complained (though
it definitely is documented to result in EINTR, I'd imagine it's
relatively rare). I'm not excited about the prospect of adding retry
code to every open(), though perhaps doing it with our git_open()
wrapper would be sufficient (it's unclear how stdio fopen() behaves).

> How can I pin-point this problem? I would like to know where Git is
> encountering this error, so that I can troubleshoot it, and maybe report yet
> another bug to the Linux SMB/CIFS maintainer.

I think the first step is using strace to record the system call
returning EINTR (and the signal that interrupted it). I suspect it's in
open(), though, and probably not a bug: opening network files may take a
while and need to be interruptible.

-Peff


* Re: Interrupted system call
  2020-07-01 16:21   ` Jeff King
@ 2020-07-02  7:07     ` R. Diez
  2020-07-15  9:38       ` Jeff King
  2020-07-12  8:41     ` R. Diez
  1 sibling, 1 reply; 7+ messages in thread
From: R. Diez @ 2020-07-02  7:07 UTC
  To: Jeff King; +Cc: git, santiago


 > [...]
> It would be interesting to know which syscall is
> actually failing. Running the failure case under "strace" would be
> interesting (likewise to see which signal is causing the interruption).
 > [...]


First of all, thanks for your help.

GIT_TRACE alone does not tell me anything useful:

$ GIT_TRACE=true git fsck
07:58:47.229138 git.c:442               trace: built-in: git fsck
error: unable to mmap ./objects/cb/fec04963c1090535d2670b741912e17fd27b27: Interrupted system call
error: cbfec04963c1090535d2670b741912e17fd27b27: object corrupt or missing: ./objects/cb/fec04963c1090535d2670b741912e17fd27b27
Checking object directories: 100% (256/256), done.
Checking objects: 100% (70229/70229), done.
Checking connectivity: 75316, done.
missing commit cbfec04963c1090535d2670b741912e17fd27b27
dangling commit 6835e962b227e957520addbc5c28aedc97b253f3
dangling tree a9d1a1321066d8a8402f1c9e584675146d250952


Neither does GIT_TRACE_FSMONITOR:

$ GIT_TRACE_FSMONITOR=true git fsck 

error: unable to mmap ./objects/56/af267465e7cdb7ccebe8242e55c03d4b675684: Interrupted system call
error: 56af267465e7cdb7ccebe8242e55c03d4b675684: object corrupt or missing: ./objects/56/af267465e7cdb7ccebe8242e55c03d4b675684
Checking object directories: 100% (256/256), done.
Checking objects: 100% (70229/70229), done.
Checking connectivity: 75666, done.
missing tree 56af267465e7cdb7ccebe8242e55c03d4b675684

It is the same Git repository, so it looks like a different, random file fails every time.


I managed to make it fail once with:

   strace -f -- git fsck --progress

The signal involved is SIGALRM. I am guessing that Git is setting it up in order to display its progress messages. This is one of the few 
calls to rt_sigaction(SIGALRM):

rt_sigaction(SIGALRM, {sa_handler=0x556c8ac0fe80, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fbdca7da890}, NULL, 8) = 0
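
For readers unfamiliar with that rt_sigaction() line, a registration roughly like the following sketch would produce it. This is an illustration under assumptions, not Git's actual progress code: the names on_alarm() and start_progress_timer() and the interval parameter are hypothetical.

```c
#define _XOPEN_SOURCE 700	/* must precede all #includes */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Set by the handler; a real program would redraw its progress line. */
volatile sig_atomic_t progress_tick;

static void on_alarm(int signo)
{
	(void)signo;
	progress_tick = 1;
}

/*
 * Hypothetical helper: install a SIGALRM handler with SA_RESTART and arm
 * a periodic ITIMER_REAL timer that fires every interval_sec seconds.
 * Returns 0 on success, -1 on failure.
 */
int start_progress_timer(long interval_sec)
{
	struct sigaction sa;
	struct itimerval tv;

	assert(interval_sec > 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;	/* ask for interrupted syscalls to be restarted */
	if (sigaction(SIGALRM, &sa, NULL) < 0)
		return -1;

	memset(&tv, 0, sizeof(tv));
	tv.it_interval.tv_sec = interval_sec;
	tv.it_value = tv.it_interval;
	return setitimer(ITIMER_REAL, &tv, NULL);
}
```

With such a timer armed, any system call that happens to be blocked when the interval expires is a candidate for interruption, which matches the SIGALRM lines in the traces that follow.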


This is the first failure:

openat(AT_FDCWD, "./objects/11/a327f469cc40015d6d873f6eed328e977c4234", O_RDONLY|O_CLOEXEC) = -1 EINTR (Interrupted system call)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/locale-langpack/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "error: unable to mmap ./objects/"..., 99error: unable to mmap ./objects/11/a327f469cc40015d6d873f6eed328e977c4234: Interrupted system call
) = 99
write(2, "error: 11a327f469cc40015d6d873f6"..., 128error: 11a327f469cc40015d6d873f6eed328e977c4234: object corrupt or missing: ./objects/11/a327f469cc40015d6d873f6eed328e977c4234
) = 128


This is the second one:

openat(AT_FDCWD, "./objects/18/5b82729943708795b635899348ecca97aa7804", O_RDONLY|O_CLOEXEC) = -1 EINTR (Interrupted system call)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
write(2, "error: unable to mmap ./objects/"..., 99error: unable to mmap ./objects/18/5b82729943708795b635899348ecca97aa7804: Interrupted system call
) = 99
write(2, "error: 185b82729943708795b635899"..., 128error: 185b82729943708795b635899348ecca97aa7804: object corrupt or missing: ./objects/18/5b82729943708795b635899348ecca97aa7804
) = 128

There are a few more failures.

This is the last one. Afterwards, Git exited:

openat(AT_FDCWD, "./objects/f4/56439700761946c57ef467a8a125a80f0304bd", O_RDONLY|O_CLOEXEC) = -1 EINTR (Interrupted system call)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
openat(AT_FDCWD, "./objects/pack", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
brk(0x556c934af000)                     = 0x556c934af000
getdents(3, /* 19 entries */, 1048576)  = 1272
getdents(3, /* 0 entries */, 1048576)   = 0
close(3)                                = 0
write(2, "fatal: failed to read object f45"..., 95fatal: failed to read object f456439700761946c57ef467a8a125a80f0304bd: Interrupted system call
) = 95
exit_group(128)                         = ?
+++ exited with 128 +++


I am not an expert in Unix signals, but I'll do my best here.

I do not understand why Git is getting these interruptions due to SIGALRM, because SA_RESTART is in place.

Interestingly, the man page signal(7) does list open() under that flag, but not openat().

The description for open() under SA_RESTART is also interesting:

* open(2), if it can block (e.g., when opening a FIFO; see fifo(7)).

I am not sure that opening a regular disk file qualifies as "can block" under that definition, though.

Best regards,
   rdiez


* Re: Interrupted system call
  2020-07-01 16:21   ` Jeff King
  2020-07-02  7:07     ` R. Diez
@ 2020-07-12  8:41     ` R. Diez
  1 sibling, 0 replies; 7+ messages in thread
From: R. Diez @ 2020-07-12  8:41 UTC
  To: Jeff King; +Cc: git


> fatal: failed to read object cf965547a433493caa80e84d7a2b78b32a26ee35: Interrupted system call
 > [...]

In case anybody else has the same problem and finds this thread in the future, the workaround I am using is to disable progress messages.

For "git pull", "git gc" and "git push" the appropriate option is "--quiet", but for "git fsck" it is "--no-progress".

Best regards,
   rdiez


* Re: Interrupted system call
  2020-07-02  7:07     ` R. Diez
@ 2020-07-15  9:38       ` Jeff King
  2020-07-15 16:06         ` Chris Torek
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2020-07-15  9:38 UTC
  To: R. Diez; +Cc: git, santiago

On Thu, Jul 02, 2020 at 09:07:46AM +0200, R. Diez wrote:

> I managed to make it fail once with:
> 
>   strace -f -- git fsck --progress
> 
> The signal involved is SIGALRM. I am guessing that Git is setting it up in
> order to display its progress messages. This is one of the few calls to
> rt_sigaction(SIGALRM):
> 
> rt_sigaction(SIGALRM, {sa_handler=0x556c8ac0fe80, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fbdca7da890}, NULL, 8) = 0

That makes sense (and likewise your "--quiet" workaround seems
reasonable).

> I am not an expert in Unix signals, but I'll do my best here.
> 
> I do not understand why Git is getting these interruptions due to SIGALRM, because SA_RESTART is in place.
> 
> Interestingly, the man page signal(7) does list open() under that flag, but not openat().

Yes, though since open(2) says:

 The openat() system call operates in exactly the same way as open(),
 except for the differences described here.

I'd expect that would include any SA_RESTART handling. Peeking at the
Linux implementation in fs/open.c, it looks like both syscalls quickly
end up in the same do_sys_open().

> The description for open() under SA_RESTART is also interesting:
> 
> * open(2), if it can block (e.g., when opening a FIFO; see fifo(7)).
> 
> I am not sure that opening a normal disk file may qualify as "can block" with that definition though.

Delivering EINTR on a non-blocking call seems even more confusing,
though. I think the "if it can block" is just "you won't even get a
signal if it's not blocking".

This really _seems_ like a kernel bug, either:

  - openat() does not get the same SA_RESTART treatment as open(); or

  - open() on a network file can get EINTR even with SA_RESTART

But it's quite possible that I'm missing some corner case or historical
reason that it would need to behave the way you're seeing. It might be
worth reporting to kernel folks.
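
For such a report, a small reproducer outside Git may help. The following is only a sketch under assumptions: count_eintr_opens() and set_alarm_usec() are hypothetical names, the 1 ms interval is arbitrary, and whether it triggers anything depends on the kernel and the mount in question. It calls open(), which the C library on the reporter's system presumably issues as the openat() seen in the traces.

```c
#define _XOPEN_SOURCE 700	/* must precede all #includes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int signo)
{
	(void)signo;
}

/* Arm (usec > 0) or disarm (usec == 0) a periodic SIGALRM with SA_RESTART. */
static int set_alarm_usec(long usec)
{
	struct sigaction sa;
	struct itimerval tv;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	if (sigaction(SIGALRM, &sa, NULL) < 0)
		return -1;

	memset(&tv, 0, sizeof(tv));
	tv.it_interval.tv_usec = usec;
	tv.it_value.tv_usec = usec;
	return setitimer(ITIMER_REAL, &tv, NULL);
}

/*
 * Open and close `path` `iterations` times while SIGALRM fires every
 * millisecond. Returns how many attempts failed with EINTR, or -1 if
 * setup or any open() failed for another reason.
 */
long count_eintr_opens(const char *path, long iterations)
{
	long i, eintr = 0;

	assert(path != NULL);
	if (set_alarm_usec(1000) < 0)
		return -1;
	for (i = 0; i < iterations; i++) {
		int fd = open(path, O_RDONLY);

		if (fd >= 0) {
			close(fd);
		} else if (errno == EINTR) {
			eintr++;
		} else {
			eintr = -1;
			break;
		}
	}
	set_alarm_usec(0);	/* disarm the timer again */
	return eintr;
}
```

Running count_eintr_opens() against a file on the affected share and getting a non-zero result would show EINTR being delivered despite SA_RESTART, independently of Git.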

-Peff


* Re: Interrupted system call
  2020-07-15  9:38       ` Jeff King
@ 2020-07-15 16:06         ` Chris Torek
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Torek @ 2020-07-15 16:06 UTC
  To: Jeff King; +Cc: R. Diez, Git List, santiago

On Wed, Jul 15, 2020 at 2:45 AM Jeff King <peff@peff.net> wrote:
> On Thu, Jul 02, 2020 at 09:07:46AM +0200, R. Diez wrote:
> > I do not understand why Git is getting these interruptions due to SIGALRM, because SA_RESTART is in place.

It really shouldn't -- that's the whole point of SA_RESTART.

> Delivering EINTR on a non-blocking call seems even more confusing,
> though. I think the "if it can block" is just "you won't even get a
> signal if it's not blocking".
>
> This really _seems_ like a kernel bug, either:
>
>   - openat() does not get the same SA_RESTART treatment as open(); or
>
>   - open() on a network file can get EINTR even with SA_RESTART
>
> But it's quite possible that I'm missing some corner case or historical
> reason that it would need to behave the way you're seeing. It might be
> worth reporting to kernel folks.
>
> -Peff

Right.  This goes way back to pre-v7-Unix signals, as a sort of
side effect of the implementation.  In ancient times, the kernel
code for the internal wait-for-some-event took a priority number,
and anything below a cutoff value meant "not interrupted by
signals" while anything above it meant "interrupted by signals".
Disk operations were all at PRIBIO which was never interrupted.

This is all quite different in modern systems and hence it's all
adjustable, but in general we like to distinguish between
"operations that will definitely complete fairly quickly"
(normally not interrupted) and "operations that might take
significant amounts of time" (normally interrupted with the option
of restarting the system call).

*Restarting*, though, means exactly that: not *resuming*, but
*restarting*.  So whatever system call is to be interrupted by the
signal *must* be one that can simply be started over from the
beginning.  That means, for instance, that read() or write() can
only be restarted if no data have yet moved.  So if you're in a
read() on a device (e.g., serial port, or tape drive, or whatever)
and have gotten a few bytes, but not yet all you wanted, and then
the system call is to be interrupted by a signal, the read() must
return with a short count.
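
The short-count rule is why callers that need a fixed number of bytes usually loop. A minimal sketch, assuming a hypothetical helper named read_fully() (not a function taken from Git or any library):

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling read() until `len` bytes have
 * arrived, end-of-file is reached, or a real error occurs.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_fully(int fd, void *buf, size_t len)
{
	char *p = buf;
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		ssize_t n = read(fd, p + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data moved */
			return -1;
		}
		if (n == 0)
			break;			/* end of file */
		done += (size_t)n;		/* possibly a short count; go around again */
	}
	return (ssize_t)done;
}
```

The loop treats EINTR (nothing transferred, safe to start over) and a short count (some data transferred, continue from where it stopped) as two separate cases, which mirrors the restart-versus-resume distinction.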

An open() can be restarted on the assumption that no path names
have been changed.  That's not necessarily a good assumption,
but it's traditional.  The openat() can be restarted for the same
reason (and in fact correct use of openat() can protect against
some pathname issues).  It's up to the programmer to decide
whether to use SA_RESTART, and hence allow this, or not.

Chris
