git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] tmp-objdir: do not opendir() when handling a signal
@ 2022-09-26 23:53 John Cai via GitGitGadget
  2022-09-27  0:18 ` Taylor Blau
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: John Cai via GitGitGadget @ 2022-09-26 23:53 UTC (permalink / raw)
  To: git; +Cc: John Cai, John Cai

From: John Cai <johncai86@gmail.com>

In the tmp-objdir api, tmp_objdir_create will create a temporary
directory but also register signal handlers responsible for removing
the directory's contents and the directory itself. However, the
function responsible for recursively removing the contents and
directory, remove_dir_recurse() calls opendir(3) and closedir(3).
This can be problematic because these functions allocate and free
memory, which are not async-signal-safe functions. This can lead to
deadlocks.

One place we call tmp_objdir_create() is in git-receive-pack, where
we create a temporary quarantine directory "incoming". Incoming
objects will be written to this directory before they get moved to
the object directory.

We have observed this code leading to a deadlock:

	Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)):
	#0  __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80
		<main_arena>) at ./lowlevellock.c:35
	#1  0x00007f621baa635b in __GI___libc_malloc
		(bytes=bytes@entry=32816) at malloc.c:3064
	#2  0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60,
		flags=0, close_fd=true, fd=5)
		at ../sysdeps/posix/opendir.c:118
	#3  opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69
	#4  __opendir (name=<optimized out>)
		at ../sysdeps/posix/opendir.c:92
	#5  0x0000557c19c77de1 in remove_dir_recurse ()
	#6  0x0000557c19d81a4f in remove_tmp_objdir_on_signal ()
	#7  <signal handler called>
	#8  _int_malloc (av=av@entry=0x7f621bbf8b80 <main_arena>,
		bytes=bytes@entry=7160) at malloc.c:4116
	#9  0x00007f621baa62c9 in __GI___libc_malloc (bytes=7160)
		at malloc.c:3066
	#10 0x00007f621bd1e987 in inflateInit2_ ()
		from /opt/gitlab/embedded/lib/libz.so.1
	#11 0x0000557c19dbe5f4 in git_inflate_init ()
	#12 0x0000557c19cee02a in unpack_compressed_entry ()
	#13 0x0000557c19cf08cb in unpack_entry ()
	#14 0x0000557c19cf0f32 in packed_object_info ()
	#15 0x0000557c19cd68cd in do_oid_object_info_extended ()
	#16 0x0000557c19cd6e2b in read_object_file_extended ()
	#17 0x0000557c19cdec2f in parse_object ()
	#18 0x0000557c19c34977 in lookup_commit_reference_gently ()
	#19 0x0000557c19d69309 in mark_uninteresting ()
	#20 0x0000557c19d2d180 in do_for_each_repo_ref_iterator ()
	#21 0x0000557c19d21678 in for_each_ref ()
	#22 0x0000557c19d6a94f in assign_shallow_commits_to_refs ()
	#23 0x0000557c19bc02b2 in cmd_receive_pack ()
	#24 0x0000557c19b29fdd in handle_builtin ()
	#25 0x0000557c19b2a526 in cmd_main ()
	#26 0x0000557c19b28ea2 in main ()

To prevent this, add a flag REMOVE_DIR_SIGNAL that allows
remove_dir_recurse() to know that a signal is being handled and avoid
calling opendir(3). One implication of this change is that when
signal handling, the temporary directory may not get cleaned up
properly.

Signed-off-by: John Cai <johncai86@gmail.com>
---
    tmp-objdir: do not closedir() when handling a signal
    
    We have recently observed a Git process hanging around for weeks. A
    backtrace revealed that a git-receive-pack(1) process was deadlocked
    when trying to remove the quarantine directory "incoming." It turns out
    that the tmp_objdir API calls opendir(3) and closedir(3) to observe a
    directory's contents in order to remove all the contents before removing
    the directory itself. These functions are not async signal save as they
    allocate and free memory.
    
    The fix is to avoid calling these functions when handling a signal in
    order to avoid a deadlock. The implication of such a fix however, is
    that temporary object directories may not get cleaned up properly when a
    signal is being handled. The tradeoff this fix is making is to prevent
    deadlocks at the cost of temporary object directory cleanup.
    
    This is similar to 58d4d7f1c5 (2022-01-07 fetch: fix deadlock when
    cleaning up lockfiles in async signals)

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1348%2Fjohn-cai%2Fjc%2Ffix-tmp-objdir-signal-handling-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1348/john-cai/jc/fix-tmp-objdir-signal-handling-v1
Pull-Request: https://github.com/git/git/pull/1348

 dir.c        | 7 +++++--
 dir.h        | 3 +++
 tmp-objdir.c | 7 ++++++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/dir.c b/dir.c
index 75429508200..a3183cb043f 100644
--- a/dir.c
+++ b/dir.c
@@ -3244,7 +3244,7 @@ void strip_dir_trailing_slashes(char *dir)
 
 static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
 {
-	DIR *dir;
+	DIR *dir = NULL;
 	struct dirent *e;
 	int ret = 0, original_len = path->len, len, kept_down = 0;
 	int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY);
@@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
 	}
 
 	flag &= ~REMOVE_DIR_KEEP_TOPLEVEL;
-	dir = opendir(path->buf);
+
+	if ((flag & REMOVE_DIR_SIGNAL) == 0)
+		dir = opendir(path->buf);
+
 	if (!dir) {
 		if (errno == ENOENT)
 			return keep_toplevel ? -1 : 0;
diff --git a/dir.h b/dir.h
index 674747d93af..ba159f4abeb 100644
--- a/dir.h
+++ b/dir.h
@@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl);
 /* Remove the_original_cwd too */
 #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08
 
+/* Indicates a signal is being handled */
+#define REMOVE_DIR_SIGNAL 0x16
+
 /*
  * Remove path and its contents, recursively. flags is a combination
  * of the above REMOVE_DIR_* constants. Return 0 on success.
diff --git a/tmp-objdir.c b/tmp-objdir.c
index adf6033549e..13943098738 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -34,6 +34,7 @@ static void tmp_objdir_free(struct tmp_objdir *t)
 static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 {
 	int err;
+	int flags = 0;
 
 	if (!t)
 		return 0;
@@ -49,7 +50,11 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 	 * have pre-grown t->path sufficiently so that this
 	 * doesn't happen in practice.
 	 */
-	err = remove_dir_recursively(&t->path, 0);
+
+	if (on_signal)
+		flags = flags | REMOVE_DIR_SIGNAL;
+
+	err = remove_dir_recursively(&t->path, flags);
 
 	/*
 	 * When we are cleaning up due to a signal, we won't bother

base-commit: 4fd6c5e44459e6444c2cd93383660134c95aabd1
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
@ 2022-09-27  0:18 ` Taylor Blau
  2022-09-27 11:48   ` Jeff King
  2022-09-27  1:39 ` Junio C Hamano
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Taylor Blau @ 2022-09-27  0:18 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, John Cai, Jeff King

[+cc Peff as the author of tmp-objdir]

On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote:
> One place we call tmp_objdir_create() is in git-receive-pack, where
> we create a temporary quarantine directory "incoming". Incoming
> objects will be written to this directory before they get moved to
> the object directory.

Right, calling opendir() will allocate memory, so we'll get stuck in a
deadlock if the signal arrives while libc's allocator lock is held. So
we can't safely call opendir() there.

It does make me a little uneasy leaving the quarantine directory around
via this path. So I wonder if we should be optimistically opening up the
DIR handle? Calling unlink() in a signal is perfectly fine, so I'd think
as long as we have an open DIR handle we could call readdir_r(), but I
don't think we've discussed it before.

> @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>  	}
>
>  	flag &= ~REMOVE_DIR_KEEP_TOPLEVEL;
> -	dir = opendir(path->buf);
> +
> +	if ((flag & REMOVE_DIR_SIGNAL) == 0)

Comparing to the zero value is discouraged. Consider:

    if (!(flag & REMOVE_DIR_SIGNAL))

instead.


> @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl);
>  /* Remove the_original_cwd too */
>  #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08
>
> +/* Indicates a signal is being handled */
> +#define REMOVE_DIR_SIGNAL 0x16
> +

Perhaps REMOVE_DIR_IN_SIGNAL would be slightly more descriptive.

> @@ -49,7 +50,11 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
>  	 * have pre-grown t->path sufficiently so that this
>  	 * doesn't happen in practice.
>  	 */
> -	err = remove_dir_recursively(&t->path, 0);
> +
> +	if (on_signal)
> +		flags = flags | REMOVE_DIR_SIGNAL;

I'm nitpicking, but you could just write "flags |= REMOVE_DIR_SIGNAL",
or even:

    err = remove_dir_recursively(&t->path,
                                 on_signal ? REMOVE_DIR_SIGNAL : 0);

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
  2022-09-27  0:18 ` Taylor Blau
@ 2022-09-27  1:39 ` Junio C Hamano
  2022-09-27  9:18 ` Phillip Wood
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2022-09-27  1:39 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, John Cai

"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:

>     tmp-objdir: do not closedir() when handling a signal
>     
>     We have recently observed a Git process hanging around for weeks. A
>     backtrace revealed that a git-receive-pack(1) process was deadlocked
>     when trying to remove the quarantine directory "incoming." It turns out
>     that the tmp_objdir API calls opendir(3) and closedir(3) to observe a
>     directory's contents in order to remove all the contents before removing
>     the directory itself. These functions are not async signal save as they
>     allocate and free memory.
>     
>     The fix is to avoid calling these functions when handling a signal in
>     order to avoid a deadlock. The implication of such a fix however, is
>     that temporary object directories may not get cleaned up properly when a
>     signal is being handled. The tradeoff this fix is making is to prevent
>     deadlocks at the cost of temporary object directory cleanup.
>     
>     This is similar to 58d4d7f1c5 (2022-01-07 fetch: fix deadlock when
>     cleaning up lockfiles in async signals)

Hmph, is it really similar?  That one, even though the lockfiles
won't be cleaned up inside signal handler, they will eventually be
cleaned, won't they?  As opposed to here, once we punt, we punt and
do not revisit when we re-raise and eventually exit, no?

Leaving temporary directories behind is MUCH MUCH better than
getting stuck in a deadlock, so it is much better than the status
quo, of course.

>  static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>  {
> -	DIR *dir;
> +	DIR *dir = NULL;
>  	struct dirent *e;
>  	int ret = 0, original_len = path->len, len, kept_down = 0;
>  	int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY);
> @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>  	}
>  
>  	flag &= ~REMOVE_DIR_KEEP_TOPLEVEL;
> -	dir = opendir(path->buf);
> +
> +	if ((flag & REMOVE_DIR_SIGNAL) == 0)
> +		dir = opendir(path->buf);
> +
>  	if (!dir) {
>  		if (errno == ENOENT)
>  			return keep_toplevel ? -1 : 0;
> diff --git a/dir.h b/dir.h
> index 674747d93af..ba159f4abeb 100644
> --- a/dir.h
> +++ b/dir.h
> @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl);
>  /* Remove the_original_cwd too */
>  #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08
>  
> +/* Indicates a signal is being handled */
> +#define REMOVE_DIR_SIGNAL 0x16
> +
>  /*
>   * Remove path and its contents, recursively. flags is a combination
>   * of the above REMOVE_DIR_* constants. Return 0 on success.
> diff --git a/tmp-objdir.c b/tmp-objdir.c

The fix looks quite straight-forward.

Thanks for spotting and working on this issue.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
  2022-09-27  0:18 ` Taylor Blau
  2022-09-27  1:39 ` Junio C Hamano
@ 2022-09-27  9:18 ` Phillip Wood
  2022-09-27 11:44 ` Jeff King
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Phillip Wood @ 2022-09-27  9:18 UTC (permalink / raw)
  To: John Cai via GitGitGadget, git; +Cc: John Cai

Hi John

On 27/09/2022 00:53, John Cai via GitGitGadget wrote:
> From: John Cai <johncai86@gmail.com>
> 
> In the tmp-objdir api, tmp_objdir_create will create a temporary
> directory but also register signal handlers responsible for removing
> the directory's contents and the directory itself. However, the
> function responsible for recursively removing the contents and
> directory, remove_dir_recurse() calls opendir(3) and closedir(3).
> This can be problematic because these functions allocate and free
> memory, which are not async-signal-safe functions. This can lead to
> deadlocks.

> --- a/dir.h
> +++ b/dir.h
> @@ -498,6 +498,9 @@ int get_sparse_checkout_patterns(struct pattern_list *pl);
>   /* Remove the_original_cwd too */
>   #define REMOVE_DIR_PURGE_ORIGINAL_CWD 0x08
>   
> +/* Indicates a signal is being handled */
> +#define REMOVE_DIR_SIGNAL 0x16

This is setting the bits for REMOVE_DIR_KEEP_NESTED_GIT and 
REMOVE_DIR_KEEP_TOPLEVEL is that intentional? (it looks like you've 
doubled 8 to 16 to get the next free bit but used a hex constant, the 
earlier constants use decimal)

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-09-27  9:18 ` Phillip Wood
@ 2022-09-27 11:44 ` Jeff King
  2022-09-27 13:50   ` John Cai
  2022-09-27 16:50   ` Junio C Hamano
  2022-09-27 19:19 ` [PATCH v2] tmp-objdir: skip clean up " John Cai via GitGitGadget
  2022-10-20 11:58 ` Another possible instance of async-signal-safe opendir path callstack? (Was: [PATCH] tmp-objdir: do not opendir() when handling a signal) Jan Pokorný
  5 siblings, 2 replies; 18+ messages in thread
From: Jeff King @ 2022-09-27 11:44 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, John Cai

On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote:

> One place we call tmp_objdir_create() is in git-receive-pack, where
> we create a temporary quarantine directory "incoming". Incoming
> objects will be written to this directory before they get moved to
> the object directory.
> 
> We have observed this code leading to a deadlock:
> 
> 	Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)):
> 	#0  __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80
> 		<main_arena>) at ./lowlevellock.c:35
> 	#1  0x00007f621baa635b in __GI___libc_malloc
> 		(bytes=bytes@entry=32816) at malloc.c:3064
> 	#2  0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60,
> 		flags=0, close_fd=true, fd=5)
> 		at ../sysdeps/posix/opendir.c:118
> 	#3  opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69
> 	#4  __opendir (name=<optimized out>)
> 		at ../sysdeps/posix/opendir.c:92
> 	#5  0x0000557c19c77de1 in remove_dir_recurse ()
> 	#6  0x0000557c19d81a4f in remove_tmp_objdir_on_signal ()
> 	#7  <signal handler called>

Yuck. Your analysis looks right, and certainly opendir() can't really
work without allocating memory for the pointer-to-DIR.

> To prevent this, add a flag REMOVE_DIR_SIGNAL that allows
> remove_dir_recurse() to know that a signal is being handled and avoid
> calling opendir(3). One implication of this change is that when
> signal handling, the temporary directory may not get cleaned up
> properly.

It's not really "may not" here, is it? It will never get cleaned up on a
signal now. I don't think remove_dir_recursively() will try to rmdir()
in this case. But even if it did, we'll always have a "pack"
subdirectory (minus the small race before we've created it).

That's unfortunate, but I don't think we really have a portable
alternative. We can't keep an exact list of files to be deleted, because
some of them will be created by sub-processes. We could perhaps exec a
helper that does the deletion, but that seems like a race and
portability nightmare. On Linux, we could probably use open() and
getdents64() to traverse, but obviously that won't work everywhere. It
_might_ be worth having some kind of compat/ wrapper here, to let
supported systems do as good a job as they can. But it's not like there
aren't already cases where we might leave the tmp-objdir directory
around (say, SIGKILL), so this is really just extending the existing
problem to more signals.

I was going to suggest we should do a better job of cleaning up these
directories via git-gc. But it looks like b3cecf49ea (tmp-objdir: new
API for creating temporary writable databases, 2021-12-06) changed the
default name such that a regular git-gc should do so. So I think we're
covered there.

The main case we care about is normal exit when index-pack or a hook
sees an error, in which case we should still be cleaning up via the
atexit() handler.

So I think your patch is going in the right direction, but...

>  static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>  {
> -	DIR *dir;
> +	DIR *dir = NULL;
>  	struct dirent *e;
>  	int ret = 0, original_len = path->len, len, kept_down = 0;
>  	int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY);
> @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>  	}
>  
>  	flag &= ~REMOVE_DIR_KEEP_TOPLEVEL;
> -	dir = opendir(path->buf);
> +
> +	if ((flag & REMOVE_DIR_SIGNAL) == 0)
> +		dir = opendir(path->buf);
> +
>  	if (!dir) {
>  		if (errno == ENOENT)
>  			return keep_toplevel ? -1 : 0;

We skip calling opendir() entirely, so "dir" will still be NULL. But we
immediately start looking at errno, which will have some undefined value
(based on some previous syscall).

If we set "errno" to "EACCES" in this case, then we'd at least hit the
rmdir() below:

         if (!dir) {
                  if (errno == ENOENT)
                          return keep_toplevel ? -1 : 0;
                  else if (errno == EACCES && !keep_toplevel)
                          /*
                           * An empty dir could be removable even if it
                           * is unreadable:
                           */
                          return rmdir(path->buf);
                  else
                          return -1;
          }

but we know it won't really do anything for our proposed caller, since
it will have files inside the directory that need to be removed before
rmdir() can work.

Moreover, if you were to combine REMOVE_DIR_SIGNAL with
REMOVE_DIR_KEEP_NESTED_GIT, I suspect that the call to
resolve_gitlink_ref() would end up with similar deadlocks. Obviously
nobody is proposing to do that, but it is a pitfall in the API.

So all of that makes me think we should not add a new flag here, but
instead just avoid calling the function entirely from
tmp_objdir_destroy_1().

But then we can observe that tmp_objdir_destroy_1() is basically doing
nothing if on_signal is set. So there is really no point in setting up
the signal handler at all. We should just set up the atexit() handler.
I.e., something like:

diff --git a/tmp-objdir.c b/tmp-objdir.c
index a8be92bca1..10549e95db 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	the_tmp_objdir = t;
 	if (!installed_handlers) {
 		atexit(remove_tmp_objdir);
-		sigchain_push_common(remove_tmp_objdir_on_signal);
 		installed_handlers++;
 	}
 

with the commit message explaining that we can't do the cleanup in a
portable and signal-safe way, so we just punt on the whole concept.

There's also some minor cleanup we could do elsewhere to drop the
"on_signal" argument (which can come as part of the same patch, or on
top).

-Peff

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-27  0:18 ` Taylor Blau
@ 2022-09-27 11:48   ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2022-09-27 11:48 UTC (permalink / raw)
  To: Taylor Blau; +Cc: John Cai via GitGitGadget, git, John Cai

On Mon, Sep 26, 2022 at 08:18:47PM -0400, Taylor Blau wrote:

> It does make me a little uneasy leaving the quarantine directory around
> via this path. So I wonder if we should be optimistically opening up the
> DIR handle? Calling unlink() in a signal is perfectly fine, so I'd think
> as long as we have an open DIR handle we could call readdir_r(), but I
> don't think we've discussed it before.

You'd need to hold multiple such DIRs, since the removal is recursive.
It's easy-ish for the "pack" directory, but the sub-process index-pack
may have created 00-ff directories, too. You'd have to pre-create and
opendir() all of them.

I'm also not sure what timing guarantees we have. If I opendir() a
directory, then wait a long time while somebody else creates entries, is
a readdir() guaranteed to see those new entries?

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-27 11:44 ` Jeff King
@ 2022-09-27 13:50   ` John Cai
  2022-09-27 19:03     ` Jeff King
  2022-09-27 16:50   ` Junio C Hamano
  1 sibling, 1 reply; 18+ messages in thread
From: John Cai @ 2022-09-27 13:50 UTC (permalink / raw)
  To: Jeff King; +Cc: John Cai via GitGitGadget, git

Hey Peff,

On 27 Sep 2022, at 7:44, Jeff King wrote:

> On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote:
>
>> One place we call tmp_objdir_create() is in git-receive-pack, where
>> we create a temporary quarantine directory "incoming". Incoming
>> objects will be written to this directory before they get moved to
>> the object directory.
>>
>> We have observed this code leading to a deadlock:
>>
>> 	Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)):
>> 	#0  __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80
>> 		<main_arena>) at ./lowlevellock.c:35
>> 	#1  0x00007f621baa635b in __GI___libc_malloc
>> 		(bytes=bytes@entry=32816) at malloc.c:3064
>> 	#2  0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60,
>> 		flags=0, close_fd=true, fd=5)
>> 		at ../sysdeps/posix/opendir.c:118
>> 	#3  opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69
>> 	#4  __opendir (name=<optimized out>)
>> 		at ../sysdeps/posix/opendir.c:92
>> 	#5  0x0000557c19c77de1 in remove_dir_recurse ()
>> 	#6  0x0000557c19d81a4f in remove_tmp_objdir_on_signal ()
>> 	#7  <signal handler called>
>
> Yuck. Your analysis looks right, and certainly opendir() can't really
> work without allocating memory for the pointer-to-DIR.
>
>> To prevent this, add a flag REMOVE_DIR_SIGNAL that allows
>> remove_dir_recurse() to know that a signal is being handled and avoid
>> calling opendir(3). One implication of this change is that when
>> signal handling, the temporary directory may not get cleaned up
>> properly.
>
> It's not really "may not" here, is it? It will never get cleaned up on a
> signal now. I don't think remove_dir_recursively() will try to rmdir()
> in this case. But even if it did, we'll always have a "pack"
> subdirectory (minus the small race before we've created it).
>
> That's unfortunate, but I don't think we really have a portable
> alternative. We can't keep an exact list of files to be deleted, because
> some of them will be created by sub-processes. We could perhaps exec a
> helper that does the deletion, but that seems like a race and
> portability nightmare. On Linux, we could probably use open() and
> getdents64() to traverse, but obviously that won't work everywhere. It
> _might_ be worth having some kind of compat/ wrapper here, to let
> supported systems do as good a job as they can. But it's not like there
> aren't already cases where we might leave the tmp-objdir directory
> around (say, SIGKILL), so this is really just extending the existing
> problem to more signals.
>
> I was going to suggest we should do a better job of cleaning up these
> directories via git-gc. But it looks like b3cecf49ea (tmp-objdir: new
> API for creating temporary writable databases, 2021-12-06) changed the
> default name such that a regular git-gc should do so. So I think we're
> covered there.

I was wondering about this as well. Thanks for checking on this--that's
reassuring.
>
> The main case we care about is normal exit when index-pack or a hook
> sees an error, in which case we should still be cleaning up via the
> atexit() handler.
>
> So I think your patch is going in the right direction, but...
>
>>  static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>>  {
>> -	DIR *dir;
>> +	DIR *dir = NULL;
>>  	struct dirent *e;
>>  	int ret = 0, original_len = path->len, len, kept_down = 0;
>>  	int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY);
>> @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up)
>>  	}
>>
>>  	flag &= ~REMOVE_DIR_KEEP_TOPLEVEL;
>> -	dir = opendir(path->buf);
>> +
>> +	if ((flag & REMOVE_DIR_SIGNAL) == 0)
>> +		dir = opendir(path->buf);
>> +
>>  	if (!dir) {
>>  		if (errno == ENOENT)
>>  			return keep_toplevel ? -1 : 0;
>
> We skip calling opendir() entirely, so "dir" will still be NULL. But we
> immediately start looking at errno, which will have some undefined value
> (based on some previous syscall).
>
> If we set "errno" to "EACCES" in this case, then we'd at least hit the
> rmdir() below:
>
>          if (!dir) {
>                   if (errno == ENOENT)
>                           return keep_toplevel ? -1 : 0;
>                   else if (errno == EACCES && !keep_toplevel)
>                           /*
>                            * An empty dir could be removable even if it
>                            * is unreadable:
>                            */
>                           return rmdir(path->buf);
>                   else
>                           return -1;
>           }
>
> but we know it won't really do anything for our proposed caller, since
> it will have files inside the directory that need to be removed before
> rmdir() can work.

yeah, I suppose the only case it would help is if the directory is already
empty.

>
> Moreover, if you were to combine REMOVE_DIR_SIGNAL with
> REMOVE_DIR_KEEP_NESTED_GIT, I suspect that the call to
> resolve_gitlink_ref() would end up with similar deadlocks. Obviously
> nobody is proposing to do that, but it is a pitfall in the API.
>
> So all of that makes me think we should not add a new flag here, but
> instead just avoid calling the function entirely from
> tmp_objdir_destroy_1().
>
> But then we can observe that tmp_objdir_destroy_1() is basically doing
> nothing if on_signal is set. So there is really no point in setting up
> the signal handler at all. We should just set up the atexit() handler.
> I.e., something like:
>
> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index a8be92bca1..10549e95db 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
>  	the_tmp_objdir = t;
>  	if (!installed_handlers) {
>  		atexit(remove_tmp_objdir);
> -		sigchain_push_common(remove_tmp_objdir_on_signal);
>  		installed_handlers++;
>  	}

This makes sense and is a clean solution. I guess the only case where we would
benefit in calling into remove_tmp_objdir_on_signal() is if the temp dir exists
but is empty. I'm not sure how often this would happen to make it worth it?
Probably not...

>
>
> with the commit message explaining that we can't do the cleanup in a
> portable and signal-safe way, so we just punt on the whole concept.
>
> There's also some minor cleanup we could do elsewhere to drop the
> "on_signal" argument (which can come as part of the same patch, or on
> top).
>
> -Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-27 11:44 ` Jeff King
  2022-09-27 13:50   ` John Cai
@ 2022-09-27 16:50   ` Junio C Hamano
  1 sibling, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2022-09-27 16:50 UTC (permalink / raw)
  To: Jeff King; +Cc: John Cai via GitGitGadget, git, John Cai

Jeff King <peff@peff.net> writes:

> So all of that makes me think we should not add a new flag here, but
> instead just avoid calling the function entirely from
> tmp_objdir_destroy_1().

Thanks.  I missed that undefined access to errno that breaks the
intention of the patch.

> But then we can observe that tmp_objdir_destroy_1() is basically doing
> nothing if on_signal is set. So there is really no point in setting up
> the signal handler at all. We should just set up the atexit() handler.
> I.e., something like:
>
> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index a8be92bca1..10549e95db 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
>  	the_tmp_objdir = t;
>  	if (!installed_handlers) {
>  		atexit(remove_tmp_objdir);
> -		sigchain_push_common(remove_tmp_objdir_on_signal);
>  		installed_handlers++;
>  	}
>  
>
> with the commit message explaining that we can't do the cleanup in a
> portable and signal-safe way, so we just punt on the whole concept.
>
> There's also some minor cleanup we could do elsewhere to drop the
> "on_signal" argument (which can come as part of the same patch, or on
> top).

;-)  I like the simplification.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmp-objdir: do not opendir() when handling a signal
  2022-09-27 13:50   ` John Cai
@ 2022-09-27 19:03     ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2022-09-27 19:03 UTC (permalink / raw)
  To: John Cai; +Cc: John Cai via GitGitGadget, git

On Tue, Sep 27, 2022 at 09:50:43AM -0400, John Cai wrote:

> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> > index a8be92bca1..10549e95db 100644
> > --- a/tmp-objdir.c
> > +++ b/tmp-objdir.c
> > @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
> >  	the_tmp_objdir = t;
> >  	if (!installed_handlers) {
> >  		atexit(remove_tmp_objdir);
> > -		sigchain_push_common(remove_tmp_objdir_on_signal);
> >  		installed_handlers++;
> >  	}
> 
> This makes sense and is a clean solution. I guess the only case where we would
> benefit in calling into remove_tmp_objdir_on_signal() is if the temp dir exists
> but is empty. I'm not sure how often this would happen to make it worth it?
> Probably not...

It's more or less never, as the first thing we do after creating it is
call tmp_objdir_setup(), which creates the pack subdirectory. We _could_
try removing them both manually, like:

diff --git a/tmp-objdir.c b/tmp-objdir.c
index adf6033549..509eb58363 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -49,7 +49,17 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 	 * have pre-grown t->path sufficiently so that this
 	 * doesn't happen in practice.
 	 */
-	err = remove_dir_recursively(&t->path, 0);
+	if (!on_signal)
+		err = remove_dir_recursively(&t->path, 0);
+	else {
+		size_t orig_len = t->path.len;
+
+		strbuf_addstr(&t->path, "/pack");
+		rmdir(t->path.buf);
+		strbuf_setlen(&t->path, orig_len);
+
+		rmdir(t->path.buf);
+	}
 
 	/*
 	 * When we are cleaning up due to a signal, we won't bother

but even that would rarely do anything useful. As soon as the child
index-pack receives even a single byte, we'll have an actual pack
tmpfile with an unknown name, which will cause rmdir() to fail (ditto
for unpack-objects, though it would probably write as soon as we've
received a whole single object).

If we can't enumerate the list of objects via readdir() or similar, I
think there's really not much we can do.

-Peff

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2] tmp-objdir: skip clean up when handling a signal
  2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-09-27 11:44 ` Jeff King
@ 2022-09-27 19:19 ` John Cai via GitGitGadget
  2022-09-27 19:38   ` Jeff King
  2022-09-28 14:55   ` [PATCH v3] " John Cai via GitGitGadget
  2022-10-20 11:58 ` Another possible instance of async-signal-safe opendir path callstack? (Was: [PATCH] tmp-objdir: do not opendir() when handling a signal) Jan Pokorný
  5 siblings, 2 replies; 18+ messages in thread
From: John Cai via GitGitGadget @ 2022-09-27 19:19 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood, Jeff King, John Cai, John Cai

From: John Cai <johncai86@gmail.com>

In the tmp-objdir api, tmp_objdir_create will create a temporary
directory but also register signal handlers responsible for removing
the directory's contents and the directory itself. However, the
function responsible for recursively removing the contents and
directory, remove_dir_recurse() calls opendir(3) and closedir(3).
This can be problematic because these functions allocate and free
memory, which are not async-signal-safe functions. This can lead to
deadlocks.

One place we call tmp_objdir_create() is in git-receive-pack, where
we create a temporary quarantine directory "incoming". Incoming
objects will be written to this directory before they get moved to
the object directory.

We have observed this code leading to a deadlock:

	Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)):
	#0  __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80
		<main_arena>) at ./lowlevellock.c:35
	#1  0x00007f621baa635b in __GI___libc_malloc
		(bytes=bytes@entry=32816) at malloc.c:3064
	#2  0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60,
		flags=0, close_fd=true, fd=5)
		at ../sysdeps/posix/opendir.c:118
	#3  opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69
	#4  __opendir (name=<optimized out>)
		at ../sysdeps/posix/opendir.c:92
	#5  0x0000557c19c77de1 in remove_dir_recurse ()
	git#6  0x0000557c19d81a4f in remove_tmp_objdir_on_signal ()
	#7  <signal handler called>
	git#8  _int_malloc (av=av@entry=0x7f621bbf8b80 <main_arena>,
		bytes=bytes@entry=7160) at malloc.c:4116
	git#9  0x00007f621baa62c9 in __GI___libc_malloc (bytes=7160)
		at malloc.c:3066
	git#10 0x00007f621bd1e987 in inflateInit2_ ()
		from /opt/gitlab/embedded/lib/libz.so.1
	git#11 0x0000557c19dbe5f4 in git_inflate_init ()
	git#12 0x0000557c19cee02a in unpack_compressed_entry ()
	git#13 0x0000557c19cf08cb in unpack_entry ()
	git#14 0x0000557c19cf0f32 in packed_object_info ()
	git#15 0x0000557c19cd68cd in do_oid_object_info_extended ()
	git#16 0x0000557c19cd6e2b in read_object_file_extended ()
	git#17 0x0000557c19cdec2f in parse_object ()
	git#18 0x0000557c19c34977 in lookup_commit_reference_gently ()
	git#19 0x0000557c19d69309 in mark_uninteresting ()
	git#20 0x0000557c19d2d180 in do_for_each_repo_ref_iterator ()
	git#21 0x0000557c19d21678 in for_each_ref ()
	git#22 0x0000557c19d6a94f in assign_shallow_commits_to_refs ()
	git#23 0x0000557c19bc02b2 in cmd_receive_pack ()
	git#24 0x0000557c19b29fdd in handle_builtin ()
	git#25 0x0000557c19b2a526 in cmd_main ()
	git#26 0x0000557c19b28ea2 in main ()

Since we can't do the cleanup in a portable and signal-safe way, skip
the cleanup when we're handling a signal.

This means that when signal handling, the temporary directory may not
get cleaned up properly. This is mitigated by b3cecf49ea (tmp-objdir: new
API for creating temporary writable databases, 2021-12-06) which changed
the default name and allows gc to clean up these temporary directories.

Signed-off-by: John Cai <johncai86@gmail.com>
---
    tmp-objdir: do not closedir() when handling a signal
    
    We have recently observed a Git process hanging around for weeks. A
    backtrace revealed that a git-receive-pack(1) process was deadlocked
    when trying to remove the quarantine directory "incoming." It turns out
    that the tmp_objdir API calls opendir(3) and closedir(3) to observe a
    directory's contents in order to remove all the contents before removing
    the directory itself. These functions are not async signal save as they
    allocate and free memory.
    
    The fix is to avoid calling these functions when handling a signal in
    order to avoid a deadlock. The implication of such a fix however, is
    that temporary object directories may not get cleaned up properly when a
    signal is being handled. The tradeoff this fix is making is to prevent
    deadlocks at the cost of temporary object directory cleanup.
    
    This is similar to 58d4d7f1c5 (2022-01-07 fetch: fix deadlock when
    cleaning up lockfiles in async signals) The difference being in this
    case, the cleanup of the files will not happen when handling a signal.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1348%2Fjohn-cai%2Fjc%2Ffix-tmp-objdir-signal-handling-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1348/john-cai/jc/fix-tmp-objdir-signal-handling-v2
Pull-Request: https://github.com/git/git/pull/1348

Range-diff vs v1:

 1:  78ee805c5aa < -:  ----------- tmp-objdir: do not opendir() when handling a signal
 -:  ----------- > 1:  ae58df8c938 tmp-objdir: skip clean up when handling a signal


 tmp-objdir.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/tmp-objdir.c b/tmp-objdir.c
index adf6033549e..5d5f15f6d76 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -31,7 +31,7 @@ static void tmp_objdir_free(struct tmp_objdir *t)
 	free(t);
 }
 
-static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
+static int tmp_objdir_destroy_1(struct tmp_objdir *t)
 {
 	int err;
 
@@ -41,7 +41,7 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 	if (t == the_tmp_objdir)
 		the_tmp_objdir = NULL;
 
-	if (!on_signal && t->prev_odb)
+	if (t->prev_odb)
 		restore_primary_odb(t->prev_odb, t->path.buf);
 
 	/*
@@ -51,20 +51,14 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 	 */
 	err = remove_dir_recursively(&t->path, 0);
 
-	/*
-	 * When we are cleaning up due to a signal, we won't bother
-	 * freeing memory; it may cause a deadlock if the signal
-	 * arrived while libc's allocator lock is held.
-	 */
-	if (!on_signal)
-		tmp_objdir_free(t);
+	tmp_objdir_free(t);
 
 	return err;
 }
 
 int tmp_objdir_destroy(struct tmp_objdir *t)
 {
-	return tmp_objdir_destroy_1(t, 0);
+	return tmp_objdir_destroy_1(t);
 }
 
 static void remove_tmp_objdir(void)
@@ -72,13 +66,6 @@ static void remove_tmp_objdir(void)
 	tmp_objdir_destroy(the_tmp_objdir);
 }
 
-static void remove_tmp_objdir_on_signal(int signo)
-{
-	tmp_objdir_destroy_1(the_tmp_objdir, 1);
-	sigchain_pop(signo);
-	raise(signo);
-}
-
 void tmp_objdir_discard_objects(struct tmp_objdir *t)
 {
 	remove_dir_recursively(&t->path, REMOVE_DIR_KEEP_TOPLEVEL);
@@ -169,7 +156,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	the_tmp_objdir = t;
 	if (!installed_handlers) {
 		atexit(remove_tmp_objdir);
-		sigchain_push_common(remove_tmp_objdir_on_signal);
 		installed_handlers++;
 	}
 

base-commit: 4fd6c5e44459e6444c2cd93383660134c95aabd1
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tmp-objdir: skip clean up when handling a signal
  2022-09-27 19:19 ` [PATCH v2] tmp-objdir: skip clean up " John Cai via GitGitGadget
@ 2022-09-27 19:38   ` Jeff King
  2022-09-27 20:00     ` Jeff King
  2022-09-28 14:55   ` [PATCH v3] " John Cai via GitGitGadget
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff King @ 2022-09-27 19:38 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, Taylor Blau, Phillip Wood, John Cai

On Tue, Sep 27, 2022 at 07:19:01PM +0000, John Cai via GitGitGadget wrote:

> Since we can't do the cleanup in a portable and signal-safe way, skip
> the cleanup when we're handling a signal.

Thanks, this looks fine to me, though I think there are a few extra
cleanup opportunities that could be squashed in:

diff --git a/tmp-objdir.c b/tmp-objdir.c
index 5d5f15f6d7..2fb0ec8317 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -18,7 +18,7 @@ struct tmp_objdir {
 
 /*
  * Allow only one tmp_objdir at a time in a running process, which simplifies
- * our signal/atexit cleanup routines.  It's doubtful callers will ever need
+ * our atexit cleanup routine.  It's doubtful callers will ever need
  * more than one, and we can expand later if so.  You can have many such
  * tmp_objdirs simultaneously in many processes, of course.
  */
@@ -31,7 +31,7 @@ static void tmp_objdir_free(struct tmp_objdir *t)
 	free(t);
 }
 
-static int tmp_objdir_destroy_1(struct tmp_objdir *t)
+static int tmp_objdir_destroy(struct tmp_objdir *t)
 {
 	int err;
 
@@ -44,23 +44,13 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t)
 	if (t->prev_odb)
 		restore_primary_odb(t->prev_odb, t->path.buf);
 
-	/*
-	 * This may use malloc via strbuf_grow(), but we should
-	 * have pre-grown t->path sufficiently so that this
-	 * doesn't happen in practice.
-	 */
 	err = remove_dir_recursively(&t->path, 0);
 
 	tmp_objdir_free(t);
 
 	return err;
 }
 
-int tmp_objdir_destroy(struct tmp_objdir *t)
-{
-	return tmp_objdir_destroy_1(t);
-}
-
 static void remove_tmp_objdir(void)
 {
 	tmp_objdir_destroy(the_tmp_objdir);
@@ -139,14 +129,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	 */
 	strbuf_addf(&t->path, "%s/tmp_objdir-%s-XXXXXX", get_object_directory(), prefix);
 
-	/*
-	 * Grow the strbuf beyond any filename we expect to be placed in it.
-	 * If tmp_objdir_destroy() is called by a signal handler, then
-	 * we should be able to use the strbuf to remove files without
-	 * having to call malloc.
-	 */
-	strbuf_grow(&t->path, 1024);
-
 	if (!mkdtemp(t->path.buf)) {
 		/* free, not destroy, as we never touched the filesystem */
 		tmp_objdir_free(t);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] tmp-objdir: skip clean up when handling a signal
  2022-09-27 19:38   ` Jeff King
@ 2022-09-27 20:00     ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2022-09-27 20:00 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, Taylor Blau, Phillip Wood, John Cai

On Tue, Sep 27, 2022 at 03:38:26PM -0400, Jeff King wrote:

> > Since we can't do the cleanup in a portable and signal-safe way, skip
> > the cleanup when we're handling a signal.
> 
> Thanks, this looks fine to me, though I think there are a few extra
> cleanup opportunities that could be squashed in:

> -static int tmp_objdir_destroy_1(struct tmp_objdir *t)
> +static int tmp_objdir_destroy(struct tmp_objdir *t)

Whoops, you'd obviously want to drop the "static" here, as well.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3] tmp-objdir: skip clean up when handling a signal
  2022-09-27 19:19 ` [PATCH v2] tmp-objdir: skip clean up " John Cai via GitGitGadget
  2022-09-27 19:38   ` Jeff King
@ 2022-09-28 14:55   ` John Cai via GitGitGadget
  2022-09-28 15:38     ` Ævar Arnfjörð Bjarmason
  2022-09-30 20:47     ` [PATCH v4] " John Cai via GitGitGadget
  1 sibling, 2 replies; 18+ messages in thread
From: John Cai via GitGitGadget @ 2022-09-28 14:55 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Phillip Wood, Jeff King, John Cai, John Cai

From: John Cai <johncai86@gmail.com>

In the tmp-objdir api, tmp_objdir_create will create a temporary
directory but also register signal handlers responsible for removing
the directory's contents and the directory itself. However, the
function responsible for recursively removing the contents and
directory, remove_dir_recurse() calls opendir(3) and closedir(3).
This can be problematic because these functions allocate and free
memory, which are not async-signal-safe functions. This can lead to
deadlocks.

One place we call tmp_objdir_create() is in git-receive-pack, where
we create a temporary quarantine directory "incoming". Incoming
objects will be written to this directory before they get moved to
the object directory.

We have observed this code leading to a deadlock:

	Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)):
	#0  __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80
		<main_arena>) at ./lowlevellock.c:35
	#1  0x00007f621baa635b in __GI___libc_malloc
		(bytes=bytes@entry=32816) at malloc.c:3064
	#2  0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60,
		flags=0, close_fd=true, fd=5)
		at ../sysdeps/posix/opendir.c:118
	#3  opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69
	#4  __opendir (name=<optimized out>)
		at ../sysdeps/posix/opendir.c:92
	#5  0x0000557c19c77de1 in remove_dir_recurse ()
	git#6  0x0000557c19d81a4f in remove_tmp_objdir_on_signal ()
	#7  <signal handler called>
	git#8  _int_malloc (av=av@entry=0x7f621bbf8b80 <main_arena>,
		bytes=bytes@entry=7160) at malloc.c:4116
	git#9  0x00007f621baa62c9 in __GI___libc_malloc (bytes=7160)
		at malloc.c:3066
	git#10 0x00007f621bd1e987 in inflateInit2_ ()
		from /opt/gitlab/embedded/lib/libz.so.1
	git#11 0x0000557c19dbe5f4 in git_inflate_init ()
	git#12 0x0000557c19cee02a in unpack_compressed_entry ()
	git#13 0x0000557c19cf08cb in unpack_entry ()
	git#14 0x0000557c19cf0f32 in packed_object_info ()
	git#15 0x0000557c19cd68cd in do_oid_object_info_extended ()
	git#16 0x0000557c19cd6e2b in read_object_file_extended ()
	git#17 0x0000557c19cdec2f in parse_object ()
	git#18 0x0000557c19c34977 in lookup_commit_reference_gently ()
	git#19 0x0000557c19d69309 in mark_uninteresting ()
	git#20 0x0000557c19d2d180 in do_for_each_repo_ref_iterator ()
	git#21 0x0000557c19d21678 in for_each_ref ()
	git#22 0x0000557c19d6a94f in assign_shallow_commits_to_refs ()
	git#23 0x0000557c19bc02b2 in cmd_receive_pack ()
	git#24 0x0000557c19b29fdd in handle_builtin ()
	git#25 0x0000557c19b2a526 in cmd_main ()
	git#26 0x0000557c19b28ea2 in main ()

Since we can't do the cleanup in a portable and signal-safe way, skip
the cleanup when we're handling a signal.

This means that when signal handling, the temporary directory may not
get cleaned up properly. This is mitigated by b3cecf49ea (tmp-objdir: new
API for creating temporary writable databases, 2021-12-06) which changed
the default name and allows gc to clean up these temporary directories.

Signed-off-by: John Cai <johncai86@gmail.com>
---
    tmp-objdir: do not closedir() when handling a signal
    
    We have recently observed a Git process hanging around for weeks. A
    backtrace revealed that a git-receive-pack(1) process was deadlocked
    when trying to remove the quarantine directory "incoming." It turns out
    that the tmp_objdir API calls opendir(3) and closedir(3) to observe a
    directory's contents in order to remove all the contents before removing
    the directory itself. These functions are not async signal save as they
    allocate and free memory.
    
    The fix is to avoid calling these functions when handling a signal in
    order to avoid a deadlock. The implication of such a fix however, is
    that temporary object directories may not get cleaned up properly when a
    signal is being handled. The tradeoff this fix is making is to prevent
    deadlocks at the cost of temporary object directory cleanup.
    
    This is similar to 58d4d7f1c5 (2022-01-07 fetch: fix deadlock when
    cleaning up lockfiles in async signals) The difference being in this
    case, the cleanup of the files will not happen when handling a signal.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1348%2Fjohn-cai%2Fjc%2Ffix-tmp-objdir-signal-handling-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1348/john-cai/jc/fix-tmp-objdir-signal-handling-v3
Pull-Request: https://github.com/git/git/pull/1348

Range-diff vs v2:

 1:  ae58df8c938 ! 1:  d96c44e309f tmp-objdir: skip clean up when handling a signal
     @@ Commit message
          Signed-off-by: John Cai <johncai86@gmail.com>
      
       ## tmp-objdir.c ##
     +@@ tmp-objdir.c: struct tmp_objdir {
     + 
     + /*
     +  * Allow only one tmp_objdir at a time in a running process, which simplifies
     +- * our signal/atexit cleanup routines.  It's doubtful callers will ever need
     ++ * our atexit cleanup routines.  It's doubtful callers will ever need
     +  * more than one, and we can expand later if so.  You can have many such
     +  * tmp_objdirs simultaneously in many processes, of course.
     +  */
      @@ tmp-objdir.c: static void tmp_objdir_free(struct tmp_objdir *t)
       	free(t);
       }
       
      -static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
     -+static int tmp_objdir_destroy_1(struct tmp_objdir *t)
     ++int tmp_objdir_destroy(struct tmp_objdir *t)
       {
       	int err;
       
     @@ tmp-objdir.c: static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signa
      +	if (t->prev_odb)
       		restore_primary_odb(t->prev_odb, t->path.buf);
       
     - 	/*
     -@@ tmp-objdir.c: static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
     - 	 */
     +-	/*
     +-	 * This may use malloc via strbuf_grow(), but we should
     +-	 * have pre-grown t->path sufficiently so that this
     +-	 * doesn't happen in practice.
     +-	 */
       	err = remove_dir_recursively(&t->path, 0);
       
      -	/*
     @@ tmp-objdir.c: static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signa
       	return err;
       }
       
     - int tmp_objdir_destroy(struct tmp_objdir *t)
     - {
     +-int tmp_objdir_destroy(struct tmp_objdir *t)
     +-{
      -	return tmp_objdir_destroy_1(t, 0);
     -+	return tmp_objdir_destroy_1(t);
     - }
     - 
     +-}
     +-
       static void remove_tmp_objdir(void)
     -@@ tmp-objdir.c: static void remove_tmp_objdir(void)
     + {
       	tmp_objdir_destroy(the_tmp_objdir);
       }
       
     @@ tmp-objdir.c: static void remove_tmp_objdir(void)
       void tmp_objdir_discard_objects(struct tmp_objdir *t)
       {
       	remove_dir_recursively(&t->path, REMOVE_DIR_KEEP_TOPLEVEL);
     +@@ tmp-objdir.c: struct tmp_objdir *tmp_objdir_create(const char *prefix)
     + 	 */
     + 	strbuf_addf(&t->path, "%s/tmp_objdir-%s-XXXXXX", get_object_directory(), prefix);
     + 
     +-	/*
     +-	 * Grow the strbuf beyond any filename we expect to be placed in it.
     +-	 * If tmp_objdir_destroy() is called by a signal handler, then
     +-	 * we should be able to use the strbuf to remove files without
     +-	 * having to call malloc.
     +-	 */
     +-	strbuf_grow(&t->path, 1024);
     +-
     + 	if (!mkdtemp(t->path.buf)) {
     + 		/* free, not destroy, as we never touched the filesystem */
     + 		tmp_objdir_free(t);
      @@ tmp-objdir.c: struct tmp_objdir *tmp_objdir_create(const char *prefix)
       	the_tmp_objdir = t;
       	if (!installed_handlers) {


 tmp-objdir.c | 40 ++++------------------------------------
 1 file changed, 4 insertions(+), 36 deletions(-)

diff --git a/tmp-objdir.c b/tmp-objdir.c
index adf6033549e..2a2012eb6d0 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -18,7 +18,7 @@ struct tmp_objdir {
 
 /*
  * Allow only one tmp_objdir at a time in a running process, which simplifies
- * our signal/atexit cleanup routines.  It's doubtful callers will ever need
+ * our atexit cleanup routines.  It's doubtful callers will ever need
  * more than one, and we can expand later if so.  You can have many such
  * tmp_objdirs simultaneously in many processes, of course.
  */
@@ -31,7 +31,7 @@ static void tmp_objdir_free(struct tmp_objdir *t)
 	free(t);
 }
 
-static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
+int tmp_objdir_destroy(struct tmp_objdir *t)
 {
 	int err;
 
@@ -41,44 +41,21 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 	if (t == the_tmp_objdir)
 		the_tmp_objdir = NULL;
 
-	if (!on_signal && t->prev_odb)
+	if (t->prev_odb)
 		restore_primary_odb(t->prev_odb, t->path.buf);
 
-	/*
-	 * This may use malloc via strbuf_grow(), but we should
-	 * have pre-grown t->path sufficiently so that this
-	 * doesn't happen in practice.
-	 */
 	err = remove_dir_recursively(&t->path, 0);
 
-	/*
-	 * When we are cleaning up due to a signal, we won't bother
-	 * freeing memory; it may cause a deadlock if the signal
-	 * arrived while libc's allocator lock is held.
-	 */
-	if (!on_signal)
-		tmp_objdir_free(t);
+	tmp_objdir_free(t);
 
 	return err;
 }
 
-int tmp_objdir_destroy(struct tmp_objdir *t)
-{
-	return tmp_objdir_destroy_1(t, 0);
-}
-
 static void remove_tmp_objdir(void)
 {
 	tmp_objdir_destroy(the_tmp_objdir);
 }
 
-static void remove_tmp_objdir_on_signal(int signo)
-{
-	tmp_objdir_destroy_1(the_tmp_objdir, 1);
-	sigchain_pop(signo);
-	raise(signo);
-}
-
 void tmp_objdir_discard_objects(struct tmp_objdir *t)
 {
 	remove_dir_recursively(&t->path, REMOVE_DIR_KEEP_TOPLEVEL);
@@ -152,14 +129,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	 */
 	strbuf_addf(&t->path, "%s/tmp_objdir-%s-XXXXXX", get_object_directory(), prefix);
 
-	/*
-	 * Grow the strbuf beyond any filename we expect to be placed in it.
-	 * If tmp_objdir_destroy() is called by a signal handler, then
-	 * we should be able to use the strbuf to remove files without
-	 * having to call malloc.
-	 */
-	strbuf_grow(&t->path, 1024);
-
 	if (!mkdtemp(t->path.buf)) {
 		/* free, not destroy, as we never touched the filesystem */
 		tmp_objdir_free(t);
@@ -169,7 +138,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	the_tmp_objdir = t;
 	if (!installed_handlers) {
 		atexit(remove_tmp_objdir);
-		sigchain_push_common(remove_tmp_objdir_on_signal);
 		installed_handlers++;
 	}
 

base-commit: 4fd6c5e44459e6444c2cd93383660134c95aabd1
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] tmp-objdir: skip clean up when handling a signal
  2022-09-28 14:55   ` [PATCH v3] " John Cai via GitGitGadget
@ 2022-09-28 15:38     ` Ævar Arnfjörð Bjarmason
  2022-09-30 20:47     ` [PATCH v4] " John Cai via GitGitGadget
  1 sibling, 0 replies; 18+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-09-28 15:38 UTC (permalink / raw)
  To: John Cai via GitGitGadget
  Cc: git, Taylor Blau, Phillip Wood, Jeff King, John Cai


On Wed, Sep 28 2022, John Cai via GitGitGadget wrote:

> From: John Cai <johncai86@gmail.com>

This looks good! Just a clarifying question/comment:

> Since we can't do the cleanup in a portable and signal-safe way, skip
> the cleanup when we're handling a signal.
>
> This means that when signal handling, the temporary directory may not
> get cleaned up properly. This is mitigated by b3cecf49ea (tmp-objdir: new
> API for creating temporary writable databases, 2021-12-06) which changed
> the default name and allows gc to clean up these temporary directories.

I think this still doesn't cover the common case of the "atexit" handler
saving the day, as Jeff King pointed out in v1:

	https://lore.kernel.org/git/YzLiI1HZeBszsIJq@coredump.intra.peff.net/

I think it's fine to have this proceed as-is (although if you're doing a
v4 anyway, maybe we want to update tho commit message). I.e. is it
correct that we'll now only skip this if we get a signal, *and* it's a
fatal signal, or we otherwise die/exit/abort/whatever before we can get
to our atexit() handler?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4] tmp-objdir: skip clean up when handling a signal
  2022-09-28 14:55   ` [PATCH v3] " John Cai via GitGitGadget
  2022-09-28 15:38     ` Ævar Arnfjörð Bjarmason
@ 2022-09-30 20:47     ` John Cai via GitGitGadget
  2022-10-03  8:52       ` Jeff King
  1 sibling, 1 reply; 18+ messages in thread
From: John Cai via GitGitGadget @ 2022-09-30 20:47 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Phillip Wood, Jeff King,
	Ævar Arnfjörð Bjarmason, John Cai, John Cai

From: John Cai <johncai86@gmail.com>

In the tmp-objdir api, tmp_objdir_create will create a temporary
directory but also register signal handlers responsible for removing
the directory's contents and the directory itself. However, the
function responsible for recursively removing the contents and
directory, remove_dir_recurse() calls opendir(3) and closedir(3).
This can be problematic because these functions allocate and free
memory, which are not async-signal-safe functions. This can lead to
deadlocks.

One place we call tmp_objdir_create() is in git-receive-pack, where
we create a temporary quarantine directory "incoming". Incoming
objects will be written to this directory before they get moved to
the object directory.

We have observed this code leading to a deadlock:

	Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)):
	#0  __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80
		<main_arena>) at ./lowlevellock.c:35
	#1  0x00007f621baa635b in __GI___libc_malloc
		(bytes=bytes@entry=32816) at malloc.c:3064
	#2  0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60,
		flags=0, close_fd=true, fd=5)
		at ../sysdeps/posix/opendir.c:118
	#3  opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69
	#4  __opendir (name=<optimized out>)
		at ../sysdeps/posix/opendir.c:92
	#5  0x0000557c19c77de1 in remove_dir_recurse ()
	git#6  0x0000557c19d81a4f in remove_tmp_objdir_on_signal ()
	#7  <signal handler called>
	git#8  _int_malloc (av=av@entry=0x7f621bbf8b80 <main_arena>,
		bytes=bytes@entry=7160) at malloc.c:4116
	git#9  0x00007f621baa62c9 in __GI___libc_malloc (bytes=7160)
		at malloc.c:3066
	git#10 0x00007f621bd1e987 in inflateInit2_ ()
		from /opt/gitlab/embedded/lib/libz.so.1
	git#11 0x0000557c19dbe5f4 in git_inflate_init ()
	git#12 0x0000557c19cee02a in unpack_compressed_entry ()
	git#13 0x0000557c19cf08cb in unpack_entry ()
	git#14 0x0000557c19cf0f32 in packed_object_info ()
	git#15 0x0000557c19cd68cd in do_oid_object_info_extended ()
	git#16 0x0000557c19cd6e2b in read_object_file_extended ()
	git#17 0x0000557c19cdec2f in parse_object ()
	git#18 0x0000557c19c34977 in lookup_commit_reference_gently ()
	git#19 0x0000557c19d69309 in mark_uninteresting ()
	git#20 0x0000557c19d2d180 in do_for_each_repo_ref_iterator ()
	git#21 0x0000557c19d21678 in for_each_ref ()
	git#22 0x0000557c19d6a94f in assign_shallow_commits_to_refs ()
	git#23 0x0000557c19bc02b2 in cmd_receive_pack ()
	git#24 0x0000557c19b29fdd in handle_builtin ()
	git#25 0x0000557c19b2a526 in cmd_main ()
	git#26 0x0000557c19b28ea2 in main ()

Since we can't do the cleanup in a portable and signal-safe way, skip
the cleanup when we're handling a signal.

This means that when signal handling, the temporary directory may not
get cleaned up properly. This is mitigated by b3cecf49ea (tmp-objdir: new
API for creating temporary writable databases, 2021-12-06) which changed
the default name and allows gc to clean up these temporary directories.

In the event of a normal exit, we should still be cleaning up via the
atexit() handler.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: John Cai <johncai86@gmail.com>
---
    tmp-objdir: do not closedir() when handling a signal
    
    We have recently observed a Git process hanging around for weeks. A
    backtrace revealed that a git-receive-pack(1) process was deadlocked
    when trying to remove the quarantine directory "incoming." It turns out
    that the tmp_objdir API calls opendir(3) and closedir(3) to observe a
    directory's contents in order to remove all the contents before removing
    the directory itself. These functions are not async signal save as they
    allocate and free memory.
    
    The fix is to avoid calling these functions when handling a signal in
    order to avoid a deadlock. The implication of such a fix however, is
    that temporary object directories may not get cleaned up properly when a
    signal is being handled. The tradeoff this fix is making is to prevent
    deadlocks at the cost of temporary object directory cleanup.
    
    This is similar to 58d4d7f1c5 (2022-01-07 fetch: fix deadlock when
    cleaning up lockfiles in async signals) The difference being in this
    case, the cleanup of the files will not happen when handling a signal.
    
    Changes since v3:
    
     * Updated commit message to clarify that we still clean up the tmp dir
       on atexit

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1348%2Fjohn-cai%2Fjc%2Ffix-tmp-objdir-signal-handling-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1348/john-cai/jc/fix-tmp-objdir-signal-handling-v4
Pull-Request: https://github.com/git/git/pull/1348

Range-diff vs v3:

 1:  d96c44e309f ! 1:  0ea467f25ca tmp-objdir: skip clean up when handling a signal
     @@ Commit message
          API for creating temporary writable databases, 2021-12-06) which changed
          the default name and allows gc to clean up these temporary directories.
      
     +    In the event of a normal exit, we should still be cleaning up via the
     +    atexit() handler.
     +
     +    Helped-by: Jeff King <peff@peff.net>
          Signed-off-by: John Cai <johncai86@gmail.com>
      
       ## tmp-objdir.c ##


 tmp-objdir.c | 40 ++++------------------------------------
 1 file changed, 4 insertions(+), 36 deletions(-)

diff --git a/tmp-objdir.c b/tmp-objdir.c
index adf6033549e..2a2012eb6d0 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -18,7 +18,7 @@ struct tmp_objdir {
 
 /*
  * Allow only one tmp_objdir at a time in a running process, which simplifies
- * our signal/atexit cleanup routines.  It's doubtful callers will ever need
+ * our atexit cleanup routines.  It's doubtful callers will ever need
  * more than one, and we can expand later if so.  You can have many such
  * tmp_objdirs simultaneously in many processes, of course.
  */
@@ -31,7 +31,7 @@ static void tmp_objdir_free(struct tmp_objdir *t)
 	free(t);
 }
 
-static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
+int tmp_objdir_destroy(struct tmp_objdir *t)
 {
 	int err;
 
@@ -41,44 +41,21 @@ static int tmp_objdir_destroy_1(struct tmp_objdir *t, int on_signal)
 	if (t == the_tmp_objdir)
 		the_tmp_objdir = NULL;
 
-	if (!on_signal && t->prev_odb)
+	if (t->prev_odb)
 		restore_primary_odb(t->prev_odb, t->path.buf);
 
-	/*
-	 * This may use malloc via strbuf_grow(), but we should
-	 * have pre-grown t->path sufficiently so that this
-	 * doesn't happen in practice.
-	 */
 	err = remove_dir_recursively(&t->path, 0);
 
-	/*
-	 * When we are cleaning up due to a signal, we won't bother
-	 * freeing memory; it may cause a deadlock if the signal
-	 * arrived while libc's allocator lock is held.
-	 */
-	if (!on_signal)
-		tmp_objdir_free(t);
+	tmp_objdir_free(t);
 
 	return err;
 }
 
-int tmp_objdir_destroy(struct tmp_objdir *t)
-{
-	return tmp_objdir_destroy_1(t, 0);
-}
-
 static void remove_tmp_objdir(void)
 {
 	tmp_objdir_destroy(the_tmp_objdir);
 }
 
-static void remove_tmp_objdir_on_signal(int signo)
-{
-	tmp_objdir_destroy_1(the_tmp_objdir, 1);
-	sigchain_pop(signo);
-	raise(signo);
-}
-
 void tmp_objdir_discard_objects(struct tmp_objdir *t)
 {
 	remove_dir_recursively(&t->path, REMOVE_DIR_KEEP_TOPLEVEL);
@@ -152,14 +129,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	 */
 	strbuf_addf(&t->path, "%s/tmp_objdir-%s-XXXXXX", get_object_directory(), prefix);
 
-	/*
-	 * Grow the strbuf beyond any filename we expect to be placed in it.
-	 * If tmp_objdir_destroy() is called by a signal handler, then
-	 * we should be able to use the strbuf to remove files without
-	 * having to call malloc.
-	 */
-	strbuf_grow(&t->path, 1024);
-
 	if (!mkdtemp(t->path.buf)) {
 		/* free, not destroy, as we never touched the filesystem */
 		tmp_objdir_free(t);
@@ -169,7 +138,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix)
 	the_tmp_objdir = t;
 	if (!installed_handlers) {
 		atexit(remove_tmp_objdir);
-		sigchain_push_common(remove_tmp_objdir_on_signal);
 		installed_handlers++;
 	}
 

base-commit: 4fd6c5e44459e6444c2cd93383660134c95aabd1
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4] tmp-objdir: skip clean up when handling a signal
  2022-09-30 20:47     ` [PATCH v4] " John Cai via GitGitGadget
@ 2022-10-03  8:52       ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2022-10-03  8:52 UTC (permalink / raw)
  To: John Cai via GitGitGadget
  Cc: git, Taylor Blau, Phillip Wood,
	Ævar Arnfjörð Bjarmason, John Cai

On Fri, Sep 30, 2022 at 08:47:11PM +0000, John Cai via GitGitGadget wrote:

>     Changes since v3:
>     
>      * Updated commit message to clarify that we still clean up the tmp dir
>        on atexit

Thanks, this version looks great to me!

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Another possible instance of async-signal-safe opendir path callstack? (Was: [PATCH] tmp-objdir: do not opendir() when handling a signal)
  2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-09-27 19:19 ` [PATCH v2] tmp-objdir: skip clean up " John Cai via GitGitGadget
@ 2022-10-20 11:58 ` Jan Pokorný
  2022-10-20 18:21   ` Jeff King
  5 siblings, 1 reply; 18+ messages in thread
From: Jan Pokorný @ 2022-10-20 11:58 UTC (permalink / raw)
  To: John Cai via GitGitGadget, git
  Cc: John Cai, Taylor Blau, Phillip Wood,
	Ævar Arnfjörð Bjarmason

John Cai via GitGitGadget píše v Po 26. 09. 2022 v 23:53 +0000:
> From: John Cai <johncai86@gmail.com>
> 
> In the tmp-objdir api, tmp_objdir_create will create a temporary
> directory but also register signal handlers responsible for removing
> the directory's contents and the directory itself. However, the
> function responsible for recursively removing the contents and
> directory, remove_dir_recurse() calls opendir(3) and closedir(3).
> This can be problematic because these functions allocate and free
> memory, which are not async-signal-safe functions. This can lead to
> deadlocks.
> 
> One place we call tmp_objdir_create() is in git-receive-pack, where
> we create a temporary quarantine directory "incoming". Incoming
> objects will be written to this directory before they get moved to
> the object directory.

Just noticed this unattended git crash in the logs and I think it might
actually be another occurrence of the same problem in principle, so
shamelessly piggy-backing the stack trace here (no coredump preserved
at this point, sorry):

    #0  0x00007f08df0ea06c __pthread_kill_implementation (libc.so.6 + 0x8b06c)
    #1  0x00007f08df098046 raise (libc.so.6 + 0x39046)
    #2  0x00007f08df0817fc abort (libc.so.6 + 0x227fc)
    #3  0x00007f08df082533 __libc_message.cold (libc.so.6 + 0x23533)
    #4  0x00007f08df090a67 __libc_assert_fail (libc.so.6 + 0x31a67)
    #5  0x00007f08df0f68f2 sysmalloc (libc.so.6 + 0x978f2)
    #6  0x00007f08df0f7789 _int_malloc (libc.so.6 + 0x98789)
    #7  0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2)
    #8  0x00007f08df12fa55 __alloc_dir (libc.so.6 + 0xd0a55)
  ->#9  0x00007f08df12fac2 opendir_tail (libc.so.6 + 0xd0ac2)
    #10 0x00005632ea10c823 remove_temporary_files.lto_priv.0 (git + 0xdc823)
  ->#11 0x00005632ea10c97c remove_pack_on_signal.lto_priv.0 (git + 0xdc97c)
    #12 0x00007f08df0980f0 __restore_rt (libc.so.6 + 0x390f0)
    #13 0x00007f08df0f7775 _int_malloc (libc.so.6 + 0x98775)
    #14 0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2)
    #15 0x00007f08df0d2b44 _IO_file_doallocate (libc.so.6 + 0x73b44)
    #16 0x00007f08df0e1d20 _IO_doallocbuf (libc.so.6 + 0x82d20)
    #17 0x00007f08df0e0a8c _IO_file_underflow@@GLIBC_2.2.5 (libc.so.6 + 0x81a8c)
    #18 0x00007f08df0d4598 __getdelim (libc.so.6 + 0x75598)
    #19 0x00005632ea29f372 strbuf_getwholeline (git + 0x26f372)
    #20 0x00005632ea117915 cmd_repack (git + 0xe7915)
    #21 0x00005632ea0556fa handle_builtin.lto_priv.0 (git + 0x256fa)
    #22 0x00005632ea050551 main (git + 0x20551)
    #23 0x00007f08df082a50 __libc_start_call_main (libc.so.6 + 0x23a50)
    #24 0x00007f08df082b09 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b09)
    #25 0x00005632ea051555 _start (git + 0x21555)

As per the captured info, following Fedora x86_64 packages were involved:

  git-2.37.3-1.fc38
  zlib-1.2.12-5.fc38
  pcre2-10.40-1.fc37.1

Full command at hand:

  /usr/libexec/git-core/git repack -d -l --no-write-bitmap-index

It happened twice, actually, about a week's time apart, but that was
a month ago while this unattended task runs hourly till today and
git hasn't been updated since.

Not subscribed to the list and I don't think I can provide more info
on this, but feel free to contact me directly.

-- poki

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Another possible instance of async-signal-safe opendir path callstack? (Was: [PATCH] tmp-objdir: do not opendir() when handling a signal)
  2022-10-20 11:58 ` Another possible instance of async-signal-safe opendir path callstack? (Was: [PATCH] tmp-objdir: do not opendir() when handling a signal) Jan Pokorný
@ 2022-10-20 18:21   ` Jeff King
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2022-10-20 18:21 UTC (permalink / raw)
  To: Jan Pokorný
  Cc: John Cai via GitGitGadget, git, John Cai, Taylor Blau,
	Phillip Wood, Ævar Arnfjörð Bjarmason

On Thu, Oct 20, 2022 at 01:58:58PM +0200, Jan Pokorný wrote:

> Just noticed this unattended git crash in the logs and I think it might
> actually be another occurrence of the same problem in principle, so
> shamelessly piggy-backing the stack trace here (no coredump preserved
> at this point, sorry):
> 
>     #0  0x00007f08df0ea06c __pthread_kill_implementation (libc.so.6 + 0x8b06c)
>     #1  0x00007f08df098046 raise (libc.so.6 + 0x39046)
>     #2  0x00007f08df0817fc abort (libc.so.6 + 0x227fc)
>     #3  0x00007f08df082533 __libc_message.cold (libc.so.6 + 0x23533)
>     #4  0x00007f08df090a67 __libc_assert_fail (libc.so.6 + 0x31a67)
>     #5  0x00007f08df0f68f2 sysmalloc (libc.so.6 + 0x978f2)
>     #6  0x00007f08df0f7789 _int_malloc (libc.so.6 + 0x98789)
>     #7  0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2)
>     #8  0x00007f08df12fa55 __alloc_dir (libc.so.6 + 0xd0a55)
>   ->#9  0x00007f08df12fac2 opendir_tail (libc.so.6 + 0xd0ac2)
>     #10 0x00005632ea10c823 remove_temporary_files.lto_priv.0 (git + 0xdc823)
>   ->#11 0x00005632ea10c97c remove_pack_on_signal.lto_priv.0 (git + 0xdc97c)
>     #12 0x00007f08df0980f0 __restore_rt (libc.so.6 + 0x390f0)
>     #13 0x00007f08df0f7775 _int_malloc (libc.so.6 + 0x98775)
>     #14 0x00007f08df0f80d2 __libc_malloc (libc.so.6 + 0x990d2)
>     #15 0x00007f08df0d2b44 _IO_file_doallocate (libc.so.6 + 0x73b44)
>     #16 0x00007f08df0e1d20 _IO_doallocbuf (libc.so.6 + 0x82d20)
>     #17 0x00007f08df0e0a8c _IO_file_underflow@@GLIBC_2.2.5 (libc.so.6 + 0x81a8c)
>     #18 0x00007f08df0d4598 __getdelim (libc.so.6 + 0x75598)
>     #19 0x00005632ea29f372 strbuf_getwholeline (git + 0x26f372)
>     #20 0x00005632ea117915 cmd_repack (git + 0xe7915)
>     #21 0x00005632ea0556fa handle_builtin.lto_priv.0 (git + 0x256fa)
>     #22 0x00005632ea050551 main (git + 0x20551)
>     #23 0x00007f08df082a50 __libc_start_call_main (libc.so.6 + 0x23a50)
>     #24 0x00007f08df082b09 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x23b09)
>     #25 0x00005632ea051555 _start (git + 0x21555)

Good find. This is definitely the same issue.

Unfortunately it's another hard one to fix.

git-repack asks git-pack-objects to use .tmp-pack-$$ as a prefix. So I
thought we might be able to get away with using register_tempfile() to
just record the names of potential files it creates by appending .pack,
.idx, and so forth. That code is well-tested and is careful to be
signal-safe.

But the actual name will include the checksum of the packfile, so
.tmp-pack-$$-1234abcd.pack, etc. And repack doesn't know those until
pack-objects reports the hash to us. OTOH, I think pack-objects doesn't
create the files with that name until the very last minute (it has its
own tempfiles[1]). So we could probably get away with expanding the
filenames once it tells us the hash.

Something like the patch below.

diff --git a/builtin/repack.c b/builtin/repack.c
index a5bacc7797..8d6121b438 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -122,13 +122,6 @@ static void remove_temporary_files(void)
 	strbuf_release(&buf);
 }
 
-static void remove_pack_on_signal(int signo)
-{
-	remove_temporary_files();
-	sigchain_pop(signo);
-	raise(signo);
-}
-
 /*
  * Adds all packs hex strings to either fname_nonkept_list or
  * fname_kept_list based on whether each pack has a corresponding
@@ -268,6 +261,17 @@ static unsigned populate_pack_exts(char *name)
 	return ret;
 }
 
+static void register_tmp_pack_for_cleanup(const char *name)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(exts); i++) {
+		char *path = xstrfmt("%s-%s%s", packtmp, name, exts[i].name);
+		register_tempfile(path);
+		free(path);
+	}
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -714,6 +718,7 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 			die(_("repack: Expecting full hex object ID lines only "
 			      "from pack-objects."));
 		string_list_append(names, line.buf);
+		register_tmp_pack_for_cleanup(line.buf);
 	}
 	fclose(out);
 
@@ -859,8 +864,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		split_pack_geometry(geometry, geometric_factor);
 	}
 
-	sigchain_push_common(remove_pack_on_signal);
-
 	prepare_pack_objects(&cmd, &po_args);
 
 	show_progress = !po_args.quiet && isatty(2);
@@ -955,6 +958,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		string_list_append(&names, line.buf);
+		register_tmp_pack_for_cleanup(line.buf);
 	}
 	fclose(out);
 	ret = finish_command(&cmd);
@@ -1025,6 +1029,16 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			else if (unlink(fname) < 0 && errno != ENOENT)
 				die_errno(_("could not unlink: %s"), fname);
 
+			/*
+			 * We should deregister pack tempfiles here; leaving
+			 * them kind of works in that we'll try to unlink
+			 * something that is already gone, but it would be
+			 * cleaner to note that we expect them to be gone here.
+			 * We could perhaps even just use rename_tempfile() and
+			 * delete_tempfile() here, if we modified item->util to
+			 * keep track of the tempfile objects (rather than
+			 * single bits).
+			 */
 			free(fname);
 			free(fname_old);
 		}

-Peff

[1] Note that to some degree, this cleanup is already a bit of a fool's
    errand. It is only catching the "finished" packfiles before we move
    them to their final names. Whereas pack-objects is perfectly happy
    to write bytes into tmp_pack_XXXXXX, and does not bother to clean it
    if it runs into an error or signal. And that is where most of the
    time will be spent.

    So one option would be to just drop this signal handling entirely.
    But IMHO we should go the other way, and teach pack-objects to also
    clean up after itself. I know GitHub has been running with patches
    to call register_tempfile() like this for years (but it is hacky the
    way the above patch is; it never unregisters, which is why we never
    sent them upstream).

    Another problem is that repack.c only catches signals, not die(),
    which is the more likely culprit (e.g., you generate the "main"
    pack, and then something goes wrong while renaming or generating
    an ancillary cruft-packs). Switching to register_tempfile() does fix
    that, though.

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-10-20 18:21 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-26 23:53 [PATCH] tmp-objdir: do not opendir() when handling a signal John Cai via GitGitGadget
2022-09-27  0:18 ` Taylor Blau
2022-09-27 11:48   ` Jeff King
2022-09-27  1:39 ` Junio C Hamano
2022-09-27  9:18 ` Phillip Wood
2022-09-27 11:44 ` Jeff King
2022-09-27 13:50   ` John Cai
2022-09-27 19:03     ` Jeff King
2022-09-27 16:50   ` Junio C Hamano
2022-09-27 19:19 ` [PATCH v2] tmp-objdir: skip clean up " John Cai via GitGitGadget
2022-09-27 19:38   ` Jeff King
2022-09-27 20:00     ` Jeff King
2022-09-28 14:55   ` [PATCH v3] " John Cai via GitGitGadget
2022-09-28 15:38     ` Ævar Arnfjörð Bjarmason
2022-09-30 20:47     ` [PATCH v4] " John Cai via GitGitGadget
2022-10-03  8:52       ` Jeff King
2022-10-20 11:58 ` Another possible instance of async-signal-safe opendir path callstack? (Was: [PATCH] tmp-objdir: do not opendir() when handling a signal) Jan Pokorný
2022-10-20 18:21   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).