git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] builtin/fetch.c: clean tmp pack after receive signal
@ 2021-03-16  2:53 SURA via GitGitGadget
  2021-03-16 21:54 ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: SURA via GitGitGadget @ 2021-03-16  2:53 UTC (permalink / raw)
  To: git; +Cc: SURA, SURA

From: SURA <sura907@hotmail.com>

In Gitee.com, I often use scripts to start a time-limited
"git fetch". When the execution time of'git fetch' is too
long, send a signal (such as SIGINT) to the'git fetch' process

But I found that after the process exits, there are some
files in the format of 'objects/pack/tmp_pack_XXXXXX' on the disk.
They are usually very large (some of them exceed 5G), and'git gc'
has no effect on them.

These fetch_tmp_pack files will gradually fill up the disk.
For a long time, I set F_WRLCK to the file to confirm whether the
file is opened by other processes, and delete the file if it is not opened

Obviously this is only a temporary solution, I think it should be fixed from git

I found fetch will start a 'index-pack' sub-process, this sub-process
create the tmp_pack
  1. make 'index-pack' unlink tmp_pack when get signal
  2. make 'fetch' send signal to 'index-pack' when get signal

Signed-off-by: SURA <sura907@hotmail.com>
---
    builtin/fetch.c: clean tmp pack after receive signal
    
    Clean tmp-pack after 'git fetch' process get signal(SIGINT, SIGHUP,
    SIGTERM, SIGQUIT or SIGPIPE)
    
    Only for operations using ‘wire protocol version 2’ and ssh
    
    Signed-off-by: SURA sura907@hotmail.com

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-890%2FSURA907%2Fgit-fetch-tmp-pack-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-890/SURA907/git-fetch-tmp-pack-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/890

 builtin/fetch-pack.c |  3 ++-
 builtin/fetch.c      | 10 ++++++++++
 builtin/index-pack.c | 18 ++++++++++++++++++
 fetch-pack.c         |  7 +++++--
 fetch-pack.h         |  3 ++-
 transport.c          |  3 ++-
 transport.h          |  6 ++++++
 7 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index c2d96f4c89ab..49a3e0bf2ae3 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -56,6 +56,7 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 	struct string_list deepen_not = STRING_LIST_INIT_DUP;
 	struct packet_reader reader;
 	enum protocol_version version;
+	pid_t index_pack_pid = -1;
 
 	fetch_if_missing = 0;
 
@@ -232,7 +233,7 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 	}
 
 	ref = fetch_pack(&args, fd, ref, sought, nr_sought,
-			 &shallow, pack_lockfiles_ptr, version);
+			 &shallow, pack_lockfiles_ptr, version, &index_pack_pid);
 	if (pack_lockfiles.nr) {
 		int i;
 
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0b90de87c7a2..a87efa23ceb5 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -224,8 +224,18 @@ static void unlock_pack(void)
 		transport_unlock_pack(gsecondary);
 }
 
+static void send_signo_to_index_pack(int signo)
+{
+	if (gtransport && gtransport->index_pack_pid > 0)
+		kill(gtransport->index_pack_pid, signo);
+
+	if (gsecondary && gsecondary->index_pack_pid > 0)
+		kill(gsecondary->index_pack_pid, signo);
+}
+
 static void unlock_pack_on_signal(int signo)
 {
+	send_signo_to_index_pack(signo);
 	unlock_pack();
 	sigchain_pop(signo);
 	raise(signo);
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index bad57488079c..27d1ebba746e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -15,10 +15,13 @@
 #include "packfile.h"
 #include "object-store.h"
 #include "promisor-remote.h"
+#include "sigchain.h"
 
 static const char index_pack_usage[] =
 "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
+static const char *tmp_pack_name;
+
 struct object_entry {
 	struct pack_idx_entry idx;
 	unsigned long size;
@@ -336,6 +339,7 @@ static const char *open_pack_file(const char *pack_name)
 			output_fd = odb_mkstemp(&tmp_file,
 						"pack/tmp_pack_XXXXXX");
 			pack_name = strbuf_detach(&tmp_file, NULL);
+			tmp_pack_name = pack_name;
 		} else {
 			output_fd = open(pack_name, O_CREAT|O_EXCL|O_RDWR, 0600);
 			if (output_fd < 0)
@@ -353,6 +357,19 @@ static const char *open_pack_file(const char *pack_name)
 	return pack_name;
 }
 
+static void remove_tmp_pack(void)
+{
+	if (tmp_pack_name) 
+		unlink_or_warn(tmp_pack_name);
+}
+
+static void remove_tmp_pack_on_signal(int signo)
+{
+	remove_tmp_pack();
+	sigchain_pop(signo);
+	raise(signo);
+}
+
 static void parse_pack_header(void)
 {
 	struct pack_header *hdr = fill(sizeof(struct pack_header));
@@ -1911,6 +1928,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	}
 
 	curr_pack = open_pack_file(pack_name);
+	sigchain_push_common(remove_tmp_pack_on_signal);
 	parse_pack_header();
 	objects = xcalloc(st_add(nr_objects, 1), sizeof(struct object_entry));
 	if (show_stat)
diff --git a/fetch-pack.c b/fetch-pack.c
index 6a61a464283e..bedbf69710f6 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -40,6 +40,7 @@ static struct shallow_lock shallow_lock;
 static const char *alternate_shallow_file;
 static struct strbuf fsck_msg_types = STRBUF_INIT;
 static struct string_list uri_protocols = STRING_LIST_INIT_DUP;
+static pid_t *curr_index_pack_pid;
 
 /* Remember to update object flag allocation in object.h */
 #define COMPLETE	(1U << 0)
@@ -944,6 +945,7 @@ static int get_pack(struct fetch_pack_args *args,
 	cmd.git_cmd = 1;
 	if (start_command(&cmd))
 		die(_("fetch-pack: unable to fork off %s"), cmd_name);
+	*curr_index_pack_pid = cmd.pid;
 	if (do_keep && (pack_lockfiles || fsck_objects)) {
 		int is_well_formed;
 		char *pack_lockfile = index_pack_lockfile(cmd.out, &is_well_formed);
@@ -1945,12 +1947,13 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
 		       struct ref **sought, int nr_sought,
 		       struct oid_array *shallow,
 		       struct string_list *pack_lockfiles,
-		       enum protocol_version version)
+		       enum protocol_version version,
+		       pid_t *index_pack_pid)
 {
 	struct ref *ref_cpy;
 	struct shallow_info si;
 	struct oid_array shallows_scratch = OID_ARRAY_INIT;
-
+	curr_index_pack_pid = index_pack_pid;
 	fetch_pack_setup();
 	if (nr_sought)
 		nr_sought = remove_duplicates_in_refs(sought, nr_sought);
diff --git a/fetch-pack.h b/fetch-pack.h
index 736a3dae467a..c05a2544bb06 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -78,7 +78,8 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
 		       int nr_sought,
 		       struct oid_array *shallow,
 		       struct string_list *pack_lockfiles,
-		       enum protocol_version version);
+		       enum protocol_version version,
+			   pid_t *index_pack_pid);
 
 /*
  * Print an appropriate error message for each sought ref that wasn't
diff --git a/transport.c b/transport.c
index b13fab5dc3b1..b18f62e6bd23 100644
--- a/transport.c
+++ b/transport.c
@@ -391,7 +391,7 @@ static int fetch_refs_via_pack(struct transport *transport,
 	refs = fetch_pack(&args, data->fd,
 			  refs_tmp ? refs_tmp : transport->remote_refs,
 			  to_fetch, nr_heads, &data->shallow,
-			  &transport->pack_lockfiles, data->version);
+			  &transport->pack_lockfiles, data->version, &transport->index_pack_pid);
 
 	close(data->fd[0]);
 	close(data->fd[1]);
@@ -1095,6 +1095,7 @@ struct transport *transport_get(struct remote *remote, const char *url)
 	}
 
 	ret->hash_algo = &hash_algos[GIT_HASH_SHA1];
+	ret->index_pack_pid = -1;
 
 	return ret;
 }
diff --git a/transport.h b/transport.h
index 24e15799e714..d25d76125784 100644
--- a/transport.h
+++ b/transport.h
@@ -118,6 +118,12 @@ struct transport {
 	enum transport_family family;
 
 	const struct git_hash_algo *hash_algo;
+
+	/*
+	 * Record the pid of the index_pack sub-process
+	 * used to send the signal to this sub-process after the main fetch process receives the signal
+	 */
+	pid_t index_pack_pid;
 };
 
 #define TRANSPORT_PUSH_ALL			(1<<0)

base-commit: 13d7ab6b5d7929825b626f050b62a11241ea4945
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] builtin/fetch.c: clean tmp pack after receive signal
  2021-03-16  2:53 [PATCH] builtin/fetch.c: clean tmp pack after receive signal SURA via GitGitGadget
@ 2021-03-16 21:54 ` Jeff King
  2021-03-17 18:15   ` Junio C Hamano
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2021-03-16 21:54 UTC (permalink / raw)
  To: SURA via GitGitGadget; +Cc: git, SURA

On Tue, Mar 16, 2021 at 02:53:36AM +0000, SURA via GitGitGadget wrote:

> In Gitee.com, I often use scripts to start a time-limited

Not related to your patch, but I think this name falls afoul of Git's
trademark policy. See:

  https://git-scm.com/trademark

There's also some discussion in this thread:

  https://lore.kernel.org/git/20170202022655.2jwvudhvo4hmueaw@sigill.intra.peff.net/

> "git fetch". When the execution time of'git fetch' is too
> long, send a signal (such as SIGINT) to the'git fetch' process
> 
> But I found that after the process exits, there are some
> files in the format of 'objects/pack/tmp_pack_XXXXXX' on the disk.
> They are usually very large (some of them exceed 5G), and'git gc'
> has no effect on them.

This isn't quite true. "git gc" will clean up the temporary files, but
only if the mtime is sufficiently old. The purpose here is to give a
grace period to avoid deleting a file that is actively being written to.
However, we use the same grace period that we use for deleting
unreachable objects, which is absurdly long for this purpose: 2 weeks.
Probably something like an hour would be more appropriate (since the
mtime is updated on each write, this would imply a process not making
forward progress).

That said...

> Obviously this is only a temporary solution, I think it should be fixed from git
> 
> I found fetch will start a 'index-pack' sub-process, this sub-process
> create the tmp_pack
>   1. make 'index-pack' unlink tmp_pack when get signal
>   2. make 'fetch' send signal to 'index-pack' when get signal

I do think this is worth doing. One of the reasons we haven't
traditionally cleaned up temporary packfiles is that the failed state is
sometimes useful for analyzing what went wrong, or even recovering
partial data. But that probably should not be the default mode of
operation (i.e., a config option or environment variable should probably
be able to turn it on for debugging).

Looking at the implementation itself...

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 0b90de87c7a2..a87efa23ceb5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -224,8 +224,18 @@ static void unlock_pack(void)
>  		transport_unlock_pack(gsecondary);
>  }
>  
> +static void send_signo_to_index_pack(int signo)
> +{
> +	if (gtransport && gtransport->index_pack_pid > 0)
> +		kill(gtransport->index_pack_pid, signo);
> +
> +	if (gsecondary && gsecondary->index_pack_pid > 0)
> +		kill(gsecondary->index_pack_pid, signo);
> +}
> +
>  static void unlock_pack_on_signal(int signo)
>  {
> +	send_signo_to_index_pack(signo);
>  	unlock_pack();
>  	sigchain_pop(signo);
>  	raise(signo);

We'd probably want to also trigger this if we just hit die(), I'd think.
We have a system for killing process children when we exit unexpectedly.
I think just setting the "clean_on_exit" flag on the index-pack
child_process struct would turn this into a one-liner.

Likewise, we'd probably want to do this from receive-pack, too (so that
pushes don't accumulate temporary packfiles on the receiving side).

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index bad57488079c..27d1ebba746e 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> [...]
> +static void remove_tmp_pack(void)
> +{
> +	if (tmp_pack_name) 
> +		unlink_or_warn(tmp_pack_name);
> +}
> +
> +static void remove_tmp_pack_on_signal(int signo)
> +{
> +	remove_tmp_pack();
> +	sigchain_pop(signo);
> +	raise(signo);
> +}

Likewise, we have a tempfile cleanup system already.

I think this hunk:

> @@ -336,6 +339,7 @@ static const char *open_pack_file(const char *pack_name)
>  			output_fd = odb_mkstemp(&tmp_file,
>  						"pack/tmp_pack_XXXXXX");
>  			pack_name = strbuf_detach(&tmp_file, NULL);
> +			tmp_pack_name = pack_name;

...can just call register_tempfile(). It should also record the result
so that we don't try to unlink() it after we've already moved it away
from its temporary name (though it's fairly unlikely for somebody else
to have used the name in the interim).

I think you'd want to do the same for the tmp_idx_* files, too. Likewise
for ".rev" files we create starting in v2.31.

I think it would also make sense in create_tmp_packfile(), which is used
during repacking (a different problem space, but really the same thing:
if repacking fails for some reason, we probably shouldn't leave a
useless gigantic half-finished packfile on disk).

We should possibly also do so for tmp_obj_* files. Those can be written
for a fetch or push via unpack-objects (as well as normal local
commands). They're not usually as big as a pack, obviously, but I think
the same principle applies.

> [...]

It would be nice to see some tests covering this functionality, too.
Reproducing it with signals is likely to be racy and not worth it. But I
think that right now index-pack reading a bogus pack (say, one that
fails fsck checks) will leave the tmp_pack_* on disk. And it would not
if we cleanup tempfiles (again, this would be on any exit, not just
signal death, but I think that is what we'd want, and also what
register_tempfile() will do).

-Peff

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] builtin/fetch.c: clean tmp pack after receive signal
  2021-03-16 21:54 ` Jeff King
@ 2021-03-17 18:15   ` Junio C Hamano
  0 siblings, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2021-03-17 18:15 UTC (permalink / raw)
  To: Jeff King; +Cc: SURA via GitGitGadget, git, SURA

Jeff King <peff@peff.net> writes:

> On Tue, Mar 16, 2021 at 02:53:36AM +0000, SURA via GitGitGadget wrote:
>
>> In Gitee.com, I often use scripts to start a time-limited
>
> Not related to your patch, but I think this name falls afoul of Git's
> trademark policy. See:
>
>   https://git-scm.com/trademark
>
> There's also some discussion in this thread:
>
>   https://lore.kernel.org/git/20170202022655.2jwvudhvo4hmueaw@sigill.intra.peff.net/

Thanks.  On somewhat related to this patch, we also ask contributors
to use their real names so that we do not render the Signed-off-by:
procedure meaningless.

> This isn't quite true. "git gc" will clean up the temporary files, but
> only if the mtime is sufficiently old. The purpose here is to give a
> grace period to avoid deleting a file that is actively being written to.
> However, we use the same grace period that we use for deleting
> unreachable objects, which is absurdly long for this purpose: 2 weeks.
> Probably something like an hour would be more appropriate (since the
> mtime is updated on each write, this would imply a process not making
> forward progress).

I agree that for temporaries the two-week default is way too long,
and I am OK if we decide to shorten the expiration for them
separately from the known-to-be-good-but-unreferenced objects.

> Likewise, we have a tempfile cleanup system already.
>
> I think this hunk:
>
>> @@ -336,6 +339,7 @@ static const char *open_pack_file(const char *pack_name)
>>  			output_fd = odb_mkstemp(&tmp_file,
>>  						"pack/tmp_pack_XXXXXX");
>>  			pack_name = strbuf_detach(&tmp_file, NULL);
>> +			tmp_pack_name = pack_name;
>
> ...can just call register_tempfile(). It should also record the result
> so that we don't try to unlink() it after we've already moved it away
> from its temporary name (though it's fairly unlikely for somebody else
> to have used the name in the interim).
>
> I think you'd want to do the same for the tmp_idx_* files, too. Likewise
> for ".rev" files we create starting in v2.31.
>
> I think it would also make sense in create_tmp_packfile(), which is used
> during repacking (a different problem space, but really the same thing:
> if repacking fails for some reason, we probably shouldn't leave a
> useless gigantic half-finished packfile on disk).
>
> We should possibly also do so for tmp_obj_* files. Those can be written
> for a fetch or push via unpack-objects (as well as normal local
> commands). They're not usually as big as a pack, obviously, but I think
> the same principle applies.
>
>> [...]
>
> It would be nice to see some tests covering this functionality, too.
> Reproducing it with signals is likely to be racy and not worth it. But I
> think that right now index-pack reading a bogus pack (say, one that
> fails fsck checks) will leave the tmp_pack_* on disk. And it would not
> if we cleanup tempfiles (again, this would be on any exit, not just
> signal death, but I think that is what we'd want, and also what
> register_tempfile() will do).

Sounds like a good medium difficulty leftover bit.

Thanks.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-03-17 18:16 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-16  2:53 [PATCH] builtin/fetch.c: clean tmp pack after receive signal SURA via GitGitGadget
2021-03-16 21:54 ` Jeff King
2021-03-17 18:15   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).