* [PATCH] repack: respect gc.pid lock
From: David Turner @ 2017-04-13 20:27 UTC (permalink / raw)
  To: git; +Cc: christian.couder, mfick, jacob.keller, David Turner

Git gc locks the repository (using a gc.pid file) so that other gcs
don't run concurrently. Make git repack respect this lock.

Now repack, by default, will refuse to run at the same time as a gc.
This fixes a concurrency issue: a repack which deleted packs would
make a concurrent gc sad when its packs were deleted out from under
it.  The gc would fail with: "fatal: ./objects/pack/pack-$sha.pack
cannot be accessed".  Then it would die, probably leaving a large temp
pack hanging around.

Git repack learns --no-lock, so that when run under git gc, it doesn't
attempt to manage the lock itself.

Martin Fick suggested just moving the lock into git repack, but this
would leave parts of git gc (e.g. git prune) protected by only local
locks.  I worried that a prune (part of git gc) concurrent with a
repack could confuse the repack, so I decided to go with this
solution.

Signed-off-by: David Turner <dturner@twosigma.com>
---
 Documentation/git-repack.txt |  5 +++
 Makefile                     |  1 +
 builtin/gc.c                 | 72 ++----------------------------------
 builtin/repack.c             | 13 +++++++
 repack.c                     | 88 ++++++++++++++++++++++++++++++++++++++++++++
 repack.h                     |  8 ++++
 t/t7700-repack.sh            |  8 ++++
 7 files changed, 127 insertions(+), 68 deletions(-)
 create mode 100644 repack.c
 create mode 100644 repack.h

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 26afe6ed54..b347ff5c62 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,11 @@ other objects in that pack they already have locally.
 	being removed. In addition, any unreachable loose objects will
 	be packed (and their loose counterparts removed).
 
+--no-lock::
+	Do not lock the repository, and do not respect any existing lock.
+	Mostly useful for running repack within git gc.  Do not use this
+	unless you know what you are doing.
+
 Configuration
 -------------
 
diff --git a/Makefile b/Makefile
index 9b36068ac5..7095f03959 100644
--- a/Makefile
+++ b/Makefile
@@ -816,6 +816,7 @@ LIB_OBJS += refs/files-backend.o
 LIB_OBJS += refs/iterator.o
 LIB_OBJS += ref-filter.o
 LIB_OBJS += remote.o
+LIB_OBJS += repack.o
 LIB_OBJS += replace_object.o
 LIB_OBJS += rerere.o
 LIB_OBJS += resolve-undo.o
diff --git a/builtin/gc.c b/builtin/gc.c
index c2c61a57bb..9b9c27020b 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -18,6 +18,7 @@
 #include "sigchain.h"
 #include "argv-array.h"
 #include "commit.h"
+#include "repack.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -45,7 +46,6 @@ static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
 
-static struct tempfile pidfile;
 static struct lock_file log_lock;
 
 static struct string_list pack_garbage = STRING_LIST_INIT_DUP;
@@ -234,70 +234,6 @@ static int need_to_gc(void)
 	return 1;
 }
 
-/* return NULL on success, else hostname running the gc */
-static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
-{
-	static struct lock_file lock;
-	char my_host[128];
-	struct strbuf sb = STRBUF_INIT;
-	struct stat st;
-	uintmax_t pid;
-	FILE *fp;
-	int fd;
-	char *pidfile_path;
-
-	if (is_tempfile_active(&pidfile))
-		/* already locked */
-		return NULL;
-
-	if (gethostname(my_host, sizeof(my_host)))
-		xsnprintf(my_host, sizeof(my_host), "unknown");
-
-	pidfile_path = git_pathdup("gc.pid");
-	fd = hold_lock_file_for_update(&lock, pidfile_path,
-				       LOCK_DIE_ON_ERROR);
-	if (!force) {
-		static char locking_host[128];
-		int should_exit;
-		fp = fopen(pidfile_path, "r");
-		memset(locking_host, 0, sizeof(locking_host));
-		should_exit =
-			fp != NULL &&
-			!fstat(fileno(fp), &st) &&
-			/*
-			 * 12 hour limit is very generous as gc should
-			 * never take that long. On the other hand we
-			 * don't really need a strict limit here,
-			 * running gc --auto one day late is not a big
-			 * problem. --force can be used in manual gc
-			 * after the user verifies that no gc is
-			 * running.
-			 */
-			time(NULL) - st.st_mtime <= 12 * 3600 &&
-			fscanf(fp, "%"SCNuMAX" %127c", &pid, locking_host) == 2 &&
-			/* be gentle to concurrent "gc" on remote hosts */
-			(strcmp(locking_host, my_host) || !kill(pid, 0) || errno == EPERM);
-		if (fp != NULL)
-			fclose(fp);
-		if (should_exit) {
-			if (fd >= 0)
-				rollback_lock_file(&lock);
-			*ret_pid = pid;
-			free(pidfile_path);
-			return locking_host;
-		}
-	}
-
-	strbuf_addf(&sb, "%"PRIuMAX" %s",
-		    (uintmax_t) getpid(), my_host);
-	write_in_full(fd, sb.buf, sb.len);
-	strbuf_release(&sb);
-	commit_lock_file(&lock);
-	register_tempfile(&pidfile, pidfile_path);
-	free(pidfile_path);
-	return NULL;
-}
-
 static int report_last_gc_error(void)
 {
 	struct strbuf sb = STRBUF_INIT;
@@ -370,7 +306,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	argv_array_pushl(&pack_refs_cmd, "pack-refs", "--all", "--prune", NULL);
 	argv_array_pushl(&reflog, "reflog", "expire", "--all", NULL);
-	argv_array_pushl(&repack, "repack", "-d", "-l", NULL);
+	argv_array_pushl(&repack, "repack", "-d", "-l", "--no-lock", NULL);
 	argv_array_pushl(&prune, "prune", "--expire", NULL);
 	argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", NULL);
 	argv_array_pushl(&rerere, "rerere", "gc", NULL);
@@ -426,11 +362,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	} else
 		add_repack_all_option();
 
-	name = lock_repo_for_gc(force, &pid);
+	name = lock_repo_for_pack_manipulation(force, &pid);
 	if (name) {
 		if (auto_gc)
 			return 0; /* be quiet on --auto */
-		die(_("gc is already running on machine '%s' pid %"PRIuMAX" (use --force if not)"),
+		die(_("pack operation (gc or repack) is already running on machine '%s' pid %"PRIuMAX" (use --force if not)"),
 		    name, (uintmax_t)pid);
 	}
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 677bc7c81a..619ac37a05 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -7,6 +7,7 @@
 #include "strbuf.h"
 #include "string-list.h"
 #include "argv-array.h"
+#include "repack.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -160,6 +161,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int no_lock = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -194,6 +196,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("maximum size of each packfile")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
+		OPT_BOOL(0, "no-lock", &no_lock,
+				N_("Do not lock the repository, and do not respect any existing lock.  Mostly useful for operation within git gc.")),
 		OPT_END()
 	};
 
@@ -215,6 +219,15 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
 		die(_(incremental_bitmap_conflict_error));
 
+	if (!no_lock) {
+		pid_t pid;
+		const char *name = lock_repo_for_pack_manipulation(0, &pid);
+		if (name) {
+			die(_("pack operation (gc or repack) is already running on machine '%s' pid %"PRIuMAX" (use --no-lock if not)"),
+			    name, (uintmax_t)pid);
+		}
+	}
+
 	packdir = mkpathdup("%s/pack", get_object_directory());
 	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
 
diff --git a/repack.c b/repack.c
new file mode 100644
index 0000000000..a6df28c7f2
--- /dev/null
+++ b/repack.c
@@ -0,0 +1,88 @@
+#include "builtin.h"
+#include "repack.h"
+#include "strbuf.h"
+#include "lockfile.h"
+#include "tempfile.h"
+
+static struct tempfile pidfile;
+
+/*
+ * Commands should call this before doing any operation which might
+ * delete a pack file (e.g. gc or repack).  We don't want to allow
+ * multiple operations of this type to operate at the same time.
+ *
+ * For historical reasons, the pid file created is called "gc.pid",
+ * even though it is also used for git-repack.
+ *
+ * The lock will persist until the process ends.
+ *
+ * If force is non-zero, any existing lock will be disregarded.
+ *
+ * return NULL on success, else hostname running the pack manipulation
+ * operation.
+ *
+ * It is safe to call this function multiple times in the same process;
+ * calls after the first successful call will always return NULL.
+ */
+const char *lock_repo_for_pack_manipulation(int force, pid_t* ret_pid)
+{
+	static struct lock_file lock;
+	char my_host[128];
+	struct strbuf sb = STRBUF_INIT;
+	struct stat st;
+	uintmax_t pid;
+	FILE *fp;
+	int fd;
+	char *pidfile_path;
+
+	if (is_tempfile_active(&pidfile))
+		/* already locked */
+		return NULL;
+
+	if (gethostname(my_host, sizeof(my_host)))
+		xsnprintf(my_host, sizeof(my_host), "unknown");
+
+	pidfile_path = git_pathdup("gc.pid");
+	fd = hold_lock_file_for_update(&lock, pidfile_path,
+				       LOCK_DIE_ON_ERROR);
+	if (!force) {
+		static char locking_host[128];
+		int should_exit;
+		fp = fopen(pidfile_path, "r");
+		memset(locking_host, 0, sizeof(locking_host));
+		should_exit =
+			fp != NULL &&
+			!fstat(fileno(fp), &st) &&
+			/*
+			 * 12 hour limit is very generous as gc should
+			 * never take that long. On the other hand we
+			 * don't really need a strict limit here,
+			 * running gc --auto one day late is not a big
+			 * problem. --force can be used in manual gc
+			 * after the user verifies that no gc is
+			 * running.
+			 */
+			time(NULL) - st.st_mtime <= 12 * 3600 &&
+			fscanf(fp, "%"SCNuMAX" %127c", &pid, locking_host) == 2 &&
+			/* be gentle to concurrent "gc" on remote hosts */
+			(strcmp(locking_host, my_host) || !kill(pid, 0) || errno == EPERM);
+		if (fp != NULL)
+			fclose(fp);
+		if (should_exit) {
+			if (fd >= 0)
+				rollback_lock_file(&lock);
+			*ret_pid = pid;
+			free(pidfile_path);
+			return locking_host;
+		}
+	}
+
+	strbuf_addf(&sb, "%"PRIuMAX" %s",
+		    (uintmax_t) getpid(), my_host);
+	write_in_full(fd, sb.buf, sb.len);
+	strbuf_release(&sb);
+	commit_lock_file(&lock);
+	register_tempfile(&pidfile, pidfile_path);
+	free(pidfile_path);
+	return NULL;
+}
diff --git a/repack.h b/repack.h
new file mode 100644
index 0000000000..bf9144ee37
--- /dev/null
+++ b/repack.h
@@ -0,0 +1,8 @@
+#ifndef REPACK_H
+#define REPACK_H
+
+#include "git-compat-util.h"
+
+const char *lock_repo_for_pack_manipulation(int force, pid_t* ret_pid);
+
+#endif /* REPACK_H */
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 6061a04147..52f19c5871 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -196,5 +196,13 @@ test_expect_success 'objects made unreachable by grafts only are kept' '
 	git cat-file -t $H1
 	'
 
+test_expect_success 'repack respects gc.pid' '
+	test_tick &&
+	test_when_finished "rm -f .git/gc.pid" &&
+	echo -n "1234 hostname" >.git/gc.pid &&
+	test_must_fail git repack -a -d 2>err &&
+	test_i18ngrep "already running on machine .hostname. pid 1234" err
+	'
+
 test_done
 
-- 
2.11.GIT



* Re: [PATCH] repack: respect gc.pid lock
From: Jacob Keller @ 2017-04-14  0:33 UTC (permalink / raw)
  To: David Turner; +Cc: Git mailing list, Christian Couder, Martin Fick

On Thu, Apr 13, 2017 at 1:27 PM, David Turner <dturner@twosigma.com> wrote:
> Git gc locks the repository (using a gc.pid file) so that other gcs
> don't run concurrently. Make git repack respect this lock.
>
> Now repack, by default, will refuse to run at the same time as a gc.
> This fixes a concurrency issue: a repack which deleted packs would
> make a concurrent gc sad when its packs were deleted out from under
> it.  The gc would fail with: "fatal: ./objects/pack/pack-$sha.pack
> cannot be accessed".  Then it would die, probably leaving a large temp
> pack hanging around.
>
> Git repack learns --no-lock, so that when run under git gc, it doesn't
> attempt to manage the lock itself.
>
> Martin Fick suggested just moving the lock into git repack, but this
> would leave parts of git gc (e.g. git prune) protected by only local
> locks.  I worried that a prune (part of git gc) concurrent with a
> repack could confuse the repack, so I decided to go with this
> solution.
>

The last paragraph could be reworded to be a bit less personal and
more as a direct statement of why moving the lock entirely to repack
is a bad idea.

> Signed-off-by: David Turner <dturner@twosigma.com>
> ---
>  Documentation/git-repack.txt |  5 +++
>  Makefile                     |  1 +
>  builtin/gc.c                 | 72 ++----------------------------------
>  builtin/repack.c             | 13 +++++++
>  repack.c                     | 88 ++++++++++++++++++++++++++++++++++++++++++++
>  repack.h                     |  8 ++++
>  t/t7700-repack.sh            |  8 ++++
>  7 files changed, 127 insertions(+), 68 deletions(-)
>  create mode 100644 repack.c
>  create mode 100644 repack.h
>
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index 26afe6ed54..b347ff5c62 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -143,6 +143,11 @@ other objects in that pack they already have locally.
>         being removed. In addition, any unreachable loose objects will
>         be packed (and their loose counterparts removed).
>
> +--no-lock::
> +       Do not lock the repository, and do not respect any existing lock.
> +       Mostly useful for running repack within git gc.  Do not use this
> +       unless you know what you are doing.
> +

I would have phrased this more like:

  Used internally by git gc to call git repack while already holding
the lock. Do not use unless you know what you're doing.


* Re: [PATCH] repack: respect gc.pid lock
From: Jeff King @ 2017-04-14 19:33 UTC (permalink / raw)
  To: David Turner; +Cc: git, christian.couder, mfick, jacob.keller

On Thu, Apr 13, 2017 at 04:27:12PM -0400, David Turner wrote:

> Git gc locks the repository (using a gc.pid file) so that other gcs
> don't run concurrently. Make git repack respect this lock.
> 
> Now repack, by default, will refuse to run at the same time as a gc.
> This fixes a concurrency issue: a repack which deleted packs would
> make a concurrent gc sad when its packs were deleted out from under
> it.  The gc would fail with: "fatal: ./objects/pack/pack-$sha.pack
> cannot be accessed".  Then it would die, probably leaving a large temp
> pack hanging around.
> 
> Git repack learns --no-lock, so that when run under git gc, it doesn't
> attempt to manage the lock itself.

This also means that two repack invocations cannot run simultaneously,
because they want to take the same lock.  But depending on the options,
the two don't necessarily conflict. For example, two simultaneous
incremental "git repack -d" invocations should be able to complete.

Do we know where the error message is coming from? I couldn't find the
error message you've given above; grepping for "cannot be accessed"
shows only error messages that would have "packfile" after the "fatal:".
Is it a copy-paste error?

If that's the case, then it's the one in use_pack(). Do we know what
program/operation is causing the error? Having a simultaneous gc delete
a packfile is _supposed_ to work, through a combination of:

  1. Most sha1-access operations can re-scan the pack directory if they
     find the packfile went away (see the sketch after this list).

  2. The pack-objects run by a simultaneous repack is somewhat special
     in that once it finds and commits to a copy of an object in a pack,
     we need to use exactly that pack, because we record its offset,
     delta representation, etc. Usually this works because we open and
     mmap the packfile before making that commitment, and open packfiles
     are only closed if you run out of file descriptors (which should
     only happen when you have a huge number of packs).
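
(For illustration only: the retry in point 1 looks conceptually like the
sketch below. This is not the actual code; the names find_pack_entry()
and reprepare_packed_git() are from sha1_file.c as I remember them, and
the structure is simplified.)

  /*
   * Sketch: if an object lookup misses, re-read objects/pack once
   * before giving up, in case a concurrent repack replaced the pack
   * we were about to use.
   */
  static int find_entry_with_rescan(const unsigned char *sha1,
                                    struct pack_entry *e)
  {
          if (find_pack_entry(sha1, e))
                  return 1;
          reprepare_packed_git();
          return find_pack_entry(sha1, e);
  }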

So I'm worried that this repack lock is going to regress some other
cases that run fine together. But I'm also worried that it's a band-aid
over a more subtle problem. If pack-objects is not able to run alongside
a gc, then you're also going to have problems serving fetches, and
obviously you wouldn't want to take a lock there.

-Peff


* RE: [PATCH] repack: respect gc.pid lock
From: David Turner @ 2017-04-17 23:29 UTC (permalink / raw)
  To: 'Jeff King'
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com


> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Friday, April 14, 2017 3:34 PM
> To: David Turner <David.Turner@twosigma.com>
> Cc: git@vger.kernel.org; christian.couder@gmail.com; mfick@codeaurora.org;
> jacob.keller@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Thu, Apr 13, 2017 at 04:27:12PM -0400, David Turner wrote:
> 
> > Git gc locks the repository (using a gc.pid file) so that other gcs
> > don't run concurrently. Make git repack respect this lock.
> >
> > Now repack, by default, will refuse to run at the same time as a gc.
> > This fixes a concurrency issue: a repack which deleted packs would
> > make a concurrent gc sad when its packs were deleted out from under
> > it.  The gc would fail with: "fatal: ./objects/pack/pack-$sha.pack
> > cannot be accessed".  Then it would die, probably leaving a large temp
> > pack hanging around.
> >
> > Git repack learns --no-lock, so that when run under git gc, it doesn't
> > attempt to manage the lock itself.
> 
> This also means that two repack invocations cannot run simultaneously, because
> they want to take the same lock.  But depending on the options, the two don't
> necessarily conflict. For example, two simultaneous incremental "git repack -d"
> invocations should be able to complete.
> 
> Do we know where the error message is coming from? I couldn't find the error
> message you've given above; grepping for "cannot be accessed"
> shows only error messages that would have "packfile" after the "fatal:".
> Is it a copy-paste error?

Yes, it is.  Sorry.

We saw this failure in the logs multiple times (with three different
shas, while a gc was running):
April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects' in [repo] failed:
fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
Some other repack may also have been running at the time.

My colleague also saw it while manually doing gc (again while 
repacks were likely to be running):
$ git gc --aggressive
Counting objects: 13800073, done.
Delta compression using up to 8 threads.
Compressing objects:  99% (11465846/11465971)   
Compressing objects: 100% (11465971/11465971), done.
fatal: packfile [repo]/objects/pack/pack-[sha].pack cannot be accessed

(yes, I know that --aggressive is usually not desirable)

> If that's the case, then it's the one in use_pack(). Do we know what
> program/operation is causing the error? Having a simultaneous gc delete a
> packfile is _supposed_ to work, through a combination of:
> 
>   1. Most sha1-access operations can re-scan the pack directory if they
>      find the packfile went away.
> 
>   2. The pack-objects run by a simultaneous repack is somewhat special
>      in that once it finds and commits to a copy of an object in a pack,
>      we need to use exactly that pack, because we record its offset,
>      delta representation, etc. Usually this works because we open and
>      mmap the packfile before making that commitment, and open packfiles
>      are only closed if you run out of file descriptors (which should
>      only happen when you have a huge number of packs).

We have a reasonable rlimit (64k soft limit), so that failure mode is pretty
unlikely.  I think we should have had 20 or so packs -- not tens of thousands.

> So I'm worried that this repack lock is going to regress some other cases that run
> fine together. But I'm also worried that it's a band-aid over a more subtle
> problem. If pack-objects is not able to run alongside a gc, then you're also going
> to have problems serving fetches, and obviously you wouldn't want to take a
> lock there.

I see your point.  I don't know if it's pack-objects that's seeing this, although maybe 
that's the only reasonable codepath.

I did some tracing through the code, and couldn't figure out how to trigger that
error message.  It appears in two places in the code, but only when the pack is
not initialized, and the packs always seem to be set up by that point in my test
runs.  It's worth noting that I'm not testing on the gitlab server; I'm testing
on my laptop with a completely different repo.  I've tried various ways to
reproduce this -- or even to get to a point where those errors could have been
thrown given a missing pack -- and I have not been able to do so.

Do you have any idea why this would be happening other than the rlimit thing?



* Re: [PATCH] repack: respect gc.pid lock
From: Jeff King @ 2017-04-18  3:41 UTC (permalink / raw)
  To: David Turner
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:

> We saw this failure in the logs multiple  times (with three different
> shas, while a gc was running):
> April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d --pack-kept-objects' in [repo] failed:
> fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> Possibly some other repack was also running at the time as well.
> 
> My colleague also saw it while manually doing gc (again while 
> repacks were likely to be running):

This is sort of a side question, but...why are you running other repacks
alongside git-gc? It seems like you ought to be doing one or the other.

I don't begrudge anybody with a complicated setup running their own set
of gc commands, but I'd think you would want to do locking there, and
disable auto-gc entirely. Otherwise you're going to get different
results depending on who gc'd last.

> $ git gc --aggressive
> Counting objects: 13800073, done.
> Delta compression using up to 8 threads.
> Compressing objects:  99% (11465846/11465971)   
> Compressing objects: 100% (11465971/11465971), done.
> fatal: packfile [repo]/objects/pack/pack-[sha].pack cannot be accessed

OK, so this presumably happened during the writing phase. Which seems
like the "a pack was closed, and we couldn't re-open it" problem we've
seen before.

> We have a reasonable rlimit (64k soft limit), so that failure mode is pretty 
> unlikely.  I  think we should have had 20 or so packs -- not tens of thousands.
> [...]
> Do you have any idea why this would be happening other than the rlimit thing?

Yeah, that should be enough (you could double check the return of
get_max_fd_limit() on your system if you wanted to be paranoid).
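
(If you want a quick check outside of git: the standard getrlimit() call
reports the soft/hard descriptor limits that get_max_fd_limit() consults,
if I recall its fallbacks correctly. A minimal standalone sketch, not git
code:)

  #include <stdio.h>
  #include <sys/resource.h>

  int main(void)
  {
          struct rlimit r;
          if (getrlimit(RLIMIT_NOFILE, &r))
                  return 1;
          /* RLIM_INFINITY shows up as a very large number */
          printf("soft=%llu hard=%llu\n",
                 (unsigned long long)r.rlim_cur,
                 (unsigned long long)r.rlim_max);
          return 0;
  }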

We also keep only a limited number of bytes mmap'd at one time. Normally
we don't actually close packfiles when we release their mmap windows.
But I think there is one path that might. When use_pack() maps a pack,
if the entire pack fits in a single window, then we close it; this is
due to d131b7afe (sha1_file.c: Don't retain open fds on small packs,
2011-03-02).

But if we ever unmap that window, now we have no handle to the pack.
Normally on a 64-bit system this wouldn't happen at all, since the
default core.packedGitLimit is 8GB there.
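
(Paraphrasing that small-pack behavior as a sketch -- this is the shape of
it, not the code from d131b7afe; the pack_fd/pack_size/win fields are from
the packed_git/pack_window structs as I remember them:)

  /* inside use_pack(), just after mmap'ing a window of pack p */
  if (!win->offset && win->len == p->pack_size) {
          /*
           * The whole pack fits in this single window, so the file
           * descriptor is no longer needed -- but if this window is
           * ever unmapped, we are left with no open handle on the pack.
           */
          close(p->pack_fd);
          p->pack_fd = -1;
  }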

So if you have a few small packs and one very large pack (over 8GB), I
think this could trigger. We may do the small-pack thing for some of
them, and then the large pack forces us to drop the mmaps for some of
the others. When we go back to access the small pack, we find it's gone.

One solution would be to bump core.packedGitLimit to something much
higher (it's an mmap, so we're really just chewing up address space;
it's up to the OS to decide when to load pages from disk and when to
drop them).

The other alternative is to disable the small-pack closing from
d131b7afe. It might need to be configurable, or perhaps auto-tuned based
on the fd limit. Linux systems tend to have generous descriptor limits,
but I'm not sure we can rely on that. OTOH, it seems like the code to
close descriptors when needed would take care of things. So maybe we
should just revert d131b7afe entirely.

The final thing I'd ask is whether you might be on a networked
filesystem that would foil our usual "open descriptors mean packs don't
go away" logic. But after having dug into the details above, I have a
feeling the answer is simply that you have repositories >8GB.

And if that is the case, then yeah, your locking patch is definitely a
band-aid. If you fetch and repack at the same time, you'll eventually
see a racy failed fetch.

-Peff


* RE: [PATCH] repack: respect gc.pid lock
From: David Turner @ 2017-04-18 17:08 UTC (permalink / raw)
  To: 'Jeff King'
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Monday, April 17, 2017 11:42 PM
> To: David Turner <David.Turner@twosigma.com>
> Cc: git@vger.kernel.org; christian.couder@gmail.com; mfick@codeaurora.org;
> jacob.keller@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> 
> > We saw this failure in the logs multiple  times (with three different
> > shas, while a gc was running):
> > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d
> --pack-kept-objects' in [repo] failed:
> > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > Possibly some other repack was also running at the time as well.
> >
> > My colleague also saw it while manually doing gc (again while repacks
> > were likely to be running):
> 
> This is sort of a side question, but...why are you running other repacks alongside
> git-gc? It seems like you ought to be doing one or the other.
>
> I don't begrudge anybody with a complicated setup running their own set of gc
> commands, but I'd think you would want to do locking there, and disable auto-
> gc entirely. Otherwise you're going to get different results depending on who
> gc'd last.

That's what gitlab does, so you'll have to ask them why they do it that way.
From https://gitlab.com/gitlab-org/gitlab-ce/issues/30939#note_27487981
it looks like they may have intended to have a lock but not quite succeeded.
 
> > $ git gc --aggressive
> > Counting objects: 13800073, done.
> > Delta compression using up to 8 threads.
> > Compressing objects:  99% (11465846/11465971)
> > Compressing objects: 100% (11465971/11465971), done.
> > fatal: packfile [repo]/objects/pack/pack-[sha].pack cannot be accessed
> 
> OK, so this presumably happened during the writing phase. Which seems like the
> "a pack was closed, and we couldn't re-open it" problem we've seen before.
> 
> > We have a reasonable rlimit (64k soft limit), so that failure mode is
> > pretty unlikely.  I  think we should have had 20 or so packs -- not tens of
> thousands.
> > [...]
> > Do you have any idea why this would be happening other than the rlimit thing?
> 
> Yeah, that should be enough (you could double check the return of
> get_max_fd_limit() on your system if you wanted to be paranoid).
> 
> We also keep only a limited number of bytes mmap'd at one time. Normally we
> don't actually close packfiles when we release their mmap windows.
> But I think there is one path that might. When use_pack() maps a pack, if the
> entire pack fits in a single window, then we close it; this is due to d131b7afe
> (sha1_file.c: Don't retain open fds on small packs, 2011-03-02).
> 
> But if we ever unmap that window, now we have no handle to the pack.
> Normally on a 64-bit system this wouldn't happen at all, since the default
> core.packedGitLimit is 8GB there.

Aha, I missed that limit while messing around with the code.  That must be it.

> So if you have a few small packs and one very large pack (over 8GB), I think this
> could trigger. We may do the small-pack thing for some of them, and then the
> large pack forces us to drop the mmaps for some of the others. When we go
> back to access the small pack, we find it's gone.
> 
> One solution would be to bump core.packedGitLimit to something much higher
> (it's an mmap, so we're really just chewing up address space; it's up to the OS to
> decide when to load pages from disk and when to drop them).
>
> The other alternative is to disable the small-pack closing from d131b7afe. It
> might need to be configurable, or perhaps auto-tuned based on the fd limit.
> Linux systems tend to have generous descriptor limits, but I'm not sure we can
> rely on that. OTOH, it seems like the code to close descriptors when needed
> would take care of things. So maybe we should just revert d131b7afe entirely.

I definitely remember running into fd limits when processing very large numbers 
of packs at Twitter, but I don't recall the exact details.  Presumably, d131b7afe
was supposed to help with this, but in fact, it did not totally solve it. Perhaps 
we were doing something funny.  Adjusting the fd limits was the easy fix.

On 64-bit systems, I think core.packedGitLimit doesn't make a 
lot of sense. There is plenty of address space.  Why not use it?

For 32-bit systems, of course, address space is more precious.

I'll ask our git server administrator to adjust core.packedGitLimit
and turn repacks back on to see if that fixes the issue.

> The final thing I'd ask is whether you might be on a networked filesystem that
> would foil our usual "open descriptors mean packs don't go away" logic. But
> after having dug into the details above, I have a feeling the answer is simply that
> you have repositories >8GB.

Yes, our repo is >8GB, and no, it's not on a networked filesystem.

> And if that is the case, then yeah, your locking patch is definitely a band-aid. If
> you fetch and repack at the same time, you'll eventually see a racy failed fetch.

Fair enough.


* Re: [PATCH] repack: respect gc.pid lock
From: Jeff King @ 2017-04-18 17:16 UTC (permalink / raw)
  To: David Turner
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

On Tue, Apr 18, 2017 at 05:08:14PM +0000, David Turner wrote:

> On 64-bit systems, I think core.packedGitLimit doesn't make a 
> lot of sense. There is plenty of address space.  Why not use it?

That's my gut feeling, too. I'd have a slight worry that the OS's paging
behavior may respond differently if we have more memory mapped. But
that's not based on numbers, just a fear of the unknown. :)

If we have infinite windows anyway, I suspect we could just mmap entire
packfiles and forget about all the window complexity in the first place.
Although IIRC some operating systems take a long time to set up large
mmaps, and we may only need a small part of a large pack.

> I'll ask our git server administrator to adjust core.packedGitLimit
> and turn repacks back on to see if that fixes the issue.

Thanks. Let us know if you get any results, either way.

-Peff


* RE: [PATCH] repack: respect gc.pid lock
From: David Turner @ 2017-04-18 17:16 UTC (permalink / raw)
  To: 'Jeff King'
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Monday, April 17, 2017 11:42 PM
> To: David Turner <David.Turner@twosigma.com>
> Cc: git@vger.kernel.org; christian.couder@gmail.com; mfick@codeaurora.org;
> jacob.keller@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> 
> > We saw this failure in the logs multiple  times (with three different
> > shas, while a gc was running):
> > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d
> --pack-kept-objects' in [repo] failed:
> > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > Possibly some other repack was also running at the time as well.
> >
> > My colleague also saw it while manually doing gc (again while repacks
> > were likely to be running):
> 
> This is sort of a side question, but...why are you running other repacks alongside
> git-gc? It seems like you ought to be doing one or the other.

But actually, it would be kind of nice if git would help protect us from doing this?


* Re: [PATCH] repack: respect gc.pid lock
From: Jeff King @ 2017-04-18 17:19 UTC (permalink / raw)
  To: David Turner
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

On Tue, Apr 18, 2017 at 05:16:52PM +0000, David Turner wrote:

> > -----Original Message-----
> > From: Jeff King [mailto:peff@peff.net]
> > Sent: Monday, April 17, 2017 11:42 PM
> > To: David Turner <David.Turner@twosigma.com>
> > Cc: git@vger.kernel.org; christian.couder@gmail.com; mfick@codeaurora.org;
> > jacob.keller@gmail.com
> > Subject: Re: [PATCH] repack: respect gc.pid lock
> > 
> > On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> > 
> > > We saw this failure in the logs multiple  times (with three different
> > > shas, while a gc was running):
> > > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true repack -A -d
> > --pack-kept-objects' in [repo] failed:
> > > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > > Possibly some other repack was also running at the time as well.
> > >
> > > My colleague also saw it while manually doing gc (again while repacks
> > > were likely to be running):
> > 
> > This is sort of a side question, but...why are you running other repacks alongside
> > git-gc? It seems like you ought to be doing one or the other.
> 
> But actually, it would be kind of nice if git would help protect us from doing this?

A lock can catch the racy cases where both run at the same time. But I
think that even:

  git -c repack.writeBitmaps=true repack -Ad
  [...wait...]
  git gc

is questionable, because that gc will erase your bitmaps. How does
git-gc know that it's doing a bad thing by repacking without bitmaps,
and that you didn't simply change your configuration or want to get rid
of them?

-Peff


* RE: [PATCH] repack: respect gc.pid lock
From: David Turner @ 2017-04-18 17:43 UTC (permalink / raw)
  To: 'Jeff King'
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com



> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, April 18, 2017 1:20 PM
> To: David Turner <David.Turner@twosigma.com>
> Cc: git@vger.kernel.org; christian.couder@gmail.com; mfick@codeaurora.org;
> jacob.keller@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Tue, Apr 18, 2017 at 05:16:52PM +0000, David Turner wrote:
> 
> > > -----Original Message-----
> > > From: Jeff King [mailto:peff@peff.net]
> > > Sent: Monday, April 17, 2017 11:42 PM
> > > To: David Turner <David.Turner@twosigma.com>
> > > Cc: git@vger.kernel.org; christian.couder@gmail.com;
> > > mfick@codeaurora.org; jacob.keller@gmail.com
> > > Subject: Re: [PATCH] repack: respect gc.pid lock
> > >
> > > On Mon, Apr 17, 2017 at 11:29:18PM +0000, David Turner wrote:
> > >
> > > > We saw this failure in the logs multiple  times (with three
> > > > different shas, while a gc was running):
> > > > April 12, 2017 06:45 -> ERROR -> 'git -c repack.writeBitmaps=true
> > > > repack -A -d
> > > --pack-kept-objects' in [repo] failed:
> > > > fatal: packfile ./objects/pack/pack-[sha].pack cannot be accessed
> > > > Possibly some other repack was also running at the time as well.
> > > >
> > > > My colleague also saw it while manually doing gc (again while
> > > > repacks were likely to be running):
> > >
> > > This is sort of a side question, but...why are you running other
> > > repacks alongside git-gc? It seems like you ought to be doing one or the
> other.
> >
> > But actually, it would be kind of nice if git would help protect us from doing
> this?
> 
> A lock can catch the racy cases where both run at the same time. But I think that
> even:
> 
>   git -c repack.writeBitmaps=true repack -Ad
>   [...wait...]
>   git gc
> 
> is questionable, because that gc will erase your bitmaps. How does git-gc know
> that it's doing a bad thing by repacking without bitmaps, and that you didn't
> simply change your configuration or want to get rid of them?

Sorry, the gc in Gitlab does keep bitmaps.  The one I quoted in a previous
message doesn't, because the person typing the command was just doing some
manual testing and I guess didn't realize that bitmaps were important.  Or
perhaps he knew that repack.writeBitmaps was already set in the config.

So given that the lock will catch the races, might it be a good idea (if
implemented to avoid locking on repack -d)?


* Re: [PATCH] repack: respect gc.pid lock
From: Jeff King @ 2017-04-18 17:50 UTC (permalink / raw)
  To: David Turner
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

On Tue, Apr 18, 2017 at 05:43:29PM +0000, David Turner wrote:

> > A lock can catch the racy cases where both run at the same time. But I think that
> > even:
> > 
> >   git -c repack.writeBitmaps=true repack -Ad
> >   [...wait...]
> >   git gc
> > 
> > is questionable, because that gc will erase your bitmaps. How does git-gc know
> > that it's doing a bad thing by repacking without bitmaps, and that you didn't
> > simply change your configuration or want to get rid of them?
> 
> Sorry, the gc in Gitlab does keep bitmaps.  The one I quoted in a previous 
> message  doesn't, because the person typing the command was just doing some 
> manual  testing and I guess didn't realize that bitmaps were important.  Or 
> perhaps he knew that repack.writeBitmaps was already set in the config.

Sure, but I guess I'd just wonder what _else_ is different between the
commands (and if nothing, why are both running).

> So given that the lock will catch the races, might it be a good idea (if 
> Implemented to avoid locking on repack -d)?

I'm mildly negative just because it increases complexity, and I don't
think it's actually buying very much. It's not clear to me which
invocations of repack would want to lock and which ones wouldn't.

Is "-a" or "-A" the key factor? Are there current callers who prefer the
current behavior of "possibly duplicate some work, but never report
failure" versus "do not duplicate work, but sometimes fail due to lock
contention"?

-Peff


* RE: [PATCH] repack: respect gc.pid lock
From: David Turner @ 2017-04-20 20:10 UTC (permalink / raw)
  To: 'Jeff King'
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

> -----Original Message-----
> From: Jeff King [mailto:peff@peff.net]
> Sent: Tuesday, April 18, 2017 1:50 PM
> To: David Turner <David.Turner@twosigma.com>
> Cc: git@vger.kernel.org; christian.couder@gmail.com; mfick@codeaurora.org;
> jacob.keller@gmail.com
> Subject: Re: [PATCH] repack: respect gc.pid lock
> 
> On Tue, Apr 18, 2017 at 05:43:29PM +0000, David Turner wrote:
> 
> > > A lock can catch the racy cases where both run at the same time. But
> > > I think that
> > > even:
> > >
> > >   git -c repack.writeBitmaps=true repack -Ad
> > >   [...wait...]
> > >   git gc
> > >
> > > is questionable, because that gc will erase your bitmaps. How does
> > > git-gc know that it's doing a bad thing by repacking without
> > > bitmaps, and that you didn't simply change your configuration or want to get
> rid of them?
> >
> > Sorry, the gc in Gitlab does keep bitmaps.  The one I quoted in a
> > previous message  doesn't, because the person typing the command was
> > just doing some manual  testing and I guess didn't realize that
> > bitmaps were important.  Or perhaps he knew that repack.writeBitmaps was
> already set in the config.
> 
> Sure, but I guess I'd just wonder what _else_ is different between the commands
> (and if nothing, why are both running).

Presumably, repack is faster, and they're not intended to run concurrently (but 
there's a Gitlab bug causing them to do so).  But you'll have to ask the Gitlab 
folks for more details.

> > So given that the lock will catch the races, might it be a good idea
> > (if Implemented to avoid locking on repack -d)?
> 
> I'm mildly negative just because it increases complexity, and I don't think it's
> actually buying very much. It's not clear to me which invocations of repack
> would want to lock and which ones wouldn't.
> 
> Is "-a" or "-A" the key factor? Are there current callers who prefer the current
> behavior of "possibly duplicate some work, but never report failure" versus "do
> not duplicate work, but sometimes fail due to lock contention"?

One problem with failing is that it can leave a temp pack behind.

I think the correct fix is to change the default core.packedGitLimit on 64-bit
machines to 32 terabytes (2**45 bytes).  That's because on modern Intel
processors, there are 48 bits of address space actually available, but the
kernel is probably going to reserve a few bits.  My machine claims to have
2**46 bytes of virtual address space available.  32 terabytes is also several
times bigger than any repo that I know of or can easily imagine.
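
(For concreteness, a sketch of the kind of change I mean -- the macro name
and exact form are hypothetical, not an actual patch:)

  /*
   * Hypothetical: 32 TiB (2**45 bytes) on 64-bit platforms, keep the
   * existing conservative 256 MiB default on 32-bit ones.
   */
  #define DEFAULT_PACKED_GIT_LIMIT \
          ((1024L * 1024L) * (size_t)(sizeof(void *) >= 8 ? 32 * 1024 * 1024 : 256))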

Does that seem reasonable to you?


* Re: [PATCH] repack: respect gc.pid lock
From: Jeff King @ 2017-04-20 20:14 UTC (permalink / raw)
  To: David Turner
  Cc: git@vger.kernel.org, christian.couder@gmail.com,
	mfick@codeaurora.org, jacob.keller@gmail.com

On Thu, Apr 20, 2017 at 08:10:24PM +0000, David Turner wrote:

> > Is "-a" or "-A" the key factor? Are there current callers who prefer the current
> > behavior of "possibly duplicate some work, but never report failure" versus "do
> > not duplicate work, but sometimes fail due to lock contention"?
> 
> One problem with failing is that it can leave a temp pack behind.

Yeah. IMHO we should probably treat failed object and pack writes as
normal tempfiles and remove them (but possibly respect a "debug mode"
that leaves them around). But that's another patch entirely.

> I think the correct fix is to change the default code.packedGitLimit on 64-bit 
> machines to 32 terabytes (2**45 bytes).  That's because on modern Intel 
> processors, there are 48 bits of address space actually available, but the kernel 
> is going to probably reserve a few bits.  My machine claims to have 2**46 bytes 
> of virtual address space available.  It's also several times bigger than any 
> repo that I know of or can easily imagine.
> 
> Does that seem reasonable to you?

Yes, it does.

-Peff
