git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: sandals@crustytoothpaste.net, steadmon@google.com,
	jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com,
	phillip.wood123@gmail.com, emilyshaffer@google.com,
	sluongng@gmail.com, jonathantanmy@google.com,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v2 7/9] maintenance: add incremental-repack task
Date: Tue, 18 Aug 2020 14:25:28 +0000	[thread overview]
Message-ID: <a8d956dad6b3a81d0f1b63cbd48f36a5e1195ae8.1597760730.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.696.v2.git.1597760730.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' that can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'incremental-repack' task runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updaes the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those objects. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximates how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the incremental-repack task will delete these
   repacked pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

Each of the above steps update the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  15 ++++
 builtin/gc.c                      | 113 ++++++++++++++++++++++++++++++
 midx.c                            |   2 +-
 midx.h                            |   1 +
 t/t5319-multi-pack-index.sh       |   1 +
 t/t7900-maintenance.sh            |  38 ++++++++++
 6 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 85205b5bb9..b0b73b7185 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -88,6 +88,21 @@ loose-objects::
 	advisable to enable both the `loose-objects` and `gc` tasks at the
 	same time.
 
+incremental-repack::
+	The `incremental-repack` job repacks the object directory
+	using the `multi-pack-index` feature. In order to prevent race
+	conditions with concurrent Git commands, it follows a two-step
+	process. First, it deletes any pack-files included in the
+	`multi-pack-index` where none of the objects in the
+	`multi-pack-index` reference those pack-files; this only happens
+	if all objects in the pack-file are also stored in a newer
+	pack-file. Second, it selects a group of pack-files whose "expected
+	size" is below the batch size until the group has total expected
+	size at least the batch size; see the `--batch-size` option for
+	the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
+	default batch-size is zero, which is a special case that attempts
+	to repack all pack-files into a single pack-file.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 61d5752af5..9aabef44a8 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -30,6 +30,7 @@
 #include "promisor-remote.h"
 #include "refs.h"
 #include "remote.h"
+#include "midx.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -1028,6 +1029,113 @@ static int maintenance_task_loose_objects(struct maintenance_opts *opts)
 	return prune_packed(opts) || pack_loose(opts);
 }
 
+static int multi_pack_index_write(struct maintenance_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "write", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	if (run_command(&child))
+		return error(_("failed to write multi-pack-index"));
+
+	return 0;
+}
+
+static int rewrite_multi_pack_index(struct maintenance_opts *opts)
+{
+	struct repository *r = the_repository;
+	char *midx_name = get_midx_filename(r->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	return multi_pack_index_write(opts);
+}
+
+static int multi_pack_index_verify(struct maintenance_opts *opts,
+				   const char *message)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "verify", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	if (run_command(&child)) {
+		warning(_("'git multi-pack-index verify' failed %s"), message);
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_expire(struct maintenance_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "expire", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	close_object_store(the_repository->objects);
+
+	if (run_command(&child))
+		return error(_("'git multi-pack-index expire' failed"));
+
+	return 0;
+}
+
+static int multi_pack_index_repack(struct maintenance_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "repack", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	strvec_push(&child.args, "--batch-size=0");
+
+	close_object_store(the_repository->objects);
+
+	if (run_command(&child))
+		return error(_("'git multi-pack-index repack' failed"));
+
+	return 0;
+}
+
+static int maintenance_task_incremental_repack(struct maintenance_opts *opts)
+{
+	prepare_repo_settings(the_repository);
+	if (!the_repository->settings.core_multi_pack_index) {
+		warning(_("skipping incremental-repack task because core.multiPackIndex is disabled"));
+		return 0;
+	}
+
+	if (multi_pack_index_write(opts))
+		return 1;
+	if (multi_pack_index_verify(opts, "after initial write"))
+		return rewrite_multi_pack_index(opts);
+	if (multi_pack_index_expire(opts))
+		return 1;
+	if (multi_pack_index_verify(opts, "after expire step"))
+		return !!rewrite_multi_pack_index(opts);
+	if (multi_pack_index_repack(opts))
+		return 1;
+	if (multi_pack_index_verify(opts, "after repack step"))
+		return !!rewrite_multi_pack_index(opts);
+	return 0;
+}
+
 typedef int maintenance_task_fn(struct maintenance_opts *opts);
 
 /*
@@ -1049,6 +1157,7 @@ struct maintenance_task {
 enum maintenance_task_label {
 	TASK_PREFETCH,
 	TASK_LOOSE_OBJECTS,
+	TASK_INCREMENTAL_REPACK,
 	TASK_GC,
 	TASK_COMMIT_GRAPH,
 
@@ -1066,6 +1175,10 @@ static struct maintenance_task tasks[] = {
 		maintenance_task_loose_objects,
 		loose_object_auto_condition,
 	},
+	[TASK_INCREMENTAL_REPACK] = {
+		"incremental-repack",
+		maintenance_task_incremental_repack,
+	},
 	[TASK_GC] = {
 		"gc",
 		maintenance_task_gc,
diff --git a/midx.c b/midx.c
index aa37d5da86..66d7053d83 100644
--- a/midx.c
+++ b/midx.c
@@ -37,7 +37,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index b18cf53bc4..baeecc70c9 100644
--- a/midx.h
+++ b/midx.h
@@ -37,6 +37,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 12f41dfc18..26f224b0e3 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,6 +3,7 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+GIT_TEST_MULTI_PACK_INDEX=0
 objdir=.git/objects
 
 midx_read_expect () {
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index efda1cf69b..dde28cf837 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -5,6 +5,7 @@ test_description='git maintenance builtin'
 . ./test-lib.sh
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 test_expect_success 'help text' '
 	test_expect_code 129 git maintenance -h 2>err &&
@@ -150,4 +151,41 @@ test_expect_success 'maintenance.loose-objects.auto' '
 	done
 '
 
+test_expect_success 'incremental-repack task' '
+	packDir=.git/objects/pack &&
+	for i in $(test_seq 1 5)
+	do
+		test_commit $i || return 1
+	done &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
+	test_tick &&
+	git pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+	test_tick &&
+	git pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+	rm -f $packDir/pack-* &&
+	rm -f $packDir/loose-* &&
+	ls $packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git maintenance run --task=incremental-repack &&
+	ls $packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git maintenance run --task=incremental-repack &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget


  parent reply	other threads:[~2020-08-18 14:26 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-06 16:30 [PATCH 0/9] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 1/9] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-13  0:03     ` Junio C Hamano
2020-08-13  1:45       ` Jonathan Nieder
2020-08-13  4:37       ` [PATCH v3] " Junio C Hamano
2020-08-14  1:13         ` Derrick Stolee
2020-08-14  1:32           ` Junio C Hamano
2020-08-06 16:30 ` [PATCH 2/9] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-14  1:28     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 3/9] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-14  1:46     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 4/9] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 5/9] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 6/9] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 7/9] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 8/9] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-06 17:02   ` Son Luong Ngoc
2020-08-06 18:13     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 9/9] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-08-18 14:25 ` [PATCH v2 0/9] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 1/9] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 2/9] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 3/9] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 4/9] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 5/9] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 6/9] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` Derrick Stolee via GitGitGadget [this message]
2020-08-18 14:25   ` [PATCH v2 8/9] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 9/9] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-08-25 18:36   ` [PATCH v3 0/8] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 1/8] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-09-22 23:05       ` Jonathan Tan
2020-08-25 18:36     ` [PATCH v3 2/8] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-09-22 23:09       ` Jonathan Tan
2020-09-24 13:45         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 3/8] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-09-22 23:15       ` Jonathan Tan
2020-09-24 13:51         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 4/8] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-09-22 23:16       ` Jonathan Tan
2020-09-24 13:53         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 5/8] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 6/8] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-09-22 23:26       ` Jonathan Tan
2020-09-24 14:05         ` Derrick Stolee
2020-09-24 22:01           ` Jonathan Tan
2020-08-25 18:36     ` [PATCH v3 7/8] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 8/8] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-09-22 23:52       ` Jonathan Tan
2020-08-25 20:59     ` [PATCH v3 0/8] Maintenance II: prefetch, loose-objects, incremental-repack tasks Junio C Hamano
2020-08-26 15:15     ` Son Luong Ngoc
2020-08-26 16:21       ` Derrick Stolee
2020-09-25 12:33     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 1/8] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 2/8] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 3/8] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-09-25 18:00         ` Junio C Hamano
2020-09-25 18:43           ` Derrick Stolee
2020-09-25 12:33       ` [PATCH v4 4/8] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 5/8] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 6/8] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 7/8] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 8/8] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a8d956dad6b3a81d0f1b63cbd48f36a5e1195ae8.1597760730.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=congdanhqx@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=phillip.wood123@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=sluongng@gmail.com \
    --cc=steadmon@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).