* [PATCH 05/15] run-job: implement pack-files job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-05-27 22:17   ` Josh Steadmon
  0 siblings, 1 reply; 4+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' job, which can be run safely in the background.
Add a similar job that performs related cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index has made this
process possible since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).
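
For orientation, the job amounts to running the following commands in
order (a rough sketch only; the builtin runs them as subprocesses and
adds the verification described below):

  git multi-pack-index write --no-progress
  git multi-pack-index expire --no-progress
  git multi-pack-index repack --no-progress --batch-size=0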

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one does not exist, and otherwise updates the multi-pack-index
   with any new pack-files that have appeared since the last write.
   This is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced", which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updates the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those pack-files. The set of
   pack-files is selected greedily: sort the pack-files by modified
   time, then add a pack-file to the set if its "expected size" is
   smaller than the batch size, until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file, dividing
   it by the number of objects in the pack-file, and multiplying by
   the number of objects from the multi-pack-index with an offset in
   that pack-file. The expected size approximates how much data from
   that pack-file will contribute to the resulting pack-file size.
   The intention is that the resulting pack-file will be close in
   size to the provided batch size. (A worked example follows this
   list.)

   The next run of the pack-files job will delete these repacked
   pack-files during the 'expire' step.

   In this version, the batch size is set to "0", which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.
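
To make the "expected size" heuristic concrete, here is a worked
example with made-up numbers (purely illustrative; they do not come
from any real repository):

  # A 1 GiB pack-file stores 400,000 objects, but only 100,000 of
  # them are still referenced by the multi-pack-index.
  pack_size=1073741824
  objects_in_pack=400000
  referenced_in_midx=100000

  # expected size = pack_size / objects_in_pack * referenced_in_midx
  echo $((pack_size / objects_in_pack * referenced_in_midx))
  # => 268400000 bytes, roughly 256 MiB of expected contribution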

Each of the above steps updates the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then we delete the multi-pack-index,
rewrite it from scratch, and stop running the later steps of the
job. This is intended as an extra safety check that avoids leaving
a repo with many pack-files but no multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that together contained up to 250 GB of data. We noticed a few users
were running into open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3392 (midx: add packs to
packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was all done without removing objects
from the system, using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to a small size, the
two-gigabyte batch size means that not every run triggers a repack
operation, and a run that does not repack leaves nothing for the
following run to expire. This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 18 ++++++-
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
 midx.c                        |  2 +-
 midx.h                        |  1 +
 t/t7900-run-job.sh            | 39 +++++++++++++++
 5 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 43ca1160b5a..108ed25b8bd 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files)'
 
 
 DESCRIPTION
@@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
 objects to prevent the job from taking too long on a repository with
 many loose objects.
 
+'pack-files'::
+
+The `pack-files` job incrementally repacks the object directory using
+the `multi-pack-index` feature. In order to prevent race conditions with
+concurrent Git commands, it follows a two-step process. First, it
+deletes any pack-files included in the `multi-pack-index` where none of
+the objects in the `multi-pack-index` reference those pack-files; this
+only happens if all objects in the pack-file are also stored in a newer
+pack-file. Second, it selects a group of pack-files whose "expected
+size" is below the batch size until the group has total expected size at
+least the batch size; see the `--batch-size` option for the `repack`
+subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
+zero, which is a special case that attempts to repack all pack-files
+into a single pack-file.
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/run-job.c b/builtin/run-job.c
index cecf9058c51..d3543f7ccb9 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -1,13 +1,14 @@
 #include "builtin.h"
 #include "config.h"
 #include "commit-graph.h"
+#include "midx.h"
 #include "object-store.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
 	NULL
 };
 
@@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
 	return prune_packed() || pack_loose();
 }
 
+static int multi_pack_index_write(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "write",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int rewrite_multi_pack_index(void)
+{
+	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	if (multi_pack_index_write()) {
+		error(_("failed to rewrite multi-pack-index"));
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_verify(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "verify",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_expire(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "expire",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_repack(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "repack",
+			 "--no-progress", "--batch-size=0", NULL);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	if (result && multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after repack"));
+		result = rewrite_multi_pack_index();
+	}
+
+	return result;
+}
+
+static int run_pack_files_job(void)
+{
+	if (multi_pack_index_write()) {
+		error(_("failed to write multi-pack-index"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after initial write"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_expire()) {
+		error(_("multi-pack-index expire failed"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after expire"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_repack()) {
+		error(_("multi-pack-index repack failed"));
+		return 1;
+	}
+
+	return 0;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 			return run_fetch_job();
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
+		if (!strcmp(argv[0], "pack-files"))
+			return run_pack_files_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/midx.c b/midx.c
index 1527e464a7b..0f0d0a38812 100644
--- a/midx.c
+++ b/midx.c
@@ -36,7 +36,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index e6fa356b5ca..cf2c09dffc2 100644
--- a/midx.h
+++ b/midx.h
@@ -39,6 +39,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 41da083257b..416ba04989d 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
 '
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 . ./test-lib.sh
 
@@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'pack-files job' '
+	packDir=.git/objects/pack &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+
+	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+
+	rm -f client/$packDir/pack-* &&
+	rm -f client/$packDir/loose-* &&
+
+	ls client/$packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git -C client run-job pack-files &&
+	ls client/$packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git -C client run-job pack-files &&
+	ls client/.git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget



* Re: [PATCH 05/15] run-job: implement pack-files job
@ 2020-05-02  7:56 Son Luong Ngoc
  2020-05-04 14:34 ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread
From: Son Luong Ngoc @ 2020-05-02  7:56 UTC (permalink / raw)
  To: gitgitgadget; +Cc: dstolee, git, jrnieder, peff, stolee

Hi Derrick,

Sorry for another late reply on this RFC.
This time I have a question about the multi-pack-index repack process.

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c

...

> +static int multi_pack_index_repack(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "repack",
> +			 "--no-progress", "--batch-size=0", NULL);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	if (result && multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after repack"));
> +		result = rewrite_multi_pack_index();
> +	}

It's a bit inconsistent that write() and expire() do not include
verify() within them, but repack() does. What makes repack() different?

> +
> +	return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}

Why expire *before* repack and not after?

I thought `core.multiPackIndex=true` would prevent old pack-files from
being used, so expiring immediately after repack is safe? (On that note,
are users required to set this config prior to running the job?)

If expiring immediately after repack()+verify() is not safe, then should
we have a minimum allowed interval set? (Although I would prefer to
make expire() safe.)

> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}

Again, I just think the decision to include verify() inside repack()
made this part a bit inconsistent.

> +
> +	return 0;
> +}
> +

Cheers,
Son Luong


* Re: [PATCH 05/15] run-job: implement pack-files job
  2020-05-02  7:56 [PATCH 05/15] run-job: implement pack-files job Son Luong Ngoc
@ 2020-05-04 14:34 ` Derrick Stolee
  0 siblings, 0 replies; 4+ messages in thread
From: Derrick Stolee @ 2020-05-04 14:34 UTC (permalink / raw)
  To: Son Luong Ngoc, gitgitgadget; +Cc: dstolee, git, jrnieder, peff

On 5/2/2020 3:56 AM, Son Luong Ngoc wrote:
> Hi Derrick,
> 
> Sorry for another late reply on this RFC.
> This time I have a question about the multi-pack-index repack process.
> 
> Why expire *before* repack and not after?
> 
> I thought `core.multiPackIndex=true` would prevent old pack-files from
> being used, so expiring immediately after repack is safe? (On that note,
> are users required to set this config prior to running the job?)
> 
> If expiring immediately after repack()+verify() is not safe, then should
> we have a minimum allowed interval set? (Although I would prefer to
> make expire() safe.)

Suppose we run "repack" and then "expire".

Suppose that there is a concurrent Git process that has a read handle
on the multi-pack-index from before the repack. This multi-pack-index
has a pack-file that was repacked by the "repack" command, and hence
was expired and deleted by the "expire" command (because all of its
objects are in the _new_ pack). However, when the Git process with the
old multi-pack-index reads an object pointing to that pack-file, it
finds that the pack does not exist when it tries to load it.

Git is usually pretty resilient to these kinds of things, but it seemed
prudent to be extra careful here. By spacing these jobs out by an
interval (say, 24 hours) during which we don't expect any concurrent Git
process to still hold a read handle on the old multi-pack-index, this
method is safe.
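
To spell out the hazard in the "repack then expire" order, here is a
hypothetical interleaving (illustration only, not a trace of any real
run):

  # reader:      opens the current multi-pack-index; its pack list
  #              includes pack-A
  # maintenance: git multi-pack-index repack    (pack-A's objects are
  #                                              copied into a new pack)
  # maintenance: git multi-pack-index expire    (pack-A is deleted)
  # reader:      follows an offset into pack-A and finds the file gone
  #
  # With 'expire' running before 'repack', any pack deleted in this run
  # became redundant during the previous run, so a reader would have to
  # hold the old multi-pack-index across an entire maintenance interval
  # to hit this.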

In fact, Scalar ensures that no concurrent Git process is running at the
start of the job [1], which means that no Git processes are _still_ running
since the last job (but this does not stop concurrent processes from starting
after that point and before the 'expire' command).

[1] https://github.com/microsoft/scalar/blob/489764b815dfea32bec5cd4228f2157766012784/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L106-L111

>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after expire"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +
>> +	if (multi_pack_index_repack()) {
>> +		error(_("multi-pack-index repack failed"));
>> +		return 1;
>> +	}
> 
> Again, I just think the decision to include verify() inside repack()
> made this part a bit inconsistent.

You're right. It is a bit inconsistent. That should be fixed in the
next version.

Thanks,
-Stolee


* Re: [PATCH 05/15] run-job: implement pack-files job
  2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
@ 2020-05-27 22:17   ` Josh Steadmon
  0 siblings, 0 replies; 4+ messages in thread
From: Josh Steadmon @ 2020-05-27 22:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The previous change cleaned up loose objects using the
> 'loose-objects' that can be run safely in the background. Add a
> similar job that performs similar cleanups for pack-files.
> 
> One issue with running 'git repack' is that it is designed to
> repack all pack-files into a single pack-file. While this is the
> most space-efficient way to store object data, it is not time or
> memory efficient. This becomes extremely important if the repo is
> so large that a user struggles to store two copies of the pack on
> their disk.
> 
> Instead, perform an "incremental" repack by collecting a few small
> pack-files into a new pack-file. The multi-pack-index facilitates
> this process ever since 'git multi-pack-index expire' was added in
> 19575c7 (multi-pack-index: implement 'expire' subcommand,
> 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
> (midx: implement midx_repack(), 2019-06-10).
> 
> The 'pack-files' job runs the following steps:
> 
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
> 
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."
> 
> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updaes the multi-pack-index to drop those pack-files from the

Typo: updaes -> updates


>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git

Is it still safe for concurrent processes if the repo did not have a
multi-pack-index when the first process started?


>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)
> 
> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set

Typo: bacth-size -> batch-size


>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those objects. The set of pack-
>    files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

Typo: approximats -> approximates


>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.
> 
> [...]

