* Re: [PATCH 06/15] run-job: auto-size or use custom pack-files batch
@ 2020-04-30 16:48 Son Luong Ngoc
  2020-04-30 20:13 ` Derrick Stolee
  0 siblings, 1 reply; 3+ messages in thread
From: Son Luong Ngoc @ 2020-04-30 16:48 UTC (permalink / raw)
To: gitgitgadget; +Cc: dstolee, git, jrnieder, peff, stolee

Hi Derrick,

I have been reviewing these jobs' mechanics closely and have some questions:

> The dynamic default size is computed with this idea in mind for
> a client repository that was cloned from a very large remote: there
> is likely one "big" pack-file that was created at clone time. Thus,
> do not try repacking it as it is likely packed efficiently by the
> server. Instead, try packing the other pack-files into a single
> pack-file.
>
> The size is then computed as follows:
>
> batch size = total size - max pack size

Could you please elaborate on why this is the best value?

In practice I have been testing this out with the following:

> % cat debug.sh
> #!/bin/bash
>
> temp=$(du -cb .git/objects/pack/*.pack)
>
> total_size=$(echo "$temp" | grep total | awk '{print $1}')
> echo total_size
> echo $total_size
>
> biggest_pack=$(echo "$temp" | sort -n | tail -2 | head -1 | awk '{print $1}')
> echo biggest pack
> echo $biggest_pack
>
> batch_size=$(expr $total_size - $biggest_pack)
> echo batch size
> echo $batch_size

If you were to run

> git multi-pack-index repack --batch-size=$(./debug.sh | tail -1)

then nothing would be repacked.
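To make the formula concrete, here is a small sketch with invented pack sizes (all numbers below are made up for illustration; they are not from any real repository):

```shell
#!/bin/sh
# Hypothetical pack sizes in bytes: one big clone-time pack, two small packs.
big=1000000000
small1=5000000
small2=7000000

total=$((big + small1 + small2))
batch=$((total - big))   # batch size = total size - max pack size
echo "$batch"            # prints 12000000, the sum of the two small packs
```

So the computed batch size is exactly the combined size of everything except the biggest pack, which is why the stated intent is "repack all the small packs into one".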
Instead, I have had a lot more success with the following:

> # Get the 2nd biggest pack size (in bytes) + 1
> $(du -b .git/objects/pack/*pack | sort -n | tail -2 | head -1 | awk '{print $1}') + 1

I think you also used this approach in t5319 when you used the 3rd
biggest pack size:

> test_expect_success 'repack creates a new pack' '
>         (
>                 cd dup &&
>                 ls .git/objects/pack/*idx >idx-list &&
>                 test_line_count = 5 idx-list &&
>                 THIRD_SMALLEST_SIZE=$(test-tool path-utils file-size .git/objects/pack/*pack | sort -n | head -n 3 | tail -n 1) &&
>                 BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
>                 git multi-pack-index repack --batch-size=$BATCH_SIZE &&
>                 ls .git/objects/pack/*idx >idx-list &&
>                 test_line_count = 6 idx-list &&
>                 test-tool read-midx .git/objects | grep idx >midx-list &&
>                 test_line_count = 6 midx-list
>         )
> '

Looking forward to a re-roll of this RFC.

Cheers,
Son Luong.

^ permalink raw reply	[flat|nested] 3+ messages in thread
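The one-liner above can be packaged as a small helper function (a sketch, assuming GNU `du` and at least two pack-files; the function name is invented):

```shell
#!/bin/sh
# Batch size = size of the second-biggest pack + 1 byte: the
# second-biggest pack (and everything smaller) becomes eligible for
# repacking, while the biggest pack is left alone.
second_biggest_plus_one () {
	pack_dir=${1:-.git/objects/pack}
	second=$(du -b "$pack_dir"/*.pack 2>/dev/null |
		sort -n | tail -2 | head -1 | awk '{print $1}')
	echo $(( ${second:-0} + 1 ))
}
```

It could then be used as `git multi-pack-index repack --batch-size=$(second_biggest_plus_one)`. Note that with exactly one pack the pipeline returns that pack's own size, so a robust script would count the packs first.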
* Re: [PATCH 06/15] run-job: auto-size or use custom pack-files batch
  2020-04-30 16:48 [PATCH 06/15] run-job: auto-size or use custom pack-files batch Son Luong Ngoc
@ 2020-04-30 20:13 ` Derrick Stolee
  0 siblings, 0 replies; 3+ messages in thread
From: Derrick Stolee @ 2020-04-30 20:13 UTC (permalink / raw)
To: Son Luong Ngoc, gitgitgadget; +Cc: dstolee, git, jrnieder, peff

On 4/30/2020 12:48 PM, Son Luong Ngoc wrote:
> Hi Derrick,
>
> I have been reviewing these jobs' mechanics closely and have some questions:
>
>> The dynamic default size is computed with this idea in mind for
>> a client repository that was cloned from a very large remote: there
>> is likely one "big" pack-file that was created at clone time. Thus,
>> do not try repacking it as it is likely packed efficiently by the
>> server. Instead, try packing the other pack-files into a single
>> pack-file.
>>
>> The size is then computed as follows:
>>
>> batch size = total size - max pack size
>
> Could you please elaborate on why this is the best value?

The intention was to repack everything _except_ the biggest pack, but
clearly that doesn't always work. There is some logic to "guess" the
size of the resulting pack that doesn't always reach the total batch
size, so nothing happens. More investigation is required here.
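The failure mode described here can be sketched in shell (the real selection lives in midx.c and estimates each pack's "expected size" from its live-object ratio; the 90% factor and all sizes below are invented for illustration):

```shell
#!/bin/sh
# Simulate the repack selection: with batch = total - max, the small
# packs' on-disk sizes sum to exactly the batch size, so any downward
# "expected size" estimate leaves the sum short of the batch and the
# repack does nothing.
simulate () {
	# $1 = big pack size, $2 and $3 = small pack sizes,
	# $4 = assumed live-object ratio in percent
	batch=$(( $2 + $3 ))                 # total - max = sum of small packs
	expected=$(( ($2 + $3) * $4 / 100 )) # shrunk by the size guess
	if [ "$expected" -ge "$batch" ]; then
		echo repack
	else
		echo "nothing to do"
	fi
}

simulate 1000000 6000 4000 90   # prints "nothing to do"
```

Only when the estimate is not shrunk at all (a 100% ratio) does the selection reach the batch size and trigger a repack.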
> In practice I have been testing this out with the following:
>
>> % cat debug.sh
>> #!/bin/bash
>>
>> temp=$(du -cb .git/objects/pack/*.pack)
>>
>> total_size=$(echo "$temp" | grep total | awk '{print $1}')
>> echo total_size
>> echo $total_size
>>
>> biggest_pack=$(echo "$temp" | sort -n | tail -2 | head -1 | awk '{print $1}')
>> echo biggest pack
>> echo $biggest_pack
>>
>> batch_size=$(expr $total_size - $biggest_pack)
>> echo batch size
>> echo $batch_size
>
> If you were to run
>
>> git multi-pack-index repack --batch-size=$(./debug.sh | tail -1)
>
> then nothing would be repacked.
>
> Instead, I have had a lot more success with the following:
>
>> # Get the 2nd biggest pack size (in bytes) + 1
>> $(du -b .git/objects/pack/*pack | sort -n | tail -2 | head -1 | awk '{print $1}') + 1
>
> I think you also used this approach in t5319 when you used the 3rd
> biggest pack size.

The "second biggest pack" is an interesting approach. At first glance
it seems like we will stabilize with one big pack and many
similarly-sized packs. However, even a small deviation in size is
inevitable and will cause two or more packs to combine and create a
"new second biggest" pack.

> Looking forward to a re-roll of this RFC.

I do plan to submit a new version of the RFC, but it will look quite
different based on the feedback so far. I'm still digesting that
feedback and will take another attempt at it after I wrap up some other
items that have my attention currently.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 3+ messages in thread
* [PATCH 00/15] [RFC] Maintenance jobs and job runner
@ 2020-04-03 20:47 Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
  0 siblings, 1 reply; 3+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:47 UTC (permalink / raw)
To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee

Background maintenance. I've blogged about it [1]. I've talked about it
[2]. I brought it up at the contributor summit [3]. I think it's
important. It's a critical feature in Scalar and VFS for Git. It's the
main reason we created the "scalar register" feature for existing repos
(so the Scalar background maintenance would apply to a "vanilla" Git
repo).

[1] https://devblogs.microsoft.com/devops/introducing-scalar/
[2] https://stolee.dev/docs/git-merge-2020.pdf
[3] https://lore.kernel.org/git/35FDF767-CE0C-4C51-88A8-12965CD2D4FF@jramsay.com.au/

This RFC does the "maintenance" half of "background maintenance". It
creates two new builtins to assist users in scheduling their own
background maintenance:

* git run-job <job-name>: This builtin will run a single instance of a
  maintenance job.

* git job-runner [--repo=<path>]: This builtin will run an infinite
  loop that executes git run-job as a subcommand.

I believe that I could replace most maintenance steps in Scalar with
the git run-job command and all that we would lose is some logging.
(The only one that would remain is the one that sets recommended config
settings, but that could be changed to a one-time operation at service
start-up.) I haven't tested this yet, but I plan to before sending a
non-RFC version.

Of course, just because we think these maintenance operations work for
our scenario does not mean that they immediately work as the
"canonical" maintenance activities. I attempt to demonstrate the value
of these jobs as they are implemented.
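A rough shell paraphrase of what one pass of that loop amounts to (a sketch of the concept only — the real builtin is written in C, consults per-job config and intervals, and the function name here is invented; repo paths without whitespace assumed):

```shell
#!/bin/sh
# One pass of the runner: for every repository registered under the
# job.repo config key, run each maintenance job in sequence.
run_jobs_once () {
	for repo in $(git config --get-all job.repo); do
		for job in commit-graph fetch loose-objects pack-files; do
			git -C "$repo" run-job "$job"
		done
	done
}
```

The actual git job-runner additionally sleeps between iterations and can skip jobs per configuration.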
My perspective is skewed towards very large repositories (2000+
full-time contributors), but even medium-size repositories (100-200
full-time contributors) can benefit from these jobs.

I've tried to split the series into logical commits. This includes each
specific maintenance job being completely described in its own commit
(plus maybe an extra one to allow a custom option). Most commit
messages have "RFC QUESTION(S)" for things I was unsure about.

I'm currently testing this locally by setting job.repo in my global
config to be a few important repos on my Linux VM then running

  GIT_TRACE2_PERF=<path> git job-runner --daemonize

to launch a background process that logs the subcommands to <path>.

Here I will call out things that could definitely be improved:

1. The git job-runner process is not tested AT ALL. It's a bit tricky
   to test it since it shouldn't exit when things work as expected. I
   expect to create a --no-loop option to stop after one iteration of
   the job loop. But I would like a bit more feedback on the concept
   before jumping into those tests. (I do test the git run-job builtin
   for each job, but not the customization through arguments and
   config.) Perhaps we could do some funny business about mocking git
   using --exec-path to check that it is calling git run-job in the
   correct order (and more importantly, not calling it when certain
   config settings are present).

2. The --daemonize option at the end is shamelessly stolen from git gc
   --daemonize and git daemon, but has limited abilities on some
   platforms (I've tested on Linux and macOS). I have not done my
   research on how far this gets us to allowing users to launch this
   at startup or something.

3. As I said, this is the "maintenance" "half" of "background
   maintenance". The "background" part is harder in my opinion because
   it involves creating platform-specific ways to consistently launch
   background processes. For example, Unix systems should have one way
   to service start X while Windows has another.
macOS has launchd to launch processes as users log in, which should be
a good way forward. Scalar implements a Windows Service that runs as
root but impersonates the latest user to log in, and it implements a
macOS "service" that is running only with the current user. I expect to
need to create these services myself as a follow-up, but I lack the
expertise to do it (currently). If someone else has experience creating
these things and wants to take over or advise that half then I would
appreciate the help!

4. I noticed late in the RFC process that I'm not clearing my
   argv_arrays carefully in the job-runner. This will need to be
   rectified and carefully checked with valgrind before merging this
   code. While it leaks memory very slowly, it will be important that
   we don't leak any memory at all since this is a long-lived process.
   There's also some places where I was careful to not include too much
   of libgit.a to help keep the memory footprint low.

Thanks,
-Stolee

Derrick Stolee (15):
  run-job: create barebones builtin
  run-job: implement commit-graph job
  run-job: implement fetch job
  run-job: implement loose-objects job
  run-job: implement pack-files job
  run-job: auto-size or use custom pack-files batch
  config: add job.pack-files.batchSize option
  job-runner: create builtin for job loop
  job-runner: load repos from config by default
  job-runner: use config to limit job frequency
  job-runner: use config for loop interval
  job-runner: add --interval=<span> option
  job-runner: skip a job if job.<job-name>.enabled is false
  job-runner: add --daemonize option
  runjob: customize the loose-objects batch size

 .gitignore                       |   2 +
 Documentation/config.txt         |   2 +
 Documentation/config/job.txt     |  37 +++
 Documentation/git-job-runner.txt |  63 +++++
 Documentation/git-run-job.txt    | 102 +++++++
 Makefile                         |   2 +
 builtin.h                        |   2 +
 builtin/job-runner.c             | 347 +++++++++++++++++++++++
 builtin/run-job.c                | 458 +++++++++++++++++++++++++++++++
 cache.h                          |   4 +-
 command-list.txt                 |   2 +
 commit-graph.c                   |   2 +-
 commit-graph.h                   |   1 +
 daemon.h                         |   7 +
 git.c                            |   2 +
 midx.c                           |   2 +-
 midx.h                           |   1 +
 t/t7900-run-job.sh               | 137 +++++++++
 18 files changed, 1168 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/config/job.txt
 create mode 100644 Documentation/git-job-runner.txt
 create mode 100644 Documentation/git-run-job.txt
 create mode 100644 builtin/job-runner.c
 create mode 100644 builtin/run-job.c
 create mode 100644 daemon.h
 create mode 100755 t/t7900-run-job.sh

base-commit: 9fadedd637b312089337d73c3ed8447e9f0aa775
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-597%2Fderrickstolee%2Fjobs-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-597/derrickstolee/jobs-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/597
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 3+ messages in thread
* [PATCH 06/15] run-job: auto-size or use custom pack-files batch
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  0 siblings, 0 replies; 3+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When repacking during the 'pack-files' job, we use the --batch-size
option in 'git multi-pack-index repack'. The initial setting used
--batch-size=0 to repack everything into a single pack-file. This is
not sustainable for a large repository. The amount of work required is
also likely to use too many system resources for a background job.

Update the 'git run-job pack-files' command by allowing a direct
--batch-size option that can change the value provided. Update the
default of "0" to a computed size based on the existing pack-files.
While computing that new size, count the existing pack-files and skip
the repack step if there are at most two pack-files.

The dynamic default size is computed with this idea in mind for a
client repository that was cloned from a very large remote: there is
likely one "big" pack-file that was created at clone time. Thus, do
not try repacking it as it is likely packed efficiently by the server.
Instead, try packing the other pack-files into a single pack-file.

The size is then computed as follows:

	batch size = total size - max pack size

The batch size is then also limited to be at most two gigabytes. This
serves two purposes. First, having a limit prevents doing too much work
when the repository is extremely large. Pack-files larger than two
gigabytes are likely to either contain large blobs or have been
carefully repacked by a previous repack operation. Second, two
gigabytes is the size limit for a signed 32-bit int.
It's a good limit to consider, and to keep it far away from the
unsigned 32-bit int limit. This limit comes to mind because on Windows
an "unsigned long" is 32 bits and OPT_MAGNITUDE() uses unsigned longs
for its parsing logic.

This calculation mimics a similar calculation in Scalar [1], except for
a 3% drop in the calculated batch size due to the round-off error that
can happen with the "expected size" calculation for a pack-file.

[1] https://github.com/microsoft/scalar/blob/616e9b16dd120b8fdb652d6d5a55618c731a8aea/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L141-L143

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 13 +++--
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++---
 t/t7900-run-job.sh            |  6 ++-
 3 files changed, 96 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 108ed25b8bd..cdd6417f7c9 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects|pack-files)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files) [<options>]'
 
 
 DESCRIPTION
@@ -82,9 +82,14 @@ only happens if all objects in the pack-file are also stored in a newer
 pack-file. Second, it selects a group of pack-files whose "expected
 size" is below the batch size until the group has total expected size
 at least the batch size; see the `--batch-size` option for the `repack`
-subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
-zero, which is a special case that attempts to repack all pack-files
-into a single pack-file.
+subcommand in linkgit:git-multi-pack-index[1].
++
+The default batch size is computed to optimize for having a single large
+pack-file and many small pack-files. When there are two or fewer
+pack-files, the job does not attempt to repack. Otherwise, the batch
+size is the sum of all pack-file sizes minus the largest pack-file size.
+The batch size is capped at two gigabytes. This intends to pack all
+small pack-files into a single pack-file.
 
 
 GIT
diff --git a/builtin/run-job.c b/builtin/run-job.c
index d3543f7ccb9..2ccc3bbae2d 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -3,12 +3,18 @@
 #include "commit-graph.h"
 #include "midx.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files) [<options>]"),
+	NULL
+};
+
+static char const * const builtin_run_job_pack_file_usage[] = {
+	N_("git run-job pack-files [--batch-size=<size>]"),
 	NULL
 };
 
@@ -278,15 +284,74 @@ static int multi_pack_index_expire(void)
 	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
 }
 
-static int multi_pack_index_repack(void)
+#define TWO_GIGABYTES (2147483647)
+
+static off_t get_auto_pack_size(int *count)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly. Find the sum of the sizes
+	 * other than the largest pack-file, then use
+	 * that as the batch size.
+	 */
+	off_t total_size = 0;
+	off_t max_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+
+	*count = 0;
+
+	reprepare_packed_git(the_repository);
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		(*count)++;
+		total_size += p->pack_size;
+
+		if (p->pack_size > max_size)
+			max_size = p->pack_size;
+	}
+
+	result_size = total_size - max_size;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+static int multi_pack_index_repack(unsigned long batch_size)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+	int count;
+	off_t default_size = get_auto_pack_size(&count);
+
+	if (count <= 2)
+		return 0;
+
+	strbuf_addstr(&batch_arg, "--batch-size=");
+
+	if (batch_size != UNSET_BATCH_SIZE)
+		strbuf_addf(&batch_arg, "\"%"PRIuMAX"\"", (uintmax_t)batch_size);
+	else
+		strbuf_addf(&batch_arg, "%"PRIuMAX,
+			    (uintmax_t)default_size);
+
 	argv_array_pushl(&cmd, "multi-pack-index", "repack",
-			 "--no-progress", "--batch-size=0", NULL);
+			 "--no-progress", batch_arg.buf, NULL);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
 
-	if (result && multi_pack_index_verify()) {
+	strbuf_release(&batch_arg);
+
+	/*
+	 * Verify here to avoid verifying again when there are two
+	 * or fewer pack-files.
+	 */
+	if (!result && multi_pack_index_verify()) {
 		warning(_("multi-pack-index verify failed after repack"));
 		result = rewrite_multi_pack_index();
 	}
@@ -294,8 +359,19 @@ static int multi_pack_index_repack(void)
 	return result;
 }
 
-static int run_pack_files_job(void)
+static int run_pack_files_job(int argc, const char **argv)
 {
+	static unsigned long batch_size = UNSET_BATCH_SIZE;
+	static struct option builtin_run_job_pack_file_options[] = {
+		OPT_MAGNITUDE(0, "batch-size", &batch_size,
+			      N_("specify a batch-size for the incremental repack")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_run_job_pack_file_options,
+			     builtin_run_job_pack_file_usage, 0);
+
 	if (multi_pack_index_write()) {
 		error(_("failed to write multi-pack-index"));
 		return 1;
@@ -316,7 +392,7 @@ static int run_pack_files_job(void)
 		return rewrite_multi_pack_index();
 	}
 
-	if (multi_pack_index_repack()) {
+	if (multi_pack_index_repack(batch_size)) {
 		error(_("multi-pack-index repack failed"));
 		return 1;
 	}
@@ -348,7 +424,7 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
 		if (!strcmp(argv[0], "pack-files"))
-			return run_pack_files_job();
+			return run_pack_files_job(argc, argv);
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 416ba04989d..2d9f6cdf328 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -128,8 +128,10 @@ test_expect_success 'pack-files job' '
 	# the job deletes the two old packs, and does not write
 	# a new one because only one pack remains.
 	git -C client run-job pack-files &&
-	ls client/.git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	ls client/$packDir/*.pack >packs-after &&
+	test_line_count = 2 packs-after &&
+	cat packs-after | grep "pack/test-1-" &&
+	cat packs-after | grep "pack/pack-"
 '
 
 test_done
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 3+ messages in thread
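For readers who want to experiment with the new default without building the patch, the computation in get_auto_pack_size() can be approximated in shell (a sketch only, assuming GNU stat; the real code reads sizes from the in-memory packed_git list, and the function name is invented):

```shell
#!/bin/sh
# Approximate get_auto_pack_size(): with more than two packs, return
# (sum of all pack sizes) - (largest pack size), capped at 2^31 - 1;
# otherwise return 0 to signal "skip the repack".
auto_pack_size () {
	pack_dir=${1:-.git/objects/pack}
	stat -c %s "$pack_dir"/*.pack 2>/dev/null | awk '
		{ total += $1; if ($1 > max) max = $1; n++ }
		END {
			if (n <= 2) { print 0; exit }
			size = total - max
			if (size > 2147483647) size = 2147483647
			print size
		}'
}
```

Running `auto_pack_size` in a repository with one big clone pack and several small ones should print roughly the combined size of the small packs, matching the commit message's formula.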
end of thread, other threads:[~2020-04-30 20:14 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-30 16:48 [PATCH 06/15] run-job: auto-size or use custom pack-files batch Son Luong Ngoc
2020-04-30 20:13 ` Derrick Stolee
  -- strict thread matches above, loose matches on Subject: below --
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).