git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH 00/15] [RFC] Maintenance jobs and job runner
@ 2020-04-03 20:47 Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
                   ` (15 more replies)
  0 siblings, 16 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:47 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee

Background maintenance. I've blogged about it [1]. I've talked about it [2].
I brought it up at the contributor summit [3]. I think it's important. It's
a critical feature in Scalar and VFS for Git. It's the main reason we
created the "scalar register" feature for existing repos (so the Scalar
background maintenance would apply to a "vanilla" Git repo).

[1] https://devblogs.microsoft.com/devops/introducing-scalar/[2] 
https://stolee.dev/docs/git-merge-2020.pdf[3] 
https://lore.kernel.org/git/35FDF767-CE0C-4C51-88A8-12965CD2D4FF@jramsay.com.au/

This RFC does the "maintenance" half of "background maintenance". It creates
two new builtins to assist users in scheduling their own background
maintenance:

 * git run-job <job-name>: This builtin will run a single instance of a
   maintenance job.
   
   
 * git job-runner [--repo=<path>]: This builtin will run an infinite loop
   that executes git run-job as a subcommand.
   
   

I believe that I could replace most maintenance steps in Scalar with the git
run-job command and all that we would lose is some logging. (The only one
that would remain is the one that sets recommended config settings, but that
could be changed to a one-time operation at service start-up.) I haven't
tested this yet, but I plan to before sending a non-RFC option.

Of course, just because we think these maintenance operations work for our
scenario does not mean that they immediately work as the "canonical"
maintenance activities. I attempt to demonstrate the value of these jobs as
they are implemented. My perspective is skewed towards very large
repositories (2000+ full-time contributors), but even medium-size
repositories (100-200 full-time contributors) can benefit from these jobs.

I've tried to split the series into logical commits. This includes each
specific maintenance job being completely described in its own commit (plus
maybe an extra one to allow a custom option).

Most commit messages have "RFC QUESTION(S)" for things I was unsure about.

I'm currently testing this locally by setting job.repo in my global config
to be a few important repos on my Linux VM then running 
GIT_TRACE2_PERF=<path> git job-runner --daemonize to launch a background
process that logs the subcommands to <path>.

Here I will call out things that could definitely be improved:

 1. The git job-runner process is not tested AT ALL. It's a bit tricky to
    test it since it shouldn't exit when things work as expected. I expect
    to create a --no-loop option to stop after one iteration of the job
    loop. But I would like a bit more feedback on the concept before jumping
    into those tests. (I do test the git run-job builtin for each job, but
    not the customization through arguments and config.) Perhaps we could do
    some funny business about mocking git using --exec-path to check that it
    is calling git run-job in the correct order (and more importantly, not 
    calling it when certain config settings are present).
    
    
 2. The --daemonize option at the end is shamelessly stolen from git gc
    --daemonize and git daemon, but has limited abilities on some platforms
    (I've tested on Linux and macOS). I have not done my research on how far
    this gets us to allowing users to launch this at startup or something. 
    
    
 3. As I said, this is the "maintenance" "half" of "background maintenance".
    The "background" part is harder in my opinion because it involves
    creating platform-specific ways to consistently launch background
    processes. For example, Unix systems should have one way to service
    start X while Windows has another. macOS has launchd to launch processes
    as users log in, which should be a good way forward. Scalar implements a
    Windows Service that runs as root but impersonates the latest user to
    log in, and it implements a macOS "service" that is running only with
    the current user. I expect to need to create these services myself as a
    follow-up, but I lack the expertise to do it (currently). If someone
    else has experience creating these things and wants to take over or
    advise that half then I would appreciate the help!
    
    
 4. I noticed late in the RFC process that I'm not clearing my argv_arrays
    carefully in the job-runner. This will need to be rectified and
    carefully checked with valgrind before merging this code. While it leaks
    memory very slowly, it will be important that we don't leak any memory
    at all since this is a long-lived process. There's also some places
    where I was careful to not include too much of libgit.a to help keep the
    memory footprint low.
    
    

Thanks, -Stolee

Derrick Stolee (15):
  run-job: create barebones builtin
  run-job: implement commit-graph job
  run-job: implement fetch job
  run-job: implement loose-objects job
  run-job: implement pack-files job
  run-job: auto-size or use custom pack-files batch
  config: add job.pack-files.batchSize option
  job-runner: create builtin for job loop
  job-runner: load repos from config by default
  job-runner: use config to limit job frequency
  job-runner: use config for loop interval
  job-runner: add --interval=<span> option
  job-runner: skip a job if job.<job-name>.enabled is false
  job-runner: add --daemonize option
  runjob: customize the loose-objects batch size

 .gitignore                       |   2 +
 Documentation/config.txt         |   2 +
 Documentation/config/job.txt     |  37 +++
 Documentation/git-job-runner.txt |  63 +++++
 Documentation/git-run-job.txt    | 102 +++++++
 Makefile                         |   2 +
 builtin.h                        |   2 +
 builtin/job-runner.c             | 347 +++++++++++++++++++++++
 builtin/run-job.c                | 458 +++++++++++++++++++++++++++++++
 cache.h                          |   4 +-
 command-list.txt                 |   2 +
 commit-graph.c                   |   2 +-
 commit-graph.h                   |   1 +
 daemon.h                         |   7 +
 git.c                            |   2 +
 midx.c                           |   2 +-
 midx.h                           |   1 +
 t/t7900-run-job.sh               | 137 +++++++++
 18 files changed, 1168 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/config/job.txt
 create mode 100644 Documentation/git-job-runner.txt
 create mode 100644 Documentation/git-run-job.txt
 create mode 100644 builtin/job-runner.c
 create mode 100644 builtin/run-job.c
 create mode 100644 daemon.h
 create mode 100755 t/t7900-run-job.sh


base-commit: 9fadedd637b312089337d73c3ed8447e9f0aa775
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-597%2Fderrickstolee%2Fjobs-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-597/derrickstolee/jobs-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/597
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 01/15] run-job: create barebones builtin
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-05 15:10   ` Phillip Wood
  2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git run-job' command will be used to execute a short-lived set
of maintenance activities by a background job manager. The intention
is to perform small batches of work that reduce the foreground time
taken by repository maintenance such as 'git gc --auto'.

This change does the absolute minimum to create the builtin and show
the usage output.

Provide an explicit warning that this command is experimental. The
set of jobs may change, and each job could alter its behavior in
future versions.

RFC QUESTION: This builtin is based on the background maintenance in
Scalar. Specifically, this builtin is based on the "scalar run <job>"
command [1] [2]. My default thought was to make this a "git run <job>"
command to maximize similarity. However, it seems like "git run" is
too generic. Or, am I being overly verbose for no reason?

[1] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#run-maintenance-in-the-foreground
[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/RunVerb.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                    |  1 +
 Documentation/git-run-job.txt | 36 +++++++++++++++++++++++++++++++++++
 Makefile                      |  1 +
 builtin.h                     |  1 +
 builtin/run-job.c             | 28 +++++++++++++++++++++++++++
 command-list.txt              |  1 +
 git.c                         |  1 +
 t/t7900-run-job.sh            | 15 +++++++++++++++
 8 files changed, 84 insertions(+)
 create mode 100644 Documentation/git-run-job.txt
 create mode 100644 builtin/run-job.c
 create mode 100755 t/t7900-run-job.sh

diff --git a/.gitignore b/.gitignore
index 188bd1c3de1..5dea9d3b96b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -144,6 +144,7 @@
 /git-rev-parse
 /git-revert
 /git-rm
+/git-run-job
 /git-send-email
 /git-send-pack
 /git-serve
diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
new file mode 100644
index 00000000000..0627b3ed259
--- /dev/null
+++ b/Documentation/git-run-job.txt
@@ -0,0 +1,36 @@
+git-run-job(1)
+==============
+
+NAME
+----
+git-run-job - Run a maintenance job. Intended for background operation.
+
+
+SYNOPSIS
+--------
+[verse]
+'git run-job <job-name>'
+
+
+DESCRIPTION
+-----------
+
+Run a maintenance job on the current repository. This is available as a
+command for a few reasons. First, the background job feature can launch
+these commands on a schedule and each process will completely clear its
+memory when complete. Second, an expert user could create their own job
+schedule by running these jobs themselves.
+
+THIS COMMAND IS EXPERIMENTAL. THE SET OF AVAILABLE JOBS OR THEIR EXACT
+BEHAVIOR MAY BE ALTERED IN THE FUTURE.
+
+
+JOBS
+----
+
+TBD
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index ef1ff2228f0..f5f9c4d9e94 100644
--- a/Makefile
+++ b/Makefile
@@ -1125,6 +1125,7 @@ BUILTIN_OBJS += builtin/rev-list.o
 BUILTIN_OBJS += builtin/rev-parse.o
 BUILTIN_OBJS += builtin/revert.o
 BUILTIN_OBJS += builtin/rm.o
+BUILTIN_OBJS += builtin/run-job.o
 BUILTIN_OBJS += builtin/send-pack.o
 BUILTIN_OBJS += builtin/shortlog.o
 BUILTIN_OBJS += builtin/show-branch.o
diff --git a/builtin.h b/builtin.h
index 2b25a80cde3..3e0ddaaf67f 100644
--- a/builtin.h
+++ b/builtin.h
@@ -220,6 +220,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix);
 int cmd_rev_parse(int argc, const char **argv, const char *prefix);
 int cmd_revert(int argc, const char **argv, const char *prefix);
 int cmd_rm(int argc, const char **argv, const char *prefix);
+int cmd_run_job(int argc, const char **argv, const char *prefix);
 int cmd_send_pack(int argc, const char **argv, const char *prefix);
 int cmd_shortlog(int argc, const char **argv, const char *prefix);
 int cmd_show(int argc, const char **argv, const char *prefix);
diff --git a/builtin/run-job.c b/builtin/run-job.c
new file mode 100644
index 00000000000..2c78d053aa4
--- /dev/null
+++ b/builtin/run-job.c
@@ -0,0 +1,28 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_run_job_usage[] = {
+	N_("git run-job"),
+	NULL
+};
+
+int cmd_run_job(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_run_job_options[] = {
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_run_job_usage,
+				   builtin_run_job_options);
+
+	git_config(git_default_config, NULL);
+	argc = parse_options(argc, argv, prefix,
+			     builtin_run_job_options,
+			     builtin_run_job_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+
+	usage_with_options(builtin_run_job_usage,
+			   builtin_run_job_options);
+}
diff --git a/command-list.txt b/command-list.txt
index 20878946558..1cd2b415e46 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -156,6 +156,7 @@ git-revert                              mainporcelain
 git-rev-list                            plumbinginterrogators
 git-rev-parse                           plumbinginterrogators
 git-rm                                  mainporcelain           worktree
+git-run-job                             plumbingmanipulators
 git-send-email                          foreignscminterface             complete
 git-send-pack                           synchingrepositories
 git-shell                               synchelpers
diff --git a/git.c b/git.c
index b07198fe036..db5a43c8687 100644
--- a/git.c
+++ b/git.c
@@ -566,6 +566,7 @@ static struct cmd_struct commands[] = {
 	{ "rev-parse", cmd_rev_parse, NO_PARSEOPT },
 	{ "revert", cmd_revert, RUN_SETUP | NEED_WORK_TREE },
 	{ "rm", cmd_rm, RUN_SETUP },
+	{ "run-job", cmd_run_job, RUN_SETUP },
 	{ "send-pack", cmd_send_pack, RUN_SETUP },
 	{ "shortlog", cmd_shortlog, RUN_SETUP_GENTLY | USE_PAGER },
 	{ "show", cmd_show, RUN_SETUP },
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
new file mode 100755
index 00000000000..1eac80b7ed3
--- /dev/null
+++ b/t/t7900-run-job.sh
@@ -0,0 +1,15 @@
+#!/bin/sh
+
+test_description='git run-job
+
+Testing the background jobs, in the foreground
+'
+
+. ./test-lib.sh
+
+test_expect_success 'help text' '
+	test_must_fail git run-job -h 2>err &&
+	test_i18ngrep "usage: git run-job" err
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 02/15] run-job: implement commit-graph job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-05-20 19:08   ` Josh Steadmon
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The first job to implement in the 'git run-job' command is the
'commit-graph' job. It is based on the sequence of events in the
'commit-graph' job in Scalar [1]. This sequence is as follows:

1. git commit-graph write --reachable --split
2. git commit-graph verify --shallow
3. If the verify succeeds, stop.
4. Delete the commit-graph-chain file.
5. git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes "most"
of the time, with the longer writes following a power law
distribution.

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will verly likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

By using 'git commit-graph verify --shallow' we can ensure that
the file we just wrote is valid. This is an extra safety precaution
that is faster than our 'write' subcommand. In the rare situation
that the newest layer of the commit-graph is corrupt, we can "fix"
the corruption by deleting the commit-graph-chain file and rewrite
the full commit-graph as a new one-layer commit graph. This does
not completely prevent _that_ file from being corrupt, but it does
recompute the commit-graph by parsing commits from the object
database. In our use of this step in Scalar and VFS for Git, we
have only seen this issue arise because our microsoft/git fork
reverted 43d3561 ("commit-graph write: don't die if the existing
graph is corrupt" 2019-03-25) for a while to keep commit-graph
writes very fast. We dropped the revert when updating to v2.23.0.
The verify still has potential for catching corrupt data across
the layer boundary: if the new file has commit X with parent Y
in an old file but the commit ID for Y in the old file had a
bitswap, then we will notice that in the 'verify' command.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs

RFC QUESTIONS:

1. Are links like [1] helpful?

2. Can anyone think of a way to test the rewrite fallback?
   It requires corrupting the latest file between two steps of
   this one command, so that is a tricky spot to inject a fault.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 21 ++++++++++--
 builtin/run-job.c             | 60 ++++++++++++++++++++++++++++++++++-
 commit-graph.c                |  2 +-
 commit-graph.h                |  1 +
 t/t7900-run-job.sh            | 32 +++++++++++++++++++
 5 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 0627b3ed259..8bf2762d650 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job <job-name>'
+'git run-job commit-graph'
 
 
 DESCRIPTION
@@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE.
 JOBS
 ----
 
-TBD
+'commit-graph'::
+
+The `commit-graph` job updates the `commit-graph` files incrementally,
+then verifies that the written data is correct. If the new layer has an
+issue, then the chain file is removed and the `commit-graph` is
+rewritten from scratch.
++
+The verification only checks the top layer of the `commit-graph` chain.
+If the incremental write merged the new commits with at least one
+existing layer, then there is potential for on-disk corruption being
+carried forward into the new file. This will be noticed and the new
+commit-graph file will be clean as Git reparses the commit data from
+the object database.
++
+The incremental write is safe to run alongside concurrent Git processes
+since it will not expire `.graph` files that were in the previous
+`commit-graph-chain` file. They will be deleted by a later run based on
+the expiration delay.
 
 
 GIT
diff --git a/builtin/run-job.c b/builtin/run-job.c
index 2c78d053aa4..dd7709952d3 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -1,12 +1,65 @@
 #include "builtin.h"
 #include "config.h"
+#include "commit-graph.h"
+#include "object-store.h"
 #include "parse-options.h"
+#include "repository.h"
+#include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job"),
+	N_("git run-job commit-graph"),
 	NULL
 };
 
+static int run_write_commit_graph(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "commit-graph", "write",
+			 "--split", "--reachable", "--no-progress",
+			 NULL);
+
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int run_verify_commit_graph(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "commit-graph", "verify",
+			 "--shallow", "--no-progress", NULL);
+
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int run_commit_graph_job(void)
+{
+	char *chain_path;
+
+	if (run_write_commit_graph()) {
+		error(_("failed to write commit-graph"));
+		return 1;
+	}
+
+	if (!run_verify_commit_graph())
+		return 0;
+
+	warning(_("commit-graph verify caught error, rewriting"));
+
+	chain_path = get_chain_filename(the_repository->objects->odb);
+	if (unlink(chain_path)) {
+		UNLEAK(chain_path);
+		die(_("failed to remove commit-graph at %s"), chain_path);
+	}
+	free(chain_path);
+
+	if (!run_write_commit_graph())
+		return 0;
+
+	error(_("failed to rewrite commit-graph"));
+	return 1;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 			     builtin_run_job_usage,
 			     PARSE_OPT_KEEP_UNKNOWN);
 
+	if (argc > 0) {
+		if (!strcmp(argv[0], "commit-graph"))
+			return run_commit_graph_job();
+	}
+
 	usage_with_options(builtin_run_job_usage,
 			   builtin_run_job_options);
 }
diff --git a/commit-graph.c b/commit-graph.c
index f013a84e294..6867f92d04a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
 		       oid_hex);
 }
 
-static char *get_chain_filename(struct object_directory *odb)
+char *get_chain_filename(struct object_directory *odb)
 {
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
diff --git a/commit-graph.h b/commit-graph.h
index e87a6f63600..8b7bd5a6cb1 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -13,6 +13,7 @@
 struct commit;
 
 char *get_commit_graph_filename(struct object_directory *odb);
+char *get_chain_filename(struct object_directory *odb);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
 
 /*
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 1eac80b7ed3..18b9bd26b3a 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -5,6 +5,8 @@ test_description='git run-job
 Testing the background jobs, in the foreground
 '
 
+GIT_TEST_COMMIT_GRAPH=0
+
 . ./test-lib.sh
 
 test_expect_success 'help text' '
@@ -12,4 +14,34 @@ test_expect_success 'help text' '
 	test_i18ngrep "usage: git run-job" err
 '
 
+test_expect_success 'commit-graph job' '
+	git init server &&
+	(
+		cd server &&
+		chain=.git/objects/info/commit-graphs/commit-graph-chain &&
+		git checkout -b master &&
+		for i in $(test_seq 1 15)
+		do
+			test_commit $i || return 1
+		done &&
+		git run-job commit-graph 2>../err &&
+		test_must_be_empty ../err &&
+		test_line_count = 1 $chain &&
+		for i in $(test_seq 16 19)
+		do
+			test_commit $i || return 1
+		done &&
+		git run-job commit-graph 2>../err &&
+		test_must_be_empty ../err &&
+		test_line_count = 2 $chain &&
+		for i in $(test_seq 20 23)
+		do
+			test_commit $i || return 1
+		done &&
+		git run-job commit-graph 2>../err &&
+		test_must_be_empty ../err &&
+		test_line_count = 1 $chain
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-05 15:14   ` Phillip Wood
                     ` (2 more replies)
  2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  15 siblings, 3 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foregroudn 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.

3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/hidden/<remote> refs that no
   longer appear on the remote.

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paried with --no-show-forced-udpates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

RFC QUESTIONS:

1. One downside of the refs/hidden pattern is that 'git log' will
   decorate commits with twice as many refs if they appear at a
   remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
   there an easy way to exclude a refspace from decorations? Should
   we make refs/hidden/* a "special" refspace that is excluded from
   decorations?

2. This feature is designed for a desktop machine or equivalent
   that has a permanent wired network connection, and the machine
   stays on while the user is not present. For things like laptops
   with inconsistent WiFi connections (that may be metered) the
   feature can use the less stable connection more than the user
   wants. Of course, this feature is opt-in for Git, but in Scalar
   we have a "scalar pause" command [2] that pauses all maintenance
   for some amount of time. We should consider a similar mechanism
   for Git, but for the point of this series the user needs to set
   up the "background" part of these jobs manually.

[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
[3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 13 ++++-
 builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
 t/t7900-run-job.sh            | 22 +++++++++
 3 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 8bf2762d650..eb92e891915 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job commit-graph'
+'git run-job (commit-graph|fetch)'
 
 
 DESCRIPTION
@@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
 `commit-graph-chain` file. They will be deleted by a later run based on
 the expiration delay.
 
+'fetch'::
+
+The `fetch` job updates the object directory with the latest objects
+from all registered remotes. For each remote, a `git fetch` command is
+run. The refmap is custom to avoid updating local or remote branches
+(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
+stored in `refs/hidden/<remote>/`. Also, no tags are updated.
++
+This means that foreground fetches are still required to update the
+remote refs, but the users is notified when the branches and tags are
+updated on the remote.
 
 GIT
 ---
diff --git a/builtin/run-job.c b/builtin/run-job.c
index dd7709952d3..e59056b2918 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -7,7 +7,7 @@
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job commit-graph"),
+	N_("git run-job (commit-graph|fetch)"),
 	NULL
 };
 
@@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
 	return 1;
 }
 
+static int fetch_remote(const char *remote)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf refmap = STRBUF_INIT;
+
+	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
+			 "--no-tags", "--refmap=", NULL);
+
+	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
+	argv_array_push(&cmd, refmap.buf);
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	strbuf_release(&refmap);
+	return result;
+}
+
+static int fill_remotes(struct string_list *remotes)
+{
+	int result = 0;
+	FILE *proc_out;
+	struct strbuf line = STRBUF_INIT;
+	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
+
+	child_process_init(remote_proc);
+
+	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
+
+	remote_proc->out = -1;
+
+	if (start_command(remote_proc)) {
+		error(_("failed to start 'git remote' process"));
+		result = 1;
+		goto cleanup;
+	}
+
+	proc_out = xfdopen(remote_proc->out, "r");
+
+	/* if there is no line, leave the value as given */
+	while (!strbuf_getline(&line, proc_out))
+		string_list_append(remotes, line.buf);
+
+	strbuf_release(&line);
+
+	fclose(proc_out);
+
+	if (finish_command(remote_proc)) {
+		error(_("failed to finish 'git remote' process"));
+		result = 1;
+	}
+
+cleanup:
+	free(remote_proc);
+	return result;
+}
+
+static int run_fetch_job(void)
+{
+	int result = 0;
+	struct string_list_item *item;
+	struct string_list remotes = STRING_LIST_INIT_DUP;
+
+	if (fill_remotes(&remotes)) {
+		error(_("failed to fill remotes"));
+		result = 1;
+		goto cleanup;
+	}
+
+	/*
+	 * Do not modify the result based on the success of the 'fetch'
+	 * operation, as a loss of network could cause 'fetch' to fail
+	 * quickly. We do not want that to stop the rest of our
+	 * background operations.
+	 */
+	for (item = remotes.items;
+	     item && item < remotes.items + remotes.nr;
+	     item++)
+		fetch_remote(item->string);
+
+cleanup:
+	string_list_clear(&remotes, 0);
+	return result;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 	if (argc > 0) {
 		if (!strcmp(argv[0], "commit-graph"))
 			return run_commit_graph_job();
+		if (!strcmp(argv[0], "fetch"))
+			return run_fetch_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 18b9bd26b3a..d3faeba135b 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
 	)
 '
 
+test_expect_success 'fetch job' '
+	git clone "file://$(pwd)/server" client &&
+
+	# Before fetching, build a client commit-graph
+	git -C client run-job commit-graph &&
+	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
+	test_line_count = 1 $chain &&
+
+	git -C client branch -v --remotes >before-refs &&
+	test_commit -C server 24 &&
+
+	git -C client run-job fetch &&
+	git -C client branch -v --remotes >after-refs &&
+	test_cmp before-refs after-refs &&
+	test_cmp server/.git/refs/heads/master \
+		 client/.git/refs/hidden/origin/master &&
+
+	# the hidden ref should trigger a new layer in the commit-graph
+	git -C client run-job commit-graph &&
+	test_line_count = 2 $chain
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 04/15] run-job: implement loose-objects job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-05 20:33   ` Junio C Hamano
  2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' job for the 'git run-job' command. This
helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file. This has potential of
   happening in partial clones if someone runs 'git grep' or
   otherwise evades the batch-download feature for requesting
   promisor objects.

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 14 ++++-
 builtin/run-job.c             | 97 ++++++++++++++++++++++++++++++++++-
 t/t7900-run-job.sh            | 27 ++++++++++
 3 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index eb92e891915..43ca1160b5a 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch)'
+'git run-job (commit-graph|fetch|loose-objects)'
 
 
 DESCRIPTION
@@ -59,6 +59,18 @@ This means that foreground fetches are still required to update the
 remote refs, but the users is notified when the branches and tags are
 updated on the remote.
 
+'loose-objects'::
+
+The `loose-objects` job cleans up loose objects and places them into
+pack-files. In order to prevent race conditions with concurrent Git
+commands, it follows a two-step process. First, it deletes any loose
+objects that already exist in a pack-file; concurrent Git processes will
+examine the pack-file for the object data instead of the loose object.
+Second, it creates a new pack-file (starting with "loose-") containing
+a batch of loose objects. The batch size is limited to 50 thousand
+objects to prevent the job from taking too long on a repository with
+many loose objects.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/run-job.c b/builtin/run-job.c
index e59056b2918..cecf9058c51 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -7,7 +7,7 @@
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch)"),
+	N_("git run-job (commit-graph|fetch|loose-objects)"),
 	NULL
 };
 
@@ -145,6 +145,99 @@ static int run_fetch_job(void)
 	return result;
 }
 
+static int prune_packed(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "prune-packed", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+struct write_loose_object_data {
+	FILE *in;
+	int count;
+	int batch_size;
+};
+
+static int loose_object_exists(const struct object_id *oid,
+			       const char *path,
+			       void *data)
+{
+	return 1;
+}
+
+static int write_loose_object_to_stdin(const struct object_id *oid,
+				       const char *path,
+				       void *data)
+{
+	struct write_loose_object_data *d = (struct write_loose_object_data *)data;
+
+	fprintf(d->in, "%s\n", oid_to_hex(oid));
+
+	return ++(d->count) > d->batch_size;
+}
+
+static int pack_loose(void)
+{
+	int result = 0;
+	struct write_loose_object_data data;
+	struct strbuf prefix = STRBUF_INIT;
+	struct child_process *pack_proc;
+
+	/*
+	 * Do not start pack-objects process
+	 * if there are no loose objects.
+	 */
+	if (!for_each_loose_file_in_objdir(the_repository->objects->odb->path,
+					   loose_object_exists,
+					   NULL, NULL, NULL))
+		return 0;
+
+	pack_proc = xmalloc(sizeof(*pack_proc));
+
+	child_process_init(pack_proc);
+
+	strbuf_addstr(&prefix, the_repository->objects->odb->path);
+	strbuf_addstr(&prefix, "/pack/loose");
+
+	argv_array_pushl(&pack_proc->args, "git", "pack-objects",
+			 "--quiet", prefix.buf, NULL);
+
+	pack_proc->in = -1;
+
+	if (start_command(pack_proc)) {
+		error(_("failed to start 'git pack-objects' process"));
+		result = 1;
+		goto cleanup;
+	}
+
+	data.in = xfdopen(pack_proc->in, "w");
+	data.count = 0;
+	data.batch_size = 50000;
+
+	for_each_loose_file_in_objdir(the_repository->objects->odb->path,
+				      write_loose_object_to_stdin,
+				      NULL,
+				      NULL,
+				      &data);
+
+	fclose(data.in);
+
+	if (finish_command(pack_proc)) {
+		error(_("failed to finish 'git pack-objects' process"));
+		result = 1;
+	}
+
+cleanup:
+	strbuf_release(&prefix);
+	free(pack_proc);
+	return result;
+}
+
+static int run_loose_objects_job(void)
+{
+	return prune_packed() || pack_loose();
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -166,6 +259,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 			return run_commit_graph_job();
 		if (!strcmp(argv[0], "fetch"))
 			return run_fetch_job();
+		if (!strcmp(argv[0], "loose-objects"))
+			return run_loose_objects_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index d3faeba135b..41da083257b 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -66,4 +66,31 @@ test_expect_success 'fetch job' '
 	test_line_count = 2 $chain
 '
 
+test_expect_success 'loose-objects job' '
+	ls client/.git/objects >obj-dir-before &&
+	test_file_not_empty obj-dir-before &&
+	ls client/.git/objects/pack/*.pack >packs-before &&
+	test_line_count = 1 packs-before &&
+
+	# The first run creates a pack-file
+	# but does not delete loose objects.
+	git -C client run-job loose-objects &&
+	ls client/.git/objects >obj-dir-between &&
+	test_cmp obj-dir-before obj-dir-between &&
+	ls client/.git/objects/pack/*.pack >packs-between &&
+	test_line_count = 2 packs-between &&
+
+	# The second run deletes loose objects
+	# but does not create a pack-file.
+	git -C client run-job loose-objects &&
+	ls client/.git/objects >obj-dir-after &&
+	cat >expect <<-\EOF &&
+	info
+	pack
+	EOF
+	test_cmp expect obj-dir-after &&
+	ls client/.git/objects/pack/*.pack >packs-after &&
+	test_cmp packs-between packs-after
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 05/15] run-job: implement pack-files job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-05-27 22:17   ` Josh Steadmon
  2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' that can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updaes the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those objects. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximats how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the pack-files job will delete these repacked
   pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

Each of the above steps update the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3392 (midx: add packs to
packed_git linked list, 2019-04-29).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 18 ++++++-
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
 midx.c                        |  2 +-
 midx.h                        |  1 +
 t/t7900-run-job.sh            | 39 +++++++++++++++
 5 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 43ca1160b5a..108ed25b8bd 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files)'
 
 
 DESCRIPTION
@@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
 objects to prevent the job from taking too long on a repository with
 many loose objects.
 
+'pack-files'::
+
+The `pack-files` job incrementally repacks the object directory using
+the `multi-pack-index` feature. In order to prevent race conditions with
+concurrent Git commands, it follows a two-step process. First, it
+deletes any pack-files included in the `multi-pack-index` where none of
+the objects in the `multi-pack-index` reference those pack-files; this
+only happens if all objects in the pack-file are also stored in a newer
+pack-file. Second, it selects a group of pack-files whose "expected
+size" is below the batch size until the group has total expected size at
+least the batch size; see the `--batch-size` option for the `repack`
+subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
+zero, which is a special case that attempts to repack all pack-files
+into a single pack-file.
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/run-job.c b/builtin/run-job.c
index cecf9058c51..d3543f7ccb9 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -1,13 +1,14 @@
 #include "builtin.h"
 #include "config.h"
 #include "commit-graph.h"
+#include "midx.h"
 #include "object-store.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
 	NULL
 };
 
@@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
 	return prune_packed() || pack_loose();
 }
 
+static int multi_pack_index_write(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "write",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int rewrite_multi_pack_index(void)
+{
+	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	if (multi_pack_index_write()) {
+		error(_("failed to rewrite multi-pack-index"));
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_verify(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "verify",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_expire(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "expire",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_repack(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "repack",
+			 "--no-progress", "--batch-size=0", NULL);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	if (result && multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after repack"));
+		result = rewrite_multi_pack_index();
+	}
+
+	return result;
+}
+
+static int run_pack_files_job(void)
+{
+	if (multi_pack_index_write()) {
+		error(_("failed to write multi-pack-index"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after initial write"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_expire()) {
+		error(_("multi-pack-index expire failed"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after expire"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_repack()) {
+		error(_("multi-pack-index repack failed"));
+		return 1;
+	}
+
+	return 0;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 			return run_fetch_job();
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
+		if (!strcmp(argv[0], "pack-files"))
+			return run_pack_files_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/midx.c b/midx.c
index 1527e464a7b..0f0d0a38812 100644
--- a/midx.c
+++ b/midx.c
@@ -36,7 +36,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index e6fa356b5ca..cf2c09dffc2 100644
--- a/midx.h
+++ b/midx.h
@@ -39,6 +39,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 41da083257b..416ba04989d 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
 '
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 . ./test-lib.sh
 
@@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'pack-files job' '
+	packDir=.git/objects/pack &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+
+	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+
+	rm -f client/$packDir/pack-* &&
+	rm -f client/$packDir/loose-* &&
+
+	ls client/$packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git -C client run-job pack-files &&
+	ls client/$packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git -C client run-job pack-files &&
+	ls client/.git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 06/15] run-job: auto-size or use custom pack-files batch
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When repacking during the 'pack-files' job, we use the --batch-size
option in 'git multi-pack-index repack'. The initial setting used
--batch-size=0 to repack everything into a single pack-file. This
is not sustaintable for a large repository. The amount of work
required is also likely to use too many system resources for a
background job.

Update the 'git run-job pack-files' command by allowing a direct
--batch-size option that can change the value provided. Update the
default of "0" to a computed size based on the existing pack-files.
While computing that new size, count the existing pack-files and
skip the repack step if there are at most two pack-files.

The dynamic default size is computed with this idea in mind for
a client repository that was cloned from a very large remote: there
is likely one "big" pack-file that was created at clone time. Thus,
do not try repacking it as it is likely packed efficiently by the
server. Instead, try packing the other pack-files into a single
pack-file.

The size is then computed as follows:

	batch size = total size - max pack size

The batch size is then also limited to be at most two gigabytes.
This serves two purposes. First, having a limit prevents doing too
much work when the repository is extremely large. Pack-files
larger than two gigabytes are likely to either contain large blobs
or have been carefully repacked by a previous repack operation.
Second, two gigabytes is the size limit for a signed 32-bit int.
It's a good limit to consider, and to keep it far away from the
unsigned 32-bit int limit. This limit comes to mind because on
Windows an "unsigned long" is 32 bits and OPT_MAGNITUDE() uses
unsigned longs for its parsing logic.

This calculation mimics a similar calculation in Scalar [1],
except for a 3% drop in the calculated batch size due to the
round-off error that can happen with the "expected size"
calculation for a pack-file.

[1] https://github.com/microsoft/scalar/blob/616e9b16dd120b8fdb652d6d5a55618c731a8aea/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L141-L143

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 13 +++--
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++---
 t/t7900-run-job.sh            |  6 ++-
 3 files changed, 96 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 108ed25b8bd..cdd6417f7c9 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects|pack-files)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files) [<options>]'
 
 
 DESCRIPTION
@@ -82,9 +82,14 @@ only happens if all objects in the pack-file are also stored in a newer
 pack-file. Second, it selects a group of pack-files whose "expected
 size" is below the batch size until the group has total expected size at
 least the batch size; see the `--batch-size` option for the `repack`
-subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
-zero, which is a special case that attempts to repack all pack-files
-into a single pack-file.
+subcommand in linkgit:git-multi-pack-index[1].
++
+The default batch size is computed to optimize for having a single large
+pack-file and many small pack-files. When there are two or fewer
+pack-files, the job does not attempt to repack. Otherwise, the batch
+size is the sum of all pack-file sizes minus the largest pack-file size.
+The batch size is capped at two gigabytes. This intends to pack all
+small pack-files into a single pack-file.
 
 
 GIT
diff --git a/builtin/run-job.c b/builtin/run-job.c
index d3543f7ccb9..2ccc3bbae2d 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -3,12 +3,18 @@
 #include "commit-graph.h"
 #include "midx.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files) [<options>]"),
+	NULL
+};
+
+static char const * const builtin_run_job_pack_file_usage[] = {
+	N_("git run-job pack-files [--batch-size=<size>]"),
 	NULL
 };
 
@@ -278,15 +284,74 @@ static int multi_pack_index_expire(void)
 	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
 }
 
-static int multi_pack_index_repack(void)
+#define TWO_GIGABYTES (2147483647)
+
+static off_t get_auto_pack_size(int *count)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly. Find the sum of the sizes
+	 * other than the largest pack-file, then use
+	 * that as the batch size.
+	 */
+	off_t total_size = 0;
+	off_t max_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+
+	*count = 0;
+
+	reprepare_packed_git(the_repository);
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		(*count)++;
+		total_size += p->pack_size;
+
+		if (p->pack_size > max_size)
+			max_size = p->pack_size;
+	}
+
+	result_size = total_size - max_size;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+static int multi_pack_index_repack(unsigned long batch_size)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+	int count;
+	off_t default_size = get_auto_pack_size(&count);
+
+	if (count <= 2)
+		return 0;
+
+	strbuf_addstr(&batch_arg, "--batch-size=");
+
+	if (batch_size != UNSET_BATCH_SIZE)
+		strbuf_addf(&batch_arg, "\"%"PRIuMAX"\"", (uintmax_t)batch_size);
+	else
+		strbuf_addf(&batch_arg, "%"PRIuMAX,
+			    (uintmax_t)default_size);
+
 	argv_array_pushl(&cmd, "multi-pack-index", "repack",
-			 "--no-progress", "--batch-size=0", NULL);
+			 "--no-progress", batch_arg.buf, NULL);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
 
-	if (result && multi_pack_index_verify()) {
+	strbuf_release(&batch_arg);
+
+	/*
+	 * Verify here to avoid verifying again when there are two
+	 * or fewer pack-files.
+	 */
+	if (!result && multi_pack_index_verify()) {
 		warning(_("multi-pack-index verify failed after repack"));
 		result = rewrite_multi_pack_index();
 	}
@@ -294,8 +359,19 @@ static int multi_pack_index_repack(void)
 	return result;
 }
 
-static int run_pack_files_job(void)
+static int run_pack_files_job(int argc, const char **argv)
 {
+	static unsigned long batch_size = UNSET_BATCH_SIZE;
+	static struct option builtin_run_job_pack_file_options[] = {
+		OPT_MAGNITUDE(0, "batch-size", &batch_size,
+			      N_("specify a batch-size for the incremental repack")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_run_job_pack_file_options,
+			     builtin_run_job_pack_file_usage, 0);
+
 	if (multi_pack_index_write()) {
 		error(_("failed to write multi-pack-index"));
 		return 1;
@@ -316,7 +392,7 @@ static int run_pack_files_job(void)
 		return rewrite_multi_pack_index();
 	}
 
-	if (multi_pack_index_repack()) {
+	if (multi_pack_index_repack(batch_size)) {
 		error(_("multi-pack-index repack failed"));
 		return 1;
 	}
@@ -348,7 +424,7 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
 		if (!strcmp(argv[0], "pack-files"))
-			return run_pack_files_job();
+			return run_pack_files_job(argc, argv);
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 416ba04989d..2d9f6cdf328 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -128,8 +128,10 @@ test_expect_success 'pack-files job' '
 	# the job deletes the two old packs, and does not write
 	# a new one because only one pack remains.
 	git -C client run-job pack-files &&
-	ls client/.git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	ls client/$packDir/*.pack >packs-after &&
+	test_line_count = 2 packs-after &&
+	cat packs-after | grep "pack/test-1-" &&
+	cat packs-after | grep "pack/pack-"
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 07/15] config: add job.pack-files.batchSize option
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change allowed a user to specify a --batch-size=<size>
option to 'git run-job pack-files'. However, when we eventually
launch these jobs on a schedule, we want users to be able to change
this value through config options.

The new "job.pack-files.batchSize" option will override the default
dynamic batch-size calculation, but will be overridden by the
--batch-size=<size> argument.

This is the first config option of the type
"job.<job-name>.<option>" but is not the last.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt     | 2 ++
 Documentation/config/job.txt | 6 ++++++
 builtin/run-job.c            | 7 ++++++-
 3 files changed, 14 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/config/job.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 2450589a0ed..c4c5fa99e6b 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -389,6 +389,8 @@ include::config/instaweb.txt[]
 
 include::config/interactive.txt[]
 
+include::config/job.txt[]
+
 include::config/log.txt[]
 
 include::config/mailinfo.txt[]
diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
new file mode 100644
index 00000000000..efdb76afad3
--- /dev/null
+++ b/Documentation/config/job.txt
@@ -0,0 +1,6 @@
+job.pack-files.batchSize::
+	This string value `<size>` will be passed to the
+	`git multi-pack-index repack --batch-size=<size>` command as
+	part of `git run-job pack-files`. If not specified, then a
+	dynamic size calculation is run. See linkgit:git-run-job[1]
+	for more details.
diff --git a/builtin/run-job.c b/builtin/run-job.c
index 2ccc3bbae2d..76765535e09 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -327,6 +327,7 @@ static int multi_pack_index_repack(unsigned long batch_size)
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
 	struct strbuf batch_arg = STRBUF_INIT;
+	const char *config_value;
 	int count;
 	off_t default_size = get_auto_pack_size(&count);
 
@@ -336,7 +337,11 @@ static int multi_pack_index_repack(unsigned long batch_size)
 	strbuf_addstr(&batch_arg, "--batch-size=");
 
 	if (batch_size != UNSET_BATCH_SIZE)
-		strbuf_addf(&batch_arg, "\"%"PRIuMAX"\"", (uintmax_t)batch_size);
+		strbuf_addf(&batch_arg, "\"%"PRIuMAX"\"", (uintmax_t) batch_size);
+	else if (!repo_config_get_string_const(the_repository,
+					       "job.pack-file.batchsize",
+					       &config_value))
+		strbuf_addf(&batch_arg, "\"%s\"", config_value);
 	else
 		strbuf_addf(&batch_arg, "%"PRIuMAX,
 			    (uintmax_t)default_size);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 08/15] job-runner: create builtin for job loop
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Given the existing 'git run-job <job-name>' builtin, users _could_
construct their own scheduling mechanism for running jobs in the
background. However, it is much easier to have a dedicated process
that manages all jobs across multiple repos.

The 'git job-runner' builtin is specifically built to handle this
scenario. It will be customized further in later changes, but for
now it does the following:

* Given a list of '--repo=<path>' arguments, construct a list of
  repositories to manage with jobs.

* Every 30 minutes, iterate over all jobs and all repos to run

	git -C <repo> run-job <job-name>

This builtin needs to be careful about how much of the Git internals
it consumes. The intention is that this is a long-lived process that
could be launched upon login and only closed on logout. For that
reason, we will avoid instantiating any object store or index data.

Run the maintenance jobs by running subcommands. We will update how
we enable or disable these jobs and separate their runs in a later
change.

RFC QUESTIONS:

1. The hardest part of this builtin is "how do we test it?" In
   Scalar, we can unit test the scheduler with mocks. What is the
   equivalent here for "make sure 'git job-runner' runs 'git run-job
   pack-files' on repo X? I expect to add a "--no-loop" option that
   ensures the logic only runs one iteration of the loop.

2. The difference between 'git run-job' and 'git job-runner' is
   subtle and probably confusing. Are there better names?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                       |   1 +
 Documentation/git-job-runner.txt |  52 +++++++++++++++
 Makefile                         |   1 +
 builtin.h                        |   1 +
 builtin/job-runner.c             | 111 +++++++++++++++++++++++++++++++
 command-list.txt                 |   1 +
 git.c                            |   1 +
 7 files changed, 168 insertions(+)
 create mode 100644 Documentation/git-job-runner.txt
 create mode 100644 builtin/job-runner.c

diff --git a/.gitignore b/.gitignore
index 5dea9d3b96b..24c377b2883 100644
--- a/.gitignore
+++ b/.gitignore
@@ -83,6 +83,7 @@
 /git-init-db
 /git-interpret-trailers
 /git-instaweb
+/git-job-runner
 /git-log
 /git-ls-files
 /git-ls-remote
diff --git a/Documentation/git-job-runner.txt b/Documentation/git-job-runner.txt
new file mode 100644
index 00000000000..010b2d05f9b
--- /dev/null
+++ b/Documentation/git-job-runner.txt
@@ -0,0 +1,52 @@
+git-job-runner(1)
+=================
+
+NAME
+----
+git-job-runner - run jobs on multiple repos according to a schedlue.
+Intended for background operation.
+
+
+SYNOPSIS
+--------
+[verse]
+'git job-runner [--repo=<path>]'
+
+
+DESCRIPTION
+-----------
+
+Run jobs in a loop with some frequency. Intended for running in the
+background.
+
+The `git run-job <job-name>` command runs a specific maintenance task.
+The `job-runner` command is a long-running process that calls the
+`run-job` command on a set of repositories at some frequency. The
+`job.*` config values customize the frequencies of these jobs.
+
+
+OPTIONS
+-------
+
+--repo=<dir>::
+	If at least one `--repo` option is provided, the runner only
+	attempts running jobs on repositories located at the provided
+	`<dir>` values. This option can be specified multiple times.
+
+
+CONFIGURATION
+-------------
+
+The `git job-runner` command is intended to run as a long-running
+process. The following config options are checked periodically during
+the process and will modify its behavior:
+
+The below documentation is the same as what's found in
+linkgit:git-config[1]:
+
+include::config/job.txt[]
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index f5f9c4d9e94..ee98a91af0f 100644
--- a/Makefile
+++ b/Makefile
@@ -1082,6 +1082,7 @@ BUILTIN_OBJS += builtin/help.o
 BUILTIN_OBJS += builtin/index-pack.o
 BUILTIN_OBJS += builtin/init-db.o
 BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/job-runner.o
 BUILTIN_OBJS += builtin/log.o
 BUILTIN_OBJS += builtin/ls-files.o
 BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index 3e0ddaaf67f..5008c7096b3 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_help(int argc, const char **argv, const char *prefix);
 int cmd_index_pack(int argc, const char **argv, const char *prefix);
 int cmd_init_db(int argc, const char **argv, const char *prefix);
 int cmd_interpret_trailers(int argc, const char **argv, const char *prefix);
+int cmd_job_runner(int argc, const char **argv, const char *prefix);
 int cmd_log(int argc, const char **argv, const char *prefix);
 int cmd_log_reflog(int argc, const char **argv, const char *prefix);
 int cmd_ls_files(int argc, const char **argv, const char *prefix);
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
new file mode 100644
index 00000000000..135288bcaae
--- /dev/null
+++ b/builtin/job-runner.c
@@ -0,0 +1,111 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+#include "run-command.h"
+#include "string-list.h"
+
+static char const * const builtin_job_runner_usage[] = {
+	N_("git job-runner [<options>]"),
+	NULL
+};
+
+static struct string_list arg_repos = STRING_LIST_INIT_DUP;
+
+static int arg_repos_append(const struct option *opt,
+			    const char *arg, int unset)
+{
+	string_list_append(&arg_repos, arg);
+	return 0;
+}
+
+static int load_active_repos(struct string_list *repos)
+{
+	if (arg_repos.nr) {
+		struct string_list_item *item;
+		for (item = arg_repos.items;
+		     item && item < arg_repos.items + arg_repos.nr;
+		     item++)
+			string_list_append(repos, item->string);
+		return 0;
+	}
+
+	return 0;
+}
+
+static int run_job(const char *job, const char *repo)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "-C", repo, "run-job", job, NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int run_job_loop_step(struct string_list *list)
+{
+	int result = 0;
+	struct string_list_item *job;
+	struct string_list repos = STRING_LIST_INIT_DUP;
+
+	if ((result = load_active_repos(&repos)))
+		return result;
+
+	for (job = list->items;
+	     !result && job && job < list->items + list->nr;
+	     job++) {
+		struct string_list_item *repo;
+		for (repo = repos.items;
+		     !result && repo && repo < repos.items + repos.nr;
+		     repo++)
+			result = run_job(job->string,
+					 repo->string);
+	}
+
+	string_list_clear(&repos, 0);
+	return result;
+}
+
+static unsigned int get_loop_interval(void)
+{
+	/* Default: 30 minutes */
+	return 30 * 60;
+}
+
+static int initialize_jobs(struct string_list *list)
+{
+	string_list_append(list, "commit-graph");
+	string_list_append(list, "fetch");
+	string_list_append(list, "loose-objects");
+	string_list_append(list, "pack-files");
+	return 0;
+}
+
+int cmd_job_runner(int argc, const char **argv, const char *prefix)
+{
+	int result;
+	struct string_list job_list = STRING_LIST_INIT_DUP;
+	static struct option builtin_job_runner_options[] = {
+		OPT_CALLBACK_F(0, "repo",
+			       NULL,
+			       N_("<path>"),
+			       N_("run jobs on the repository at <path>"),
+			       PARSE_OPT_NONEG, arg_repos_append),
+		OPT_END(),
+	};
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_job_runner_usage,
+				   builtin_job_runner_options);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_job_runner_options,
+			     builtin_job_runner_usage,
+			     0);
+
+	result = initialize_jobs(&job_list);
+
+	while (!(result = run_job_loop_step(&job_list))) {
+		unsigned int interval = get_loop_interval();
+		sleep(interval);
+	}
+
+	return result;
+}
diff --git a/command-list.txt b/command-list.txt
index 1cd2b415e46..cb835c74693 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -110,6 +110,7 @@ git-init                                mainporcelain           init
 git-instaweb                            ancillaryinterrogators          complete
 git-interpret-trailers                  purehelpers
 gitk                                    mainporcelain
+git-job-runner                          plumbingmanipulators
 git-log                                 mainporcelain           info
 git-ls-files                            plumbinginterrogators
 git-ls-remote                           plumbinginterrogators
diff --git a/git.c b/git.c
index db5a43c8687..3a2da2c232f 100644
--- a/git.c
+++ b/git.c
@@ -517,6 +517,7 @@ static struct cmd_struct commands[] = {
 	{ "init", cmd_init_db },
 	{ "init-db", cmd_init_db },
 	{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+	{ "job-runner", cmd_job_runner, RUN_SETUP_GENTLY },
 	{ "log", cmd_log, RUN_SETUP },
 	{ "ls-files", cmd_ls_files, RUN_SETUP },
 	{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 09/15] job-runner: load repos from config by default
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-05 15:18   ` Phillip Wood
  2020-04-05 15:41   ` Phillip Wood
  2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git job-runner' command was introduced with a '--repo=<path>'
option so a user could start a process with an explicit list of
repos. However, it is more likely that we will want 'git
job-runner' to start without any arguments and discover the set of
repos from another source.

A natural source is the Git config. The 'git job-runner' does not
need to run in a Git repository, but the config could be located in
the global or system config.

Create the job.repo config setting as a multi-valued config option.

Read all values for job.repo whenever triggering an iteration of
the job loop. This allows a user to add or remove repos dynamically
without restarting the job-runner.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/job.txt |  7 +++++++
 builtin/job-runner.c         | 20 ++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
index efdb76afad3..2dfed3935fa 100644
--- a/Documentation/config/job.txt
+++ b/Documentation/config/job.txt
@@ -4,3 +4,10 @@ job.pack-files.batchSize::
 	part of `git run-job pack-files`. If not specified, then a
 	dynamic size calculation is run. See linkgit:git-run-job[1]
 	for more details.
+
+job.repo::
+	Store an absolute path to a repository that wants background
+	jobs run by `git job-runner`. This is a multi-valued config
+	option, but it must be stored in a config file seen by the
+	background runner. This may be the global or system config
+	depending on how `git job-runner` is launched on your system.
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
index 135288bcaae..bed401f94bf 100644
--- a/builtin/job-runner.c
+++ b/builtin/job-runner.c
@@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
 
 static int load_active_repos(struct string_list *repos)
 {
+	struct string_list_item *item;
+	const struct string_list *config_repos;
+
 	if (arg_repos.nr) {
 		struct string_list_item *item;
 		for (item = arg_repos.items;
@@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
 		return 0;
 	}
 
+	config_repos = git_config_get_value_multi("job.repo");
+
+	for (item = config_repos->items;
+	     item && item < config_repos->items + config_repos->nr;
+	     item++) {
+		DIR *dir = opendir(item->string);
+
+		if (!dir)
+			continue;
+
+		closedir(dir);
+		string_list_append(repos, item->string);
+	}
+
+	string_list_sort(repos);
+	string_list_remove_duplicates(repos, 0);
+
 	return 0;
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 10/15] job-runner: use config to limit job frequency
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-05 15:24   ` Phillip Wood
  2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We want to run different maintenance tasks at different rates. That
means we cannot rely only on the time between job-runner loop pauses
to reduce the frequency of specific jobs. This means we need to
persist a timestamp for the previous run somewhere. A natural place
is the Git config file for that repo.

Create the job.<job-namme>.lastRun config option to store the
timestamp of the previous run of that job. Set this config option
after a successful run of 'git -C <repo> run-job <job-name>'.

To maximize code reuse, we dynamically construct the config key
and parse the config value into a timestamp in a generic way. This
makes the introduction of another config option extremely trivial:

The job.<job-name>.interval option allows a user to specify a
minimum number of seconds between two calls to 'git run-job
<job-name>' on a given repo. This could be stored in the global
or system config to provide an update on the default for all repos,
or could be updated on a per-repo basis. This is checked on every
iteration of the job loop, so a user could update this and see the
effect without restarting the job-runner process.

RFC QUESTION: I'm using a 'git -C <repo> config <option>' process
to test if enough time has elapsed before running the 'run-job'
process. These 'config' processes are pretty light-weight, so
hopefully they are not too noisy. An alternative would be to
always run 'git -C <repo> run-job <job-name>' and rely on that
process to no-op based on the configured values and how recently
they were run. However, that will likely interrupt users who want
to test the jobs in the foreground. Finally, that user disruption
would be mitigated by adding a "--check-latest" option. A user
running a job manually would not include that argument by default
and would succeed. However, any logging that we might do for the
job-runner would make it look like we are running the run-job
commands all the time even if they are no-ops. Advice welcome!

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/job.txt  |  10 ++
 Documentation/git-run-job.txt |   3 +
 builtin/job-runner.c          | 177 +++++++++++++++++++++++++++++++++-
 3 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
index 2dfed3935fa..7c799d66221 100644
--- a/Documentation/config/job.txt
+++ b/Documentation/config/job.txt
@@ -1,3 +1,13 @@
+job.<job-name>.interval::
+	The minimum number of seconds between runs of
+	`git run-job <job-name>` when running `git job-runner`.
+
+job.<job-name>.lastRun::
+	The Unix epoch for the timestamp of the previous run of
+	`git run-job <job-name>` when running `git job-runner`. You
+	can manually update this to a later time to delay a specific
+	job on this repository.
+
 job.pack-files.batchSize::
 	This string value `<size>` will be passed to the
 	`git multi-pack-index repack --batch-size=<size>` command as
diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index cdd6417f7c9..c6d5674d699 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -90,6 +90,9 @@ pack-files, the job does not attempt to repack. Otherwise, the batch
 size is the sum of all pack-file sizes minus the largest pack-file size.
 The batch size is capped at two gigabytes. This intends to pack all
 small pack-files into a single pack-file.
++
+The `--batch-size=<size>` option will override the dynamic or configured
+batch size.
 
 
 GIT
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
index bed401f94bf..aee55c106e8 100644
--- a/builtin/job-runner.c
+++ b/builtin/job-runner.c
@@ -4,11 +4,175 @@
 #include "run-command.h"
 #include "string-list.h"
 
+#define MAX_TIMESTAMP ((timestamp_t)-1)
+
 static char const * const builtin_job_runner_usage[] = {
 	N_("git job-runner [<options>]"),
 	NULL
 };
 
+static char *config_name(const char *prefix,
+			 const char *job,
+			 const char *postfix)
+{
+	int postfix_dot = 0;
+	struct strbuf name = STRBUF_INIT;
+
+	if (prefix) {
+		strbuf_addstr(&name, prefix);
+		strbuf_addch(&name, '.');
+	}
+
+	if (job) {
+		strbuf_addstr(&name, job);
+		postfix_dot = 1;
+	}
+
+	if (postfix) {
+		if (postfix_dot)
+			strbuf_addch(&name, '.');
+		strbuf_addstr(&name, postfix);
+	}
+
+	return strbuf_detach(&name, NULL);
+}
+
+static int try_get_config(const char *job,
+			  const char *repo,
+			  const char *postfix,
+			  char **value)
+{
+	int result = 0;
+	FILE *proc_out;
+	struct strbuf line = STRBUF_INIT;
+	char *cname = config_name("job", job, postfix);
+	struct child_process *config_proc = xmalloc(sizeof(*config_proc));
+
+	child_process_init(config_proc);
+
+	argv_array_push(&config_proc->args, "git");
+	argv_array_push(&config_proc->args, "-C");
+	argv_array_push(&config_proc->args, repo);
+	argv_array_push(&config_proc->args, "config");
+	argv_array_push(&config_proc->args, cname);
+	free(cname);
+
+	config_proc->out = -1;
+
+	if (start_command(config_proc)) {
+		warning(_("failed to start process for repo '%s'"),
+			repo);
+		result = 1;
+		goto cleanup;
+	}
+
+	proc_out = xfdopen(config_proc->out, "r");
+
+	/* if there is no line, leave the value as given */
+	if (!strbuf_getline(&line, proc_out))
+		*value = strbuf_detach(&line, NULL);
+	else
+		*value = NULL;
+
+	strbuf_release(&line);
+
+	fclose(proc_out);
+
+	result = finish_command(config_proc);
+
+cleanup:
+	free(config_proc);
+	return result;
+}
+
+static int try_get_timestamp(const char *job,
+			     const char *repo,
+			     const char *postfix,
+			     timestamp_t *t)
+{
+	char *value;
+	int result = try_get_config(job, repo, postfix, &value);
+
+	if (!result) {
+		*t = atol(value);
+		free(value);
+	}
+
+	return result;
+}
+
+static timestamp_t get_last_run(const char *job,
+				const char *repo)
+{
+	timestamp_t last_run = 0;
+
+	/* In an error state, do not run the job */
+	if (try_get_timestamp(job, repo, "lastrun", &last_run))
+		return MAX_TIMESTAMP;
+
+	return last_run;
+}
+
+static timestamp_t get_interval(const char *job,
+				const char *repo)
+{
+	timestamp_t interval = MAX_TIMESTAMP;
+
+	/* In an error state, do not run the job */
+	if (try_get_timestamp(job, repo, "interval", &interval))
+		return MAX_TIMESTAMP;
+
+	if (interval == MAX_TIMESTAMP) {
+		/* load defaults for each job */
+		if (!strcmp(job, "commit-graph") || !strcmp(job, "fetch"))
+			interval = 60 * 60; /* 1 hour */
+		else
+			interval = 24 * 60 * 60; /* 1 day */
+	}
+
+	return interval;
+}
+
+static int set_last_run(const char *job,
+			const char *repo,
+			timestamp_t last_run)
+{
+	int result = 0;
+	struct strbuf last_run_string = STRBUF_INIT;
+	struct child_process *config_proc = xmalloc(sizeof(*config_proc));
+	char *cname = config_name("job", job, "lastrun");
+
+	strbuf_addf(&last_run_string, "%"PRItime, last_run);
+
+	child_process_init(config_proc);
+
+	argv_array_push(&config_proc->args, "git");
+	argv_array_push(&config_proc->args, "-C");
+	argv_array_push(&config_proc->args, repo);
+	argv_array_push(&config_proc->args, "config");
+	argv_array_push(&config_proc->args, cname);
+	argv_array_push(&config_proc->args, last_run_string.buf);
+	free(cname);
+	strbuf_release(&last_run_string);
+
+	if (start_command(config_proc)) {
+		warning(_("failed to start process for repo '%s'"),
+			repo);
+		result = 1;
+		goto cleanup;
+	}
+
+	if (finish_command(config_proc)) {
+		warning(_("failed to finish process for repo '%s'"),
+			repo);
+		result = 1;
+	}
+
+cleanup:
+	free(config_proc);
+	return result;
+}
+
 static struct string_list arg_repos = STRING_LIST_INIT_DUP;
 
 static int arg_repos_append(const struct option *opt,
@@ -54,9 +218,20 @@ static int load_active_repos(struct string_list *repos)
 
 static int run_job(const char *job, const char *repo)
 {
+	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	timestamp_t now = time(NULL);
+	timestamp_t last_run = get_last_run(job, repo);
+	timestamp_t interval = get_interval(job, repo);
+
+	if (last_run + interval > now)
+		return 0;
+
 	argv_array_pushl(&cmd, "-C", repo, "run-job", job, NULL);
-	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	set_last_run(job, repo, now);
+	return result;
 }
 
 static int run_job_loop_step(struct string_list *list)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 11/15] job-runner: use config for loop interval
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git job-runner' process uses a sleep(X) call to pause its
operation between iterations of the job loop. This defaults to 30
minutes, but is now available to be done with longer or shorter
intervals according to the job.loopInterval config option. For
example, a user may want the job loop to run once every five
minutes while another wants the job loop to run once every six
hours.

This config value is checked immediately before the sleep(X) call,
which allows users to see the effect without restarting the
job-runner process. However, the process will be paused until the
previous sleep(X) call returns and the job loop is executed.

RFC QUESITON: Is this use of sleep(X) the best way to do this?
Is there a better way to delay the process for a time interval, or
until a specified time? This just seemed like the simplest option.
The job-runner is doing low-priority work on an unpredictable
schedule by design, so sleep(X) seemd appropriate.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/job.txt | 4 ++++
 builtin/job-runner.c         | 6 +++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
index 7c799d66221..772001e6744 100644
--- a/Documentation/config/job.txt
+++ b/Documentation/config/job.txt
@@ -1,3 +1,7 @@
+job.loopInterval::
+	The number of seconds to sleep between rounds of running
+	background jobs in `git job-runner`.
+
 job.<job-name>.interval::
 	The minimum number of seconds between runs of
 	`git run-job <job-name>` when running `git job-runner`.
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
index aee55c106e8..7e37b122d99 100644
--- a/builtin/job-runner.c
+++ b/builtin/job-runner.c
@@ -261,7 +261,11 @@ static int run_job_loop_step(struct string_list *list)
 static unsigned int get_loop_interval(void)
 {
 	/* Default: 30 minutes */
-	return 30 * 60;
+	timestamp_t interval = 30 * 60;
+
+	try_get_timestamp(NULL, ".", "loopinterval", &interval);
+
+	return interval;
 }
 
 static int initialize_jobs(struct string_list *list)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 12/15] job-runner: add --interval=<span> option
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change used the job.loopInterval config option to
adjust the default sleep between iterations of the job loop in
'git job-runner'. Now, add a command-line option so a user can
override the configured value with an explicit value.

The value specified by this command-line option cannot be
changed without restarting the job-runner process.

RFC QUESTION: Are things like this useful, or should we only
use config options for this customization?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-job-runner.txt | 6 ++++++
 builtin/job-runner.c             | 7 +++++++
 2 files changed, 13 insertions(+)

diff --git a/Documentation/git-job-runner.txt b/Documentation/git-job-runner.txt
index 010b2d05f9b..0719113a008 100644
--- a/Documentation/git-job-runner.txt
+++ b/Documentation/git-job-runner.txt
@@ -33,6 +33,12 @@ OPTIONS
 	attempts running jobs on repositories located at the provided
 	`<dir>` values. This option can be specified multiple times.
 
+--interval=<span>::
+	The job runner pauses between each attempt to run jobs. Use the
+	specified `<span>` as a number of seconds between each attempt.
+	If this is not specified, then the default will be found from
+	`jobs.loopInterval` or the default value of 30 minutes.
+
 
 CONFIGURATION
 -------------
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
index 7e37b122d99..d1fca2c97b8 100644
--- a/builtin/job-runner.c
+++ b/builtin/job-runner.c
@@ -258,11 +258,16 @@ static int run_job_loop_step(struct string_list *list)
 	return result;
 }
 
+static unsigned int arg_interval = 0;
+
 static unsigned int get_loop_interval(void)
 {
 	/* Default: 30 minutes */
 	timestamp_t interval = 30 * 60;
 
+	if (arg_interval)
+		return arg_interval;
+
 	try_get_timestamp(NULL, ".", "loopinterval", &interval);
 
 	return interval;
@@ -287,6 +292,8 @@ int cmd_job_runner(int argc, const char **argv, const char *prefix)
 			       N_("<path>"),
 			       N_("run jobs on the repository at <path>"),
 			       PARSE_OPT_NONEG, arg_repos_append),
+		OPT_INTEGER(0, "interval", &arg_interval,
+			    N_("seconds to pause between running any jobs")),
 		OPT_END(),
 	};
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Allow a user to specify that a job should not be run on a given
repo by setting "job.<job-name>.enabled = false". The job-runner
will check this config before any other options and stop running
the job on that repo. If the config is disabled in the config
read by the job-runner itself (i.e. in global config) then the
job-runner will not attempt to run the job on any of the repos.

Since this config is checked dynamically, the job-runner does will
see config changes without needing to restart the process.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/job.txt |  4 ++++
 builtin/job-runner.c         | 26 ++++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
index 772001e6744..6c22a40dd36 100644
--- a/Documentation/config/job.txt
+++ b/Documentation/config/job.txt
@@ -2,6 +2,10 @@ job.loopInterval::
 	The number of seconds to sleep between rounds of running
 	background jobs in `git job-runner`.
 
+job.<job-name>.enabled::
+	If set to `false`, this option will disable the `<job-name>`
+	when run by `git job-runner`.
+
 job.<job-name>.interval::
 	The minimum number of seconds between runs of
 	`git run-job <job-name>` when running `git job-runner`.
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
index d1fca2c97b8..45f82a50d49 100644
--- a/builtin/job-runner.c
+++ b/builtin/job-runner.c
@@ -216,13 +216,31 @@ static int load_active_repos(struct string_list *repos)
 	return 0;
 }
 
+static int job_disabled(const char *job, const char *repo)
+{
+	char *enabled;
+
+	if (!try_get_config(job, repo, "enabled", &enabled)) {
+		int enabled_int = git_parse_maybe_bool(enabled);
+		free(enabled);
+		return !enabled_int;
+	}
+
+	return 0;
+}
+
 static int run_job(const char *job, const char *repo)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
 	timestamp_t now = time(NULL);
-	timestamp_t last_run = get_last_run(job, repo);
-	timestamp_t interval = get_interval(job, repo);
+	timestamp_t last_run, interval;
+
+	if (job_disabled(job, repo))
+		return 0;
+
+	last_run = get_last_run(job, repo);
+	interval = get_interval(job, repo);
 
 	if (last_run + interval > now)
 		return 0;
@@ -247,6 +265,10 @@ static int run_job_loop_step(struct string_list *list)
 	     !result && job && job < list->items + list->nr;
 	     job++) {
 		struct string_list_item *repo;
+
+		if (job_disabled(job->string, "."))
+			continue;
+
 		for (repo = repos.items;
 		     !result && repo && repo < repos.items + repos.nr;
 		     repo++)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 14/15] job-runner: add --daemonize option
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
  2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Git has the ability to fork the process and launch a "daemonized"
copy of the current process. This is used to launch a background
'git gc' command in some cases or to launch the 'git daemon'
server process.

Update the 'git job-runner' command with a --daemonize option that
launches this background process.

The implementation of daemonize() in setup.c is very clear that
this may no-op and return an error if NO_POSIX_GOODIES is not
defined. Include an error message to point out that this mechanism
may not be available on a platform-by-platform basis.

Include a clear error message that daemonize() might fail due to
platform incompatibilities.

I have been running the current version of this series on my
Linux VM using --daemonize to keep my copies of torvalds/linux and
git/git maintained. Using GIT_TRACE2_PERF set to a path, I can see
that the 'git run-job' processes are being created on the correct
schedule according to my config for each.

RFC QUESTION: I notice that 'git gc' does not document --daemonize.
Is that intentional? Or is it an oversight?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-job-runner.txt |  5 +++++
 builtin/job-runner.c             | 12 ++++++++++--
 cache.h                          |  4 +---
 daemon.h                         |  7 +++++++
 4 files changed, 23 insertions(+), 5 deletions(-)
 create mode 100644 daemon.h

diff --git a/Documentation/git-job-runner.txt b/Documentation/git-job-runner.txt
index 0719113a008..f48d6bcd10b 100644
--- a/Documentation/git-job-runner.txt
+++ b/Documentation/git-job-runner.txt
@@ -39,6 +39,11 @@ OPTIONS
 	If this is not specified, then the default will be found from
 	`jobs.loopInterval` or the default value of 30 minutes.
 
+--daemonize::
+	If supported by your platform, launch an identical
+	`git job-runner` process in the background and close the
+	foreground process immediately.
+
 
 CONFIGURATION
 -------------
diff --git a/builtin/job-runner.c b/builtin/job-runner.c
index 45f82a50d49..3b629428ef1 100644
--- a/builtin/job-runner.c
+++ b/builtin/job-runner.c
@@ -1,5 +1,6 @@
 #include "builtin.h"
 #include "config.h"
+#include "daemon.h"
 #include "parse-options.h"
 #include "run-command.h"
 #include "string-list.h"
@@ -78,7 +79,8 @@ static int try_get_config(const char *job,
 
 	fclose(proc_out);
 
-	result = finish_command(config_proc);
+	/* ignore result as 'git config' fails on non-existent value */
+	finish_command(config_proc);
 
 cleanup:
 	free(config_proc);
@@ -93,7 +95,7 @@ static int try_get_timestamp(const char *job,
 	char *value;
 	int result = try_get_config(job, repo, postfix, &value);
 
-	if (!result) {
+	if (!result && value) {
 		*t = atol(value);
 		free(value);
 	}
@@ -304,6 +306,7 @@ static int initialize_jobs(struct string_list *list)
 	return 0;
 }
 
+static int arg_daemonize = 0;
 int cmd_job_runner(int argc, const char **argv, const char *prefix)
 {
 	int result;
@@ -316,6 +319,8 @@ int cmd_job_runner(int argc, const char **argv, const char *prefix)
 			       PARSE_OPT_NONEG, arg_repos_append),
 		OPT_INTEGER(0, "interval", &arg_interval,
 			    N_("seconds to pause between running any jobs")),
+		OPT_BOOL(0, "daemonize", &arg_daemonize,
+			 N_("request to spawn a background process")),
 		OPT_END(),
 	};
 
@@ -328,6 +333,9 @@ int cmd_job_runner(int argc, const char **argv, const char *prefix)
 			     builtin_job_runner_usage,
 			     0);
 
+	if (arg_daemonize && daemonize())
+		die(_("failed to daemonize; this may not be available on your platform."));
+
 	result = initialize_jobs(&job_list);
 
 	while (!(result = run_job_loop_step(&job_list))) {
diff --git a/cache.h b/cache.h
index c77b95870a5..34ef690faf6 100644
--- a/cache.h
+++ b/cache.h
@@ -17,6 +17,7 @@
 #include "sha1-array.h"
 #include "repository.h"
 #include "mem-pool.h"
+#include "daemon.h"
 
 #include <zlib.h>
 typedef struct git_zstream {
@@ -631,9 +632,6 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	    unsigned int flags);
 void initialize_repository_version(int hash_algo);
 
-void sanitize_stdfds(void);
-int daemonize(void);
-
 #define alloc_nr(x) (((x)+16)*3/2)
 
 /**
diff --git a/daemon.h b/daemon.h
new file mode 100644
index 00000000000..c1e50a20354
--- /dev/null
+++ b/daemon.h
@@ -0,0 +1,7 @@
+#ifndef DAEMON_H__
+#define DAEMON_H__
+
+void sanitize_stdfds(void);
+int daemonize(void);
+
+#endif
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 15/15] runjob: customize the loose-objects batch size
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
  15 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Allow a user to override the default number of loose objects to
place into a new pack-file as part of the loose-objects job. This
can be done via the job.loose-objects.batchSize config option or
the --batch-size=<count> option in the 'git run-job' command. The
config value is checked once per run of 'git run-job loose-objects'
so an instance started by 'git job-runner' will use new values
automatically without restarting the 'git job-runner' process.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/job.txt  |  6 ++++++
 Documentation/git-run-job.txt |  8 +++++---
 builtin/run-job.c             | 31 ++++++++++++++++++++++++++-----
 3 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
index 6c22a40dd36..baa5b927e14 100644
--- a/Documentation/config/job.txt
+++ b/Documentation/config/job.txt
@@ -16,6 +16,12 @@ job.<job-name>.lastRun::
 	can manually update this to a later time to delay a specific
 	job on this repository.
 
+job.loose-objects.batchSize::
+	This string value `<count>` limits the number of loose-objects
+	collected into a single pack-file during the `loose-objects`
+	job. Default batch size is fifty thousand. See linkgit:git-run-job[1]
+	for more details.
+
 job.pack-files.batchSize::
 	This string value `<size>` will be passed to the
 	`git multi-pack-index repack --batch-size=<size>` command as
diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index c6d5674d699..73210791533 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -67,9 +67,11 @@ commands, it follows a two-step process. First, it deletes any loose
 objects that already exist in a pack-file; concurrent Git processes will
 examine the pack-file for the object data instead of the loose object.
 Second, it creates a new pack-file (starting with "loose-") containing
-a batch of loose objects. The batch size is limited to 50 thousand
-objects to prevent the job from taking too long on a repository with
-many loose objects.
+a batch of loose objects.
++
+By default, the batch size is limited to 50 thousand objects to prevent
+the job from taking too long on a repository with many loose objects.
+This can be overridden with the `--batch-size=<count>` option.
 
 'pack-files'::
 
diff --git a/builtin/run-job.c b/builtin/run-job.c
index 76765535e09..b7c5a74cdbb 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -13,6 +13,11 @@ static char const * const builtin_run_job_usage[] = {
 	NULL
 };
 
+static char const * const builtin_run_job_loose_objects_usage[] = {
+	N_("git run-job loose-objects [--batch-size=<count>]"),
+	NULL
+};
+
 static char const * const builtin_run_job_pack_file_usage[] = {
 	N_("git run-job pack-files [--batch-size=<size>]"),
 	NULL
@@ -183,7 +188,7 @@ static int write_loose_object_to_stdin(const struct object_id *oid,
 	return ++(d->count) > d->batch_size;
 }
 
-static int pack_loose(void)
+static int pack_loose(int batch_size)
 {
 	int result = 0;
 	struct write_loose_object_data data;
@@ -219,7 +224,7 @@ static int pack_loose(void)
 
 	data.in = xfdopen(pack_proc->in, "w");
 	data.count = 0;
-	data.batch_size = 50000;
+	data.batch_size = batch_size;
 
 	for_each_loose_file_in_objdir(the_repository->objects->odb->path,
 				      write_loose_object_to_stdin,
@@ -240,9 +245,25 @@ static int pack_loose(void)
 	return result;
 }
 
-static int run_loose_objects_job(void)
+static int run_loose_objects_job(int argc, const char **argv)
 {
-	return prune_packed() || pack_loose();
+	static int batch_size;
+	static struct option builtin_run_job_loose_objects_options[] = {
+		OPT_INTEGER(0, "batch-size", &batch_size,
+			    N_("specify the maximum number of loose objects to store in a pack-file")),
+		OPT_END(),
+	};
+
+	if (repo_config_get_int(the_repository,
+				"job.loose-objects.batchsize",
+				&batch_size))
+		batch_size = 50000;
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_run_job_loose_objects_options,
+			     builtin_run_job_loose_objects_usage, 0);
+
+	return prune_packed() || pack_loose(batch_size);
 }
 
 static int multi_pack_index_write(void)
@@ -427,7 +448,7 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 		if (!strcmp(argv[0], "fetch"))
 			return run_fetch_job();
 		if (!strcmp(argv[0], "loose-objects"))
-			return run_loose_objects_job();
+			return run_loose_objects_job(argc, argv);
 		if (!strcmp(argv[0], "pack-files"))
 			return run_pack_files_job(argc, argv);
 	}
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
@ 2020-04-03 21:40 ` Junio C Hamano
  2020-04-04  0:16   ` Derrick Stolee
  15 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-04-03 21:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  * git run-job <job-name>: This builtin will run a single instance of a
>    maintenance job.
>    
>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
>    that executes git run-job as a subcommand.

What does this have to do with "git", though?  IOW, why does this
have to be part of Git, so that those who would benefit from having
a mechanism that makes it easy to run regular maintenance tasks but
are not Git users (or those that want to do such maintenance tasks
that are not necessarily tied to "git") must use "git" to do so?

I'll find out later why it is so after reading thru 15 patches
myself, so no need to give a quick answer to the above; it was just
my knee-jerk reaction.

Thanks.





^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
@ 2020-04-04  0:16   ` Derrick Stolee
  2020-04-07  0:50     ` Danh Doan
  2020-04-07  1:48     ` brian m. carlson
  0 siblings, 2 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-04  0:16 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, Derrick Stolee

On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>>  * git run-job <job-name>: This builtin will run a single instance of a
>>    maintenance job.
>>    
>>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
>>    that executes git run-job as a subcommand.
> 
> What does this have to do with "git", though?  IOW, why does this
> have to be part of Git, so that those who would benefit from having
> a mechanism that makes it easy to run regular maintenance tasks but
> are not Git users (or those that want to do such maintenance tasks
> that are not necessarily tied to "git") must use "git" to do so?
> 
> I'll find out later why it is so after reading thru 15 patches
> myself, so no need to give a quick answer to the above; it was just
> my knee-jerk reaction.

That's a reasonable reaction. The short version of my reasoning is that
many many people _use_ Git but are not Git experts. While a Git expert
could find the right set of commands to run and at what frequency to
keep their repo clean, most users do not want to spend time learning
these commands. It's also worth our time as contributors to select what
a good set of non-intrusive maintenance tasks could be, and make them
easily accessible to users.

This series gets us half of the way there: a user interested in doing
background maintenance could figure out how to launch "git run-job" on
a schedule for their platform, or to launch "git job-runner" at start-
up. That's a lot simpler than learning how the commit-graph,
multi-pack-index, prune-packed, pack-objects, and fetch builtins work
with the complicated sets of arguments.

The second half would be to create a command such as

	git please-run-maintenance-on-this-repo

that initializes the background jobs and enables them on the repo they
are using. This allows the most casual of Git user to work efficiently
on very large repositories.

Sometimes it is hard to remember that people use Git because it is an
important tool for getting their work done. Time waiting for Git to do
a slow operation or being blocked on a triggered "git gc --auto" is
time they would rather be doing what they want to do. Background
maintenance is a way to reduce the time users spend blocked on Git and
increase their productivity on the more important things.

Of course, I'm biased to using very large repositories where the
existing maintenance process is insufficient. The design of these jobs
is taken directly from what we designed and built for VFS for Git and
Scalar over the winter of 2018-2019. These jobs were incredibly effective
in cleaning up repositories that were accumulating cruft for over a year
without any maintenance. Those repos have stayed clean and we haven't
found more maintenance tasks to be necesary.

I still believe that there are plenty of repos of similar size to the
Linux kernel that are in frequent use and could benefit from these
operations.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
@ 2020-04-05 15:10   ` Phillip Wood
  2020-04-05 19:21     ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Phillip Wood @ 2020-04-05 15:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, stolee, Derrick Stolee

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'git run-job' command will be used to execute a short-lived set
> of maintenance activities by a background job manager. The intention
> is to perform small batches of work that reduce the foreground time
> taken by repository maintenance such as 'git gc --auto'.
> 
> This change does the absolute minimum to create the builtin and show
> the usage output.
> 
> Provide an explicit warning that this command is experimental. The
> set of jobs may change, and each job could alter its behavior in
> future versions.
> 
> RFC QUESTION: This builtin is based on the background maintenance in
> Scalar. Specifically, this builtin is based on the "scalar run <job>"
> command [1] [2]. My default thought was to make this a "git run <job>"
> command to maximize similarity. However, it seems like "git run" is
> too generic. Or, am I being overly verbose for no reason?

Having read through this series I wondered if we wanted a single git 
command such as 'git maintenance' (suggestions of better names welcome) 
and then 'git run-job' could become 'git maintenance run', 'git 
job-runner' would become another subcommand (run-jobs or schedule-jobs?) 
and the 'git please-run-maintenance-on-this-repo' you mentioned in you 
email to Junio could become 'git maintenance init' (or maybe setup)

Best Wishes

Phillip

> [1] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#run-maintenance-in-the-foreground
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/RunVerb.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   .gitignore                    |  1 +
>   Documentation/git-run-job.txt | 36 +++++++++++++++++++++++++++++++++++
>   Makefile                      |  1 +
>   builtin.h                     |  1 +
>   builtin/run-job.c             | 28 +++++++++++++++++++++++++++
>   command-list.txt              |  1 +
>   git.c                         |  1 +
>   t/t7900-run-job.sh            | 15 +++++++++++++++
>   8 files changed, 84 insertions(+)
>   create mode 100644 Documentation/git-run-job.txt
>   create mode 100644 builtin/run-job.c
>   create mode 100755 t/t7900-run-job.sh
> 
> diff --git a/.gitignore b/.gitignore
> index 188bd1c3de1..5dea9d3b96b 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -144,6 +144,7 @@
>   /git-rev-parse
>   /git-revert
>   /git-rm
> +/git-run-job
>   /git-send-email
>   /git-send-pack
>   /git-serve
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> new file mode 100644
> index 00000000000..0627b3ed259
> --- /dev/null
> +++ b/Documentation/git-run-job.txt
> @@ -0,0 +1,36 @@
> +git-run-job(1)
> +==============
> +
> +NAME
> +----
> +git-run-job - Run a maintenance job. Intended for background operation.
> +
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'git run-job <job-name>'
> +
> +
> +DESCRIPTION
> +-----------
> +
> +Run a maintenance job on the current repository. This is available as a
> +command for a few reasons. First, the background job feature can launch
> +these commands on a schedule and each process will completely clear its
> +memory when complete. Second, an expert user could create their own job
> +schedule by running these jobs themselves.
> +
> +THIS COMMAND IS EXPERIMENTAL. THE SET OF AVAILABLE JOBS OR THEIR EXACT
> +BEHAVIOR MAY BE ALTERED IN THE FUTURE.
> +
> +
> +JOBS
> +----
> +
> +TBD
> +
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
> diff --git a/Makefile b/Makefile
> index ef1ff2228f0..f5f9c4d9e94 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1125,6 +1125,7 @@ BUILTIN_OBJS += builtin/rev-list.o
>   BUILTIN_OBJS += builtin/rev-parse.o
>   BUILTIN_OBJS += builtin/revert.o
>   BUILTIN_OBJS += builtin/rm.o
> +BUILTIN_OBJS += builtin/run-job.o
>   BUILTIN_OBJS += builtin/send-pack.o
>   BUILTIN_OBJS += builtin/shortlog.o
>   BUILTIN_OBJS += builtin/show-branch.o
> diff --git a/builtin.h b/builtin.h
> index 2b25a80cde3..3e0ddaaf67f 100644
> --- a/builtin.h
> +++ b/builtin.h
> @@ -220,6 +220,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix);
>   int cmd_rev_parse(int argc, const char **argv, const char *prefix);
>   int cmd_revert(int argc, const char **argv, const char *prefix);
>   int cmd_rm(int argc, const char **argv, const char *prefix);
> +int cmd_run_job(int argc, const char **argv, const char *prefix);
>   int cmd_send_pack(int argc, const char **argv, const char *prefix);
>   int cmd_shortlog(int argc, const char **argv, const char *prefix);
>   int cmd_show(int argc, const char **argv, const char *prefix);
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> new file mode 100644
> index 00000000000..2c78d053aa4
> --- /dev/null
> +++ b/builtin/run-job.c
> @@ -0,0 +1,28 @@
> +#include "builtin.h"
> +#include "config.h"
> +#include "parse-options.h"
> +
> +static char const * const builtin_run_job_usage[] = {
> +	N_("git run-job"),
> +	NULL
> +};
> +
> +int cmd_run_job(int argc, const char **argv, const char *prefix)
> +{
> +	static struct option builtin_run_job_options[] = {
> +		OPT_END(),
> +	};
> +
> +	if (argc == 2 && !strcmp(argv[1], "-h"))
> +		usage_with_options(builtin_run_job_usage,
> +				   builtin_run_job_options);
> +
> +	git_config(git_default_config, NULL);
> +	argc = parse_options(argc, argv, prefix,
> +			     builtin_run_job_options,
> +			     builtin_run_job_usage,
> +			     PARSE_OPT_KEEP_UNKNOWN);
> +
> +	usage_with_options(builtin_run_job_usage,
> +			   builtin_run_job_options);
> +}
> diff --git a/command-list.txt b/command-list.txt
> index 20878946558..1cd2b415e46 100644
> --- a/command-list.txt
> +++ b/command-list.txt
> @@ -156,6 +156,7 @@ git-revert                              mainporcelain
>   git-rev-list                            plumbinginterrogators
>   git-rev-parse                           plumbinginterrogators
>   git-rm                                  mainporcelain           worktree
> +git-run-job                             plumbingmanipulators
>   git-send-email                          foreignscminterface             complete
>   git-send-pack                           synchingrepositories
>   git-shell                               synchelpers
> diff --git a/git.c b/git.c
> index b07198fe036..db5a43c8687 100644
> --- a/git.c
> +++ b/git.c
> @@ -566,6 +566,7 @@ static struct cmd_struct commands[] = {
>   	{ "rev-parse", cmd_rev_parse, NO_PARSEOPT },
>   	{ "revert", cmd_revert, RUN_SETUP | NEED_WORK_TREE },
>   	{ "rm", cmd_rm, RUN_SETUP },
> +	{ "run-job", cmd_run_job, RUN_SETUP },
>   	{ "send-pack", cmd_send_pack, RUN_SETUP },
>   	{ "shortlog", cmd_shortlog, RUN_SETUP_GENTLY | USE_PAGER },
>   	{ "show", cmd_show, RUN_SETUP },
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> new file mode 100755
> index 00000000000..1eac80b7ed3
> --- /dev/null
> +++ b/t/t7900-run-job.sh
> @@ -0,0 +1,15 @@
> +#!/bin/sh
> +
> +test_description='git run-job
> +
> +Testing the background jobs, in the foreground
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'help text' '
> +	test_must_fail git run-job -h 2>err &&
> +	test_i18ngrep "usage: git run-job" err
> +'
> +
> +test_done
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
@ 2020-04-05 15:14   ` Phillip Wood
  2020-04-06 12:48     ` Derrick Stolee
  2020-04-05 20:28   ` Junio C Hamano
  2020-05-20 19:08   ` Josh Steadmon
  2 siblings, 1 reply; 57+ messages in thread
From: Phillip Wood @ 2020-04-05 15:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, stolee, Derrick Stolee

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
> 
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>     the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>     refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>     we can ensure that we actually load the new values somewhere in
>     our refspace while not updating refs/heads or refs/remotes. By
>     storing these refs here, the commit-graph job will update the
>     commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>     longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> RFC QUESTIONS:
> 
> 1. One downside of the refs/hidden pattern is that 'git log' will
>     decorate commits with twice as many refs if they appear at a
>     remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>     there an easy way to exclude a refspace from decorations? Should
>     we make refs/hidden/* a "special" refspace that is excluded from
>     decorations?

Having some way to specify which refs outside of 
refs/{heads,remote,tags}/ to show or exclude from decorations would be 
useful I think. Fetching to a hidden ref is a good idea (as are the 
other steps you outline above) but as you say we don't want it to show 
up in the output of 'git log' etc.

> 2. This feature is designed for a desktop machine or equivalent
>     that has a permanent wired network connection, and the machine
>     stays on while the user is not present. For things like laptops
>     with inconsistent WiFi connections (that may be metered) the
>     feature can use the less stable connection more than the user
>     wants. Of course, this feature is opt-in for Git, but in Scalar
>     we have a "scalar pause" command [2] that pauses all maintenance
>     for some amount of time. We should consider a similar mechanism
>     for Git, but for the point of this series the user needs to set
>     up the "background" part of these jobs manually.
> 
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/git-run-job.txt | 13 ++++-
>   builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
>   t/t7900-run-job.sh            | 22 +++++++++
>   3 files changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>   SYNOPSIS
>   --------
>   [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>   
>   
>   DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
>   `commit-graph-chain` file. They will be deleted by a later run based on
>   the expiration delay.
>   
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
>   
>   GIT
>   ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
>   #include "run-command.h"
>   
>   static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job commit-graph"),
> +	N_("git run-job (commit-graph|fetch)"),
>   	NULL
>   };
>   
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
>   	return 1;
>   }
>   
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)

Isn't there a easy way to get this using the config api rather than 
forking 'git remote'?

Best Wishes

Phillip

> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
> +
> +	child_process_init(remote_proc);
> +
> +	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> +	remote_proc->out = -1;
> +
> +	if (start_command(remote_proc)) {
> +		error(_("failed to start 'git remote' process"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(remote_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	while (!strbuf_getline(&line, proc_out))
> +		string_list_append(remotes, line.buf);
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	if (finish_command(remote_proc)) {
> +		error(_("failed to finish 'git remote' process"));
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(remote_proc);
> +	return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (fill_remotes(&remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>   int cmd_run_job(int argc, const char **argv, const char *prefix)
>   {
>   	static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>   	if (argc > 0) {
>   		if (!strcmp(argv[0], "commit-graph"))
>   			return run_commit_graph_job();
> +		if (!strcmp(argv[0], "fetch"))
> +			return run_fetch_job();
>   	}
>   
>   	usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
>   	)
>   '
>   
> +test_expect_success 'fetch job' '
> +	git clone "file://$(pwd)/server" client &&
> +
> +	# Before fetching, build a client commit-graph
> +	git -C client run-job commit-graph &&
> +	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> +	test_line_count = 1 $chain &&
> +
> +	git -C client branch -v --remotes >before-refs &&
> +	test_commit -C server 24 &&
> +
> +	git -C client run-job fetch &&
> +	git -C client branch -v --remotes >after-refs &&
> +	test_cmp before-refs after-refs &&
> +	test_cmp server/.git/refs/heads/master \
> +		 client/.git/refs/hidden/origin/master &&
> +
> +	# the hidden ref should trigger a new layer in the commit-graph
> +	git -C client run-job commit-graph &&
> +	test_line_count = 2 $chain
> +'
> +
>   test_done
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/15] job-runner: load repos from config by default
  2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
@ 2020-04-05 15:18   ` Phillip Wood
  2020-04-06 12:49     ` Derrick Stolee
  2020-04-05 15:41   ` Phillip Wood
  1 sibling, 1 reply; 57+ messages in thread
From: Phillip Wood @ 2020-04-05 15:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, stolee, Derrick Stolee

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'git job-runner' command was introduced with a '--repo=<path>'
> option so a user could start a process with an explicit list of
> repos. However, it is more likely that we will want 'git
> job-runner' to start without any arguments and discover the set of
> repos from another source.
> 
> A natural source is the Git config. The 'git job-runner' does not
> need to run in a Git repository, but the config could be located in
> the global or system config.
> 
> Create the job.repo config setting as a multi-valued config option.
> 
> Read all values for job.repo whenever triggering an iteration of
> the job loop. This allows a user to add or remove repos dynamically
> without restarting the job-runner.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/config/job.txt |  7 +++++++
>   builtin/job-runner.c         | 20 ++++++++++++++++++++
>   2 files changed, 27 insertions(+)
> 
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index efdb76afad3..2dfed3935fa 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -4,3 +4,10 @@ job.pack-files.batchSize::
>   	part of `git run-job pack-files`. If not specified, then a
>   	dynamic size calculation is run. See linkgit:git-run-job[1]
>   	for more details.
> +
> +job.repo::
> +	Store an absolute path to a repository that wants background
> +	jobs run by `git job-runner`. This is a multi-valued config
> +	option, but it must be stored in a config file seen by the
> +	background runner. This may be the global or system config
> +	depending on how `git job-runner` is launched on your system.
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index 135288bcaae..bed401f94bf 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
>   
>   static int load_active_repos(struct string_list *repos)
>   {
> +	struct string_list_item *item;
> +	const struct string_list *config_repos;
> +
>   	if (arg_repos.nr) {
>   		struct string_list_item *item;
>   		for (item = arg_repos.items;
> @@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
>   		return 0;
>   	}
>   
> +	config_repos = git_config_get_value_multi("job.repo");

Does this really re-read the config files as the commit message suggests 
or just return the cached values? Perhaps the runner could exec itself 
with a --run-jobs option to actually run the jobs so that it sees any 
updated config values.

Best Wishes

Phillip

> +
> +	for (item = config_repos->items;
> +	     item && item < config_repos->items + config_repos->nr;
> +	     item++) {
> +		DIR *dir = opendir(item->string);
> +
> +		if (!dir)
> +			continue;
> +
> +		closedir(dir);
> +		string_list_append(repos, item->string);
> +	}
> +
> +	string_list_sort(repos);
> +	string_list_remove_duplicates(repos, 0);
> +
>   	return 0;
>   }
>   
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 10/15] job-runner: use config to limit job frequency
  2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
@ 2020-04-05 15:24   ` Phillip Wood
  0 siblings, 0 replies; 57+ messages in thread
From: Phillip Wood @ 2020-04-05 15:24 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, stolee, Derrick Stolee

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> We want to run different maintenance tasks at different rates. That
> means we cannot rely only on the time between job-runner loop pauses
> to reduce the frequency of specific jobs. This means we need to
> persist a timestamp for the previous run somewhere. A natural place
> is the Git config file for that repo.
> 
> Create the job.<job-namme>.lastRun config option to store the
> timestamp of the previous run of that job. Set this config option
> after a successful run of 'git -C <repo> run-job <job-name>'.
> 
> To maximize code reuse, we dynamically construct the config key
> and parse the config value into a timestamp in a generic way. This
> makes the introduction of another config option extremely trivial:
> 
> The job.<job-name>.interval option allows a user to specify a
> minimum number of seconds between two calls to 'git run-job
> <job-name>' on a given repo. This could be stored in the global
> or system config to provide an update on the default for all repos,
> or could be updated on a per-repo basis. This is checked on every
> iteration of the job loop, so a user could update this and see the
> effect without restarting the job-runner process.
> 
> RFC QUESTION: I'm using a 'git -C <repo> config <option>' process
> to test if enough time has elapsed before running the 'run-job'
> process. These 'config' processes are pretty light-weight, so
> hopefully they are not too noisy. An alternative would be to
> always run 'git -C <repo> run-job <job-name>' and rely on that
> process to no-op based on the configured values and how recently
> they were run.

You're still executing another process so it doesn't really save 
anything in the 'noop' case. In the case where something needs to be 
done I think the extra config process is unlikely to be noticed 
(especially as 'git job' forks a lot anyway)

Best Wishes

Phillip

> However, that will likely interrupt users who want
> to test the jobs in the foreground. Finally, that user disruption
> would be mitigated by adding a "--check-latest" option. A user
> running a job manually would not include that argument by default
> and would succeed. However, any logging that we might do for the
> job-runner would make it look like we are running the run-job
> commands all the time even if they are no-ops. Advice welcome!
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/config/job.txt  |  10 ++
>   Documentation/git-run-job.txt |   3 +
>   builtin/job-runner.c          | 177 +++++++++++++++++++++++++++++++++-
>   3 files changed, 189 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index 2dfed3935fa..7c799d66221 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -1,3 +1,13 @@
> +job.<job-name>.interval::
> +	The minimum number of seconds between runs of
> +	`git run-job <job-name>` when running `git job-runner`.
> +
> +job.<job-name>.lastRun::
> +	The Unix epoch for the timestamp of the previous run of
> +	`git run-job <job-name>` when running `git job-runner`. You
> +	can manually update this to a later time to delay a specific
> +	job on this repository.
> +
>   job.pack-files.batchSize::
>   	This string value `<size>` will be passed to the
>   	`git multi-pack-index repack --batch-size=<size>` command as
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index cdd6417f7c9..c6d5674d699 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -90,6 +90,9 @@ pack-files, the job does not attempt to repack. Otherwise, the batch
>   size is the sum of all pack-file sizes minus the largest pack-file size.
>   The batch size is capped at two gigabytes. This intends to pack all
>   small pack-files into a single pack-file.
> ++
> +The `--batch-size=<size>` option will override the dynamic or configured
> +batch size.
>   
>   
>   GIT
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index bed401f94bf..aee55c106e8 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -4,11 +4,175 @@
>   #include "run-command.h"
>   #include "string-list.h"
>   
> +#define MAX_TIMESTAMP ((timestamp_t)-1)
> +
>   static char const * const builtin_job_runner_usage[] = {
>   	N_("git job-runner [<options>]"),
>   	NULL
>   };
>   
> +static char *config_name(const char *prefix,
> +			 const char *job,
> +			 const char *postfix)
> +{
> +	int postfix_dot = 0;
> +	struct strbuf name = STRBUF_INIT;
> +
> +	if (prefix) {
> +		strbuf_addstr(&name, prefix);
> +		strbuf_addch(&name, '.');
> +	}
> +
> +	if (job) {
> +		strbuf_addstr(&name, job);
> +		postfix_dot = 1;
> +	}
> +
> +	if (postfix) {
> +		if (postfix_dot)
> +			strbuf_addch(&name, '.');
> +		strbuf_addstr(&name, postfix);
> +	}
> +
> +	return strbuf_detach(&name, NULL);
> +}
> +
> +static int try_get_config(const char *job,
> +			  const char *repo,
> +			  const char *postfix,
> +			  char **value)
> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	char *cname = config_name("job", job, postfix);
> +	struct child_process *config_proc = xmalloc(sizeof(*config_proc));
> +
> +	child_process_init(config_proc);
> +
> +	argv_array_push(&config_proc->args, "git");
> +	argv_array_push(&config_proc->args, "-C");
> +	argv_array_push(&config_proc->args, repo);
> +	argv_array_push(&config_proc->args, "config");
> +	argv_array_push(&config_proc->args, cname);
> +	free(cname);
> +
> +	config_proc->out = -1;
> +
> +	if (start_command(config_proc)) {
> +		warning(_("failed to start process for repo '%s'"),
> +			repo);
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(config_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	if (!strbuf_getline(&line, proc_out))
> +		*value = strbuf_detach(&line, NULL);
> +	else
> +		*value = NULL;
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	result = finish_command(config_proc);
> +
> +cleanup:
> +	free(config_proc);
> +	return result;
> +}
> +
> +static int try_get_timestamp(const char *job,
> +			     const char *repo,
> +			     const char *postfix,
> +			     timestamp_t *t)
> +{
> +	char *value;
> +	int result = try_get_config(job, repo, postfix, &value);
> +
> +	if (!result) {
> +		*t = atol(value);
> +		free(value);
> +	}
> +
> +	return result;
> +}
> +
> +static timestamp_t get_last_run(const char *job,
> +				const char *repo)
> +{
> +	timestamp_t last_run = 0;
> +
> +	/* In an error state, do not run the job */
> +	if (try_get_timestamp(job, repo, "lastrun", &last_run))
> +		return MAX_TIMESTAMP;
> +
> +	return last_run;
> +}
> +
> +static timestamp_t get_interval(const char *job,
> +				const char *repo)
> +{
> +	timestamp_t interval = MAX_TIMESTAMP;
> +
> +	/* In an error state, do not run the job */
> +	if (try_get_timestamp(job, repo, "interval", &interval))
> +		return MAX_TIMESTAMP;
> +
> +	if (interval == MAX_TIMESTAMP) {
> +		/* load defaults for each job */
> +		if (!strcmp(job, "commit-graph") || !strcmp(job, "fetch"))
> +			interval = 60 * 60; /* 1 hour */
> +		else
> +			interval = 24 * 60 * 60; /* 1 day */
> +	}
> +
> +	return interval;
> +}
> +
> +static int set_last_run(const char *job,
> +			const char *repo,
> +			timestamp_t last_run)
> +{
> +	int result = 0;
> +	struct strbuf last_run_string = STRBUF_INIT;
> +	struct child_process *config_proc = xmalloc(sizeof(*config_proc));
> +	char *cname = config_name("job", job, "lastrun");
> +
> +	strbuf_addf(&last_run_string, "%"PRItime, last_run);
> +
> +	child_process_init(config_proc);
> +
> +	argv_array_push(&config_proc->args, "git");
> +	argv_array_push(&config_proc->args, "-C");
> +	argv_array_push(&config_proc->args, repo);
> +	argv_array_push(&config_proc->args, "config");
> +	argv_array_push(&config_proc->args, cname);
> +	argv_array_push(&config_proc->args, last_run_string.buf);
> +	free(cname);
> +	strbuf_release(&last_run_string);
> +
> +	if (start_command(config_proc)) {
> +		warning(_("failed to start process for repo '%s'"),
> +			repo);
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	if (finish_command(config_proc)) {
> +		warning(_("failed to finish process for repo '%s'"),
> +			repo);
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(config_proc);
> +	return result;
> +}
> +
>   static struct string_list arg_repos = STRING_LIST_INIT_DUP;
>   
>   static int arg_repos_append(const struct option *opt,
> @@ -54,9 +218,20 @@ static int load_active_repos(struct string_list *repos)
>   
>   static int run_job(const char *job, const char *repo)
>   {
> +	int result;
>   	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	timestamp_t now = time(NULL);
> +	timestamp_t last_run = get_last_run(job, repo);
> +	timestamp_t interval = get_interval(job, repo);
> +
> +	if (last_run + interval > now)
> +		return 0;
> +
>   	argv_array_pushl(&cmd, "-C", repo, "run-job", job, NULL);
> -	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	set_last_run(job, repo, now);
> +	return result;
>   }
>   
>   static int run_job_loop_step(struct string_list *list)
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/15] job-runner: load repos from config by default
  2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
  2020-04-05 15:18   ` Phillip Wood
@ 2020-04-05 15:41   ` Phillip Wood
  2020-04-06 12:57     ` Derrick Stolee
  1 sibling, 1 reply; 57+ messages in thread
From: Phillip Wood @ 2020-04-05 15:41 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, stolee, Derrick Stolee

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'git job-runner' command was introduced with a '--repo=<path>'
> option so a user could start a process with an explicit list of
> repos. However, it is more likely that we will want 'git
> job-runner' to start without any arguments and discover the set of
> repos from another source.

One thought that occurred to me was that it might be convenient to put 
'git job-runner $HOME &' in my .bashrc to have it start on login and 
find all the repositories under $HOME without me having to list each one 
(and remember to update the list each time a clone a new repository). 
There are a couple of issues with this
  1 - We only want to run the jobs once per repo, not for each worktree. 
That should be easy enough to implement by checking if we're in the main 
worktree.
  2 - If I start several shells I only want one instance of 'git 
job-runner' to start. One way to handle that would be for the runner to 
hold a lock file in the directory it's given and quit if the lock is 
already taken. It should probably walk up the directory hierarchy to 
check for lock files as well in case I try to start another job-runner 
instance in a subdirectory.

Best Wishes

Phillip


> A natural source is the Git config. The 'git job-runner' does not
> need to run in a Git repository, but the config could be located in
> the global or system config.
> 
> Create the job.repo config setting as a multi-valued config option.
> 
> Read all values for job.repo whenever triggering an iteration of
> the job loop. This allows a user to add or remove repos dynamically
> without restarting the job-runner.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/config/job.txt |  7 +++++++
>   builtin/job-runner.c         | 20 ++++++++++++++++++++
>   2 files changed, 27 insertions(+)
> 
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index efdb76afad3..2dfed3935fa 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -4,3 +4,10 @@ job.pack-files.batchSize::
>   	part of `git run-job pack-files`. If not specified, then a
>   	dynamic size calculation is run. See linkgit:git-run-job[1]
>   	for more details.
> +
> +job.repo::
> +	Store an absolute path to a repository that wants background
> +	jobs run by `git job-runner`. This is a multi-valued config
> +	option, but it must be stored in a config file seen by the
> +	background runner. This may be the global or system config
> +	depending on how `git job-runner` is launched on your system.
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index 135288bcaae..bed401f94bf 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
>   
>   static int load_active_repos(struct string_list *repos)
>   {
> +	struct string_list_item *item;
> +	const struct string_list *config_repos;
> +
>   	if (arg_repos.nr) {
>   		struct string_list_item *item;
>   		for (item = arg_repos.items;
> @@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
>   		return 0;
>   	}
>   
> +	config_repos = git_config_get_value_multi("job.repo");
> +
> +	for (item = config_repos->items;
> +	     item && item < config_repos->items + config_repos->nr;
> +	     item++) {
> +		DIR *dir = opendir(item->string);
> +
> +		if (!dir)
> +			continue;
> +
> +		closedir(dir);
> +		string_list_append(repos, item->string);
> +	}
> +
> +	string_list_sort(repos);
> +	string_list_remove_duplicates(repos, 0);
> +
>   	return 0;
>   }
>   
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-05 15:10   ` Phillip Wood
@ 2020-04-05 19:21     ` Junio C Hamano
  2020-04-06 14:42       ` Derrick Stolee
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-04-05 19:21 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Derrick Stolee via GitGitGadget, git, peff, jrnieder, stolee,
	Derrick Stolee

Phillip Wood <phillip.wood123@gmail.com> writes:

> Hi Stolee
>
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The 'git run-job' command will be used to execute a short-lived set
>> of maintenance activities by a background job manager. The intention
>> is to perform small batches of work that reduce the foreground time
>> taken by repository maintenance such as 'git gc --auto'.
>>
>> This change does the absolute minimum to create the builtin and show
>> the usage output.
>>
>> Provide an explicit warning that this command is experimental. The
>> set of jobs may change, and each job could alter its behavior in
>> future versions.
>>
>> RFC QUESTION: This builtin is based on the background maintenance in
>> Scalar. Specifically, this builtin is based on the "scalar run <job>"
>> command [1] [2]. My default thought was to make this a "git run <job>"
>> command to maximize similarity. However, it seems like "git run" is
>> too generic. Or, am I being overly verbose for no reason?
>
> Having read through this series I wondered if we wanted a single git
> command such as 'git maintenance' (suggestions of better names
> welcome) and then 'git run-job' could become 'git maintenance run',
> 'git job-runner' would become another subcommand (run-jobs or
> schedule-jobs?) and the 'git please-run-maintenance-on-this-repo' you
> mentioned in you email to Junio could become 'git maintenance init'
> (or maybe setup)

I had a very similar impression.  In addition to what you already
said, a few more were:

 - Why the existing "git repack" isn't such "maintenance" command?
   IOW why do we even need [01/15]?  After all, "repack" may have
   started its life as a tool to reorganize the PACKFILES, but it is
   no longer limited to 'git/objects/pack/*.pack' files with its
   knowledge about the loose object files and the "--prune" option.
   Consolidating pieces of information spread across multiple .idx
   files, reachability bitmaps and commit graph files, into a newer
   and more performant forms can just be part of "packing the pieces
   of information in a repository for optimum performance", which is
   a better way to understand why "repack" has a word 'pack' in its
   name.

 - Many of the "maintenance" operations this series proposes do make
   sense, just like other "maintenance" operations we already have
   in "repack", "prune", "prune-packed" etc., which are welcome
   additions. 

 - Like the individual steps that appear in e.g. "repack", however,
   some of the individual steps in this series can be triggered by
   calling underlying tools directly, allowing scripted maintenance
   commands that suit individual needs better than the canned
   invocation of "run-job", but I didn't get the impression that the
   series strives to make sure that all knobs of these individual
   steps are available to scripters who want to deviate from what
   "run-job" prescribes.  If it is not doing so, we probably should.

 - Again, I do not think we want a reimplementation of cron, at or
   inetd that is not specific to "git" at all.

Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
  2020-04-05 15:14   ` Phillip Wood
@ 2020-04-05 20:28   ` Junio C Hamano
  2020-04-06 12:46     ` Derrick Stolee
  2020-05-20 19:08   ` Josh Steadmon
  2 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-04-05 20:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations?

I do not think there is, but it makes sense to teach the decoration
machinery to either use only certain refs hierarchies or use all
hierarchies except for certain ones; if we want to make sure we
won't break existing workflows, we should by defautlt use all the
refs we currently use and nothing else, but over time we probably
would want to migrate the default to cover only the local and
remote-tracking branches and tags (and at that point, refs/hidden
would be outside the decoration source).

By the way, I have a moderately strong opinion against the use of
"refs/hidden" for the purpose of "prefetch in advance, without
disrupting refs/remotes".  There may be more than one reason why
some refs want to be "hidden", and depending on the purose, the
exact refs that one workflow (e.g. "decorate") wants to hide may be
the ones that want to be exposed.

If we rename it to "refs/prefetch/", would it make the purpose of
the hierarchy clearer without squatting on a vague (because it does
not tell why it is hidden) name "hidden" that other people might
want to use to hide their stuff for different reasons?

> Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?

See above.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04/15] run-job: implement loose-objects job
  2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
@ 2020-04-05 20:33   ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-04-05 20:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Create a 'loose-objects' job for the 'git run-job' command. This
> helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:

This is exactly the kind of thing I would have expected you'd be
adding to "git repack".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-05 20:28   ` Junio C Hamano
@ 2020-04-06 12:46     ` Derrick Stolee
  0 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-06 12:46 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, Derrick Stolee

On 4/5/2020 4:28 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>    decorate commits with twice as many refs if they appear at a
>>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>    there an easy way to exclude a refspace from decorations?
> 
> I do not think there is, but it makes sense to teach the decoration
> machinery to either use only certain refs hierarchies or use all
> hierarchies except for certain ones; if we want to make sure we
> won't break existing workflows, we should by defautlt use all the
> refs we currently use and nothing else, but over time we probably
> would want to migrate the default to cover only the local and
> remote-tracking branches and tags (and at that point, refs/hidden
> would be outside the decoration source).

I'll see what I can do about adding config to remove some refs from
decorations. That is immediately useful for Scalar users, too.

> By the way, I have a moderately strong opinion against the use of
> "refs/hidden" for the purpose of "prefetch in advance, without
> disrupting refs/remotes".  There may be more than one reason why
> some refs want to be "hidden", and depending on the purose, the
> exact refs that one workflow (e.g. "decorate") wants to hide may be
> the ones that want to be exposed.
> 
> If we rename it to "refs/prefetch/", would it make the purpose of
> the hierarchy clearer without squatting on a vague (because it does
> not tell why it is hidden) name "hidden" that other people might
> want to use to hide their stuff for different reasons?

I like "refs/prefetch". That's more descriptive.

>> Should
>>    we make refs/hidden/* a "special" refspace that is excluded from
>>    decorations?
> 
> See above.
Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-05 15:14   ` Phillip Wood
@ 2020-04-06 12:48     ` Derrick Stolee
  0 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-06 12:48 UTC (permalink / raw)
  To: phillip.wood, Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, Derrick Stolee

On 4/5/2020 11:14 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>> RFC QUESTIONS:
>>
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>     decorate commits with twice as many refs if they appear at a
>>     remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>     there an easy way to exclude a refspace from decorations? Should
>>     we make refs/hidden/* a "special" refspace that is excluded from
>>     decorations?
> 
> Having some way to specify which refs outside of refs/{heads,remote,tags}/ to show or exclude from decorations would be useful I think. Fetching to a hidden ref is a good idea (as are the other steps you outline above) but as you say we don't want it to show up in the output of 'git log' etc.

I'll work on this first. It seems less controversial.

>> +static int fill_remotes(struct string_list *remotes)
> 
> Isn't there a easy way to get this using the config api rather than forking 'git remote'?

You're right. I should have used the config here. I've been overly
biased to how we do it in the C# layer of Scalar _plus_ how I wanted
the job-runner to run "outside" a Git repository. This could be much
simpler with the config API.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/15] job-runner: load repos from config by default
  2020-04-05 15:18   ` Phillip Wood
@ 2020-04-06 12:49     ` Derrick Stolee
  0 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-06 12:49 UTC (permalink / raw)
  To: phillip.wood, Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, Derrick Stolee

On 4/5/2020 11:18 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>>   +    config_repos = git_config_get_value_multi("job.repo");
> 
> Does this really re-read the config files as the commit message suggests or just return the cached values? Perhaps the runner could exec itself with a --run-jobs option to actually run the jobs so that it sees any updated config values.

You're right, I need to double-check that. I'm guessing that you are
right and it uses the cached values. This should use the run_command()
pattern to ensure the config is as new as possible.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09/15] job-runner: load repos from config by default
  2020-04-05 15:41   ` Phillip Wood
@ 2020-04-06 12:57     ` Derrick Stolee
  0 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-06 12:57 UTC (permalink / raw)
  To: phillip.wood, Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, Derrick Stolee

On 4/5/2020 11:41 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The 'git job-runner' command was introduced with a '--repo=<path>'
>> option so a user could start a process with an explicit list of
>> repos. However, it is more likely that we will want 'git
>> job-runner' to start without any arguments and discover the set of
>> repos from another source.
> 
> One thought that occurred to me was that it might be convenient to put
> 'git job-runner $HOME &' in my .bashrc to have it start on login and
> find all the repositories under $HOME without me having to list each
> one (and remember to update the list each time a clone a new repository).
> There are a couple of issues with this
> 1 - We only want to run the jobs once per repo, not for each worktree.
> That should be easy enough to implement by checking if we're in the
> main worktree.

This idea of "please check all (immediate) directories under <dir> for
repositories to run maintenance" is an interesting one. It certainly
could be added later.

Your concern about worktrees is actually not a big deal. Since we check
the config for the "last run" of a job on a repo, we will not run the
same job immediately on a worktree after running it on the original
(unless the interval times are incredibly short).

> 2 - If I start several shells I only want one instance of 'git
> job-runner' to start. One way to handle that would be for the runner
> to hold a lock file in the directory it's given and quit if the lock
> is already taken. It should probably walk up the directory hierarchy
> to check for lock files as well in case I try to start another
> job-runner instance in a subdirectory.

This is an interesting idea. My gut reaction is that we don't want
to prevent users from running multiple processes if they want to. But
if they run the process twice in the same directory then they are
likely to be running on the same set of repos (barring "-c jobs.repo"
or equivalent).

Again, my hope is that we would solve this problem by having a cross-
platform "job service" that makes the user setup of editing .bashrc
irrelevant. Not only is that idea getting push back, we should allow
expert users the ability to customize these steps to their delight.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-05 19:21     ` Junio C Hamano
@ 2020-04-06 14:42       ` Derrick Stolee
  2020-04-07  0:58         ` Danh Doan
  0 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee @ 2020-04-06 14:42 UTC (permalink / raw)
  To: Junio C Hamano, Phillip Wood
  Cc: Derrick Stolee via GitGitGadget, git, peff, jrnieder, Derrick Stolee

On 4/5/2020 3:21 PM, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> Hi Stolee
>>
>> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The 'git run-job' command will be used to execute a short-lived set
>>> of maintenance activities by a background job manager. The intention
>>> is to perform small batches of work that reduce the foreground time
>>> taken by repository maintenance such as 'git gc --auto'.
>>>
>>> This change does the absolute minimum to create the builtin and show
>>> the usage output.
>>>
>>> Provide an explicit warning that this command is experimental. The
>>> set of jobs may change, and each job could alter its behavior in
>>> future versions.
>>>
>>> RFC QUESTION: This builtin is based on the background maintenance in
>>> Scalar. Specifically, this builtin is based on the "scalar run <job>"
>>> command [1] [2]. My default thought was to make this a "git run <job>"
>>> command to maximize similarity. However, it seems like "git run" is
>>> too generic. Or, am I being overly verbose for no reason?
>>
>> Having read through this series I wondered if we wanted a single git
>> command such as 'git maintenance' (suggestions of better names
>> welcome) and then 'git run-job' could become 'git maintenance run',
>> 'git job-runner' would become another subcommand (run-jobs or
>> schedule-jobs?) and the 'git please-run-maintenance-on-this-repo' you
>> mentioned in you email to Junio could become 'git maintenance init'
>> (or maybe setup)
> 
> I had a very similar impression.  In addition to what you already
> said, a few more were:
> 
>  - Why the existing "git repack" isn't such "maintenance" command?
>    IOW why do we even need [01/15]?  After all, "repack" may have
>    started its life as a tool to reorganize the PACKFILES, but it is
>    no longer limited to 'git/objects/pack/*.pack' files with its
>    knowledge about the loose object files and the "--prune" option.
>    Consolidating pieces of information spread across multiple .idx
>    files, reachability bitmaps and commit graph files, into a newer
>    and more performant forms can just be part of "packing the pieces
>    of information in a repository for optimum performance", which is
>    a better way to understand why "repack" has a word 'pack' in its
>    name.

To me, "git repack" is a specific kind of maintenance. The end result
is a pack-file. Now, "git gc" is a bit more general, because it will
create a pack-file but also update the commit-graph file. Still, its
name is still very specific: it "collects garbage". The goals of this
series are to replace "git gc --auto" with something less invasive.

I'll include an alternate CLI proposal at the end of this message.

>  - Many of the "maintenance" operations this series proposes do make
>    sense, just like other "maintenance" operations we already have
>    in "repack", "prune", "prune-packed" etc., which are welcome
>    additions. 

Thanks. I'm glad these steps make sense. They are definitely more
"incremental" updates than a full repack or GC.
 
>  - Like the individual steps that appear in e.g. "repack", however,
>    some of the individual steps in this series can be triggered by
>    calling underlying tools directly, allowing scripted maintenance
>    commands that suit individual needs better than the canned
>    invocation of "run-job", but I didn't get the impression that the
>    series strives to make sure that all knobs of these individual
>    steps are available to scripters who want to deviate from what
>    "run-job" prescribes.  If it is not doing so, we probably should.
> 
>  - Again, I do not think we want a reimplementation of cron, at or
>    inetd that is not specific to "git" at all.

I expected the job-runner to get some push-back. The design for it in
the current RFC matched how we do it in Scalar more than anything else.
You're probably right that it would be better to leave the "background"
part to the platform.

Of course, not every platform has "cron" but that just means we need a
cross-platform way to launch Git processes on some schedule. That could
be a command that creates a cron job on platforms that have it, and on
Windows it could create a scheduled task instead.

But what should we launch? It should probably be a Git command that
checks config for a list of repositories, then runs "the maintenance
command" on each of those repos.

I'm inserting a break here to draw the eye to a new proposed design:

---

Create a "git maintenance" builtin. This has a few subcommands:

1. "run" will run the configured maintenance on the current repo. This
   should become the single entry point for users to say "please clean
   up my repo." What _exactly_ it does can be altered with config. I'll
   list some possibilities after listing the subcommands.

2. "run-on-repos" uses command-line arguments or config to launch "git
   -C <dir> maintenance run" for all configured directories. The
   intention is that this is launched on some schedule by a platform-
   specific scheduling mechanism (i.e. cron).
   (This subcommand could use a better name.)

3. "schedule" adds the current repository to the configured list of
   repositories for running with "run-on-repos". It will also initialize
   the platform-specific scheduling mechanism. This may be to start the
   schedule for the first time OR to update how frequent "run-on-repos"
   is run, as appropriate.

4. (OPTIONAL) "mode <mode>" adjusts the config for the current repo to
   change the type of maintenance requested for this repo. For example,
   "simple" could just run "git gc --auto" using a normal range.
   "incremental" could run the maintenance tasks from this series.
   Finally, "server" could run maintenance tasks as if we are serving
   the repo to others, so we repack aggressively with full bitmaps, and
   more frequently.

Here are some possible maintenance tasks. Not all of them would
be appropriate to run on the same repo, or at least not with the
same frequency:

* "fetch" : the background fetch from PATCH 3. Appropriate for all modes,
  but perhaps would want users to opt-in to this in the  basic mode.

* "commit-graph" : the incremental commit-graph writes from PATCH 2.
  Appropriate whenever the "fetch" command is being run, but also
  valuable for the "server" mode.

* "gc" : Run "git gc --auto". This would be enabled by default, but
  should be disabled for the "incremental" and "server" modes.

* "repack" : Run "git repack <options>" with appropriate options based
  on config. The "server" mode would include custom delta and bitmap
  options. (I will leave the specifics to those who maintain servers to
  recommend the best options for "server" mode.)

* "loose-objects" : see PATCH 4. Appropriate for "incremental" mode.

* "multi-pack-index" or "incremental-repack" : Run the "pack-files" job
  from PATCH 5. Appropriate for "incremental" mode.

* "pack-refs" : create a packed-refs file or repack the reftable as
  appropriate for those features. (I have less familiarity with these.)

Notice that with this new set of options we could do something rather
dramatic: replace all calls to "git gc --auto" with "git maintenance
run --auto". By default, these would be equivalent. However, "git
maintenance run --auto" is more clear that the behavior is less specific
than "git gc" and could be configured to do something different.

I used an "--auto" option in the suggestion above to help distinguish
between the command being run as a foreground operation instead of a
background operation. Part of setting up a schedule would include
disabling these "foreground" maintenance tasks and relying entirely on
the background tasks instead. The best situation would be to avoid
launching the subprocess at all.

---

What do people think of this alternative? Does this get us closer to an
appropriate level of work for Git to do?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-04  0:16   ` Derrick Stolee
@ 2020-04-07  0:50     ` Danh Doan
  2020-04-07 10:59       ` Derrick Stolee
  2020-04-07  1:48     ` brian m. carlson
  1 sibling, 1 reply; 57+ messages in thread
From: Danh Doan @ 2020-04-07  0:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

On 2020-04-03 20:16:21-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > 
> >>  * git run-job <job-name>: This builtin will run a single instance of a
> >>    maintenance job.
> >>    
> >>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> >>    that executes git run-job as a subcommand.
> > 
> > What does this have to do with "git", though?  IOW, why does this
> > have to be part of Git, so that those who would benefit from having
> > a mechanism that makes it easy to run regular maintenance tasks but
> > are not Git users (or those that want to do such maintenance tasks
> > that are not necessarily tied to "git") must use "git" to do so?

I also agree with Junio,
I don't think Git should be responsible to be a scheduler.
It's the job of either tranditional crontab, at on *nix, or scheduler
on Windows.

> That's a reasonable reaction. The short version of my reasoning is that
> many many people _use_ Git but are not Git experts. While a Git expert
> could find the right set of commands to run and at what frequency to
> keep their repo clean, most users do not want to spend time learning
> these commands. It's also worth our time as contributors to select what

And now, people will need to learn _both_ Git existing maintainance
command, and new scheduler (Do I understand it right?, I haven't go
through all patches)

Yes, it could be a setup it once and forget, but,
if there's a problem with their local repo, they will scratch
their head to understand what wrong with them.

It's easier to destroy their repo, and it's harder to know what
operation is running in their computer.

-- 
Danh

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-06 14:42       ` Derrick Stolee
@ 2020-04-07  0:58         ` Danh Doan
  2020-04-07 10:54           ` Derrick Stolee
  0 siblings, 1 reply; 57+ messages in thread
From: Danh Doan @ 2020-04-07  0:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Phillip Wood, Derrick Stolee via GitGitGadget,
	git, peff, jrnieder, Derrick Stolee

On 2020-04-06 10:42:23-0400, Derrick Stolee <stolee@gmail.com> wrote:
> Of course, not every platform has "cron" but that just means we need a
> cross-platform way to launch Git processes on some schedule. That could
> be a command that creates a cron job on platforms that have it, and on

There's Unix system that doesn't have cron.
People could use other scheduler mechanism.

A lot of systemd users uses systemd-timer.
I'm using snooze.
Each of those set of utilities have different grammar and
configuration.

> Windows it could create a scheduled task instead.
> 
> But what should we launch? It should probably be a Git command that
> checks config for a list of repositories, then runs "the maintenance
> command" on each of those repos.
> 
> I'm inserting a break here to draw the eye to a new proposed design:
> 
> ---
> 
> Create a "git maintenance" builtin. This has a few subcommands:
> 
> 1. "run" will run the configured maintenance on the current repo. This
>    should become the single entry point for users to say "please clean
>    up my repo." What _exactly_ it does can be altered with config. I'll
>    list some possibilities after listing the subcommands.
> 
> 2. "run-on-repos" uses command-line arguments or config to launch "git
>    -C <dir> maintenance run" for all configured directories. The
>    intention is that this is launched on some schedule by a platform-
>    specific scheduling mechanism (i.e. cron).

So, IIUC, Git will have a _hard_ dependencies on cron on *nix?
Else, we're gonna received a bug-report that some tools doesn't work?

I've seen some bug report in our distro that "git add -p" doesn't work
like documented, because it's in "git-perl" packages.
When we merge "git-perl" back to git, other people (who never use
"git add -p" and git-sendemail) complain why does we add a hard dependencies
on perl to git.

>    (This subcommand could use a better name.)
> 
> 3. "schedule" adds the current repository to the configured list of
>    repositories for running with "run-on-repos". It will also initialize
>    the platform-specific scheduling mechanism. This may be to start the
>    schedule for the first time OR to update how frequent "run-on-repos"
>    is run, as appropriate.
> 
> 4. (OPTIONAL) "mode <mode>" adjusts the config for the current repo to
>    change the type of maintenance requested for this repo. For example,
>    "simple" could just run "git gc --auto" using a normal range.
>    "incremental" could run the maintenance tasks from this series.
>    Finally, "server" could run maintenance tasks as if we are serving
>    the repo to others, so we repack aggressively with full bitmaps, and
>    more frequently.

-- 
Danh

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-04  0:16   ` Derrick Stolee
  2020-04-07  0:50     ` Danh Doan
@ 2020-04-07  1:48     ` brian m. carlson
  2020-04-07 20:08       ` Junio C Hamano
  2020-04-07 22:23       ` Johannes Schindelin
  1 sibling, 2 replies; 57+ messages in thread
From: brian m. carlson @ 2020-04-07  1:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 3571 bytes --]

On 2020-04-04 at 00:16:21, Derrick Stolee wrote:
> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > 
> >>  * git run-job <job-name>: This builtin will run a single instance of a
> >>    maintenance job.
> >> 
> >>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> >>    that executes git run-job as a subcommand.
> > 
> > What does this have to do with "git", though?  IOW, why does this
> > have to be part of Git, so that those who would benefit from having
> > a mechanism that makes it easy to run regular maintenance tasks but
> > are not Git users (or those that want to do such maintenance tasks
> > that are not necessarily tied to "git") must use "git" to do so?
> > 
> > I'll find out later why it is so after reading thru 15 patches
> > myself, so no need to give a quick answer to the above; it was just
> > my knee-jerk reaction.
> 
> That's a reasonable reaction. The short version of my reasoning is that
> many many people _use_ Git but are not Git experts. While a Git expert
> could find the right set of commands to run and at what frequency to
> keep their repo clean, most users do not want to spend time learning
> these commands. It's also worth our time as contributors to select what
> a good set of non-intrusive maintenance tasks could be, and make them
> easily accessible to users.
> 
> This series gets us half of the way there: a user interested in doing
> background maintenance could figure out how to launch "git run-job" on
> a schedule for their platform, or to launch "git job-runner" at start-
> up. That's a lot simpler than learning how the commit-graph,
> multi-pack-index, prune-packed, pack-objects, and fetch builtins work
> with the complicated sets of arguments.

If there are periodic tasks that should be done, even if only on large
repos, then let's have a git gc --periodic that does them.  I'm not sure
that fetch should be in that set, but nothing prevents users from doing
"git fetch origin && git gc --periodic".  Let's make it as simple and
straightforward as possible.

As for handling multiple repositories, the tool to do that could be as
simple as a shell script which reads from ~/.config/git/repo-maintenance
(or whatever) and runs the same command on all of the repos it finds
there, possibly with a subcommand to add and remove repos.

> The second half would be to create a command such as
> 
> 	git please-run-maintenance-on-this-repo
> 
> that initializes the background jobs and enables them on the repo they
> are using. This allows the most casual of Git user to work efficiently
> on very large repositories.

I'm not opposed to seeing a tool that can schedule periodic maintenance
jobs, perhaps in contrib, depending on whether other people think it
should go.  However, I think running periodic jobs is best handled on
Unix with cron or anacron and not a custom tool or a command in Git.

I've dealt with systems that implemented periodic tasks without using
the existing tools for doing that, and I've found that usually that's a
mistake.  Despite seeming straightforward, there are a lot of tricky
edge cases to deal with and it's easy to get wrong.

We also don't have to reimplement all the features in the system
scheduler and can let expert users use a different tool of their choice
instead if cron (or the Windows equivalent) is not to their liking.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-07  0:58         ` Danh Doan
@ 2020-04-07 10:54           ` Derrick Stolee
  2020-04-07 14:16             ` Danh Doan
  0 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee @ 2020-04-07 10:54 UTC (permalink / raw)
  To: Danh Doan
  Cc: Junio C Hamano, Phillip Wood, Derrick Stolee via GitGitGadget,
	git, peff, jrnieder, Derrick Stolee

On 4/6/2020 8:58 PM, Danh Doan wrote:
> On 2020-04-06 10:42:23-0400, Derrick Stolee <stolee@gmail.com> wrote:
>> Of course, not every platform has "cron" but that just means we need a
>> cross-platform way to launch Git processes on some schedule. That could
>> be a command that creates a cron job on platforms that have it, and on
> 
> There's Unix system that doesn't have cron.
> People could use other scheduler mechanism.
> 
> A lot of systemd users uses systemd-timer.
> I'm using snooze.

Thanks for listing some alternatives. I'll look into these.

> Each of those set of utilities have different grammar and
> configuration.
> 
>> Windows it could create a scheduled task instead.

>> 2. "run-on-repos" uses command-line arguments or config to launch "git
>>    -C <dir> maintenance run" for all configured directories. The
>>    intention is that this is launched on some schedule by a platform-
>>    specific scheduling mechanism (i.e. cron).
> 
> So, IIUC, Git will have a _hard_ dependencies on cron on *nix?
> Else, we're gonna received a bug-report that some tools doesn't work?

No. Such a dependency would be unacceptable. I'm just using cron
as an example when available.
 
> I've seen some bug report in our distro that "git add -p" doesn't work
> like documented, because it's in "git-perl" packages.
> When we merge "git-perl" back to git, other people (who never use
> "git add -p" and git-sendemail) complain why does we add a hard dependencies
> on perl to git.

Good news: "git add -p" is becoming a builtin with a lot of work by
some determined contributors.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-07  0:50     ` Danh Doan
@ 2020-04-07 10:59       ` Derrick Stolee
  2020-04-07 14:26         ` Danh Doan
  0 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee @ 2020-04-07 10:59 UTC (permalink / raw)
  To: Danh Doan
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

On 4/6/2020 8:50 PM, Danh Doan wrote:
> On 2020-04-03 20:16:21-0400, Derrick Stolee <stolee@gmail.com> wrote:
>> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>>  * git run-job <job-name>: This builtin will run a single instance of a
>>>>    maintenance job.
>>>>    
>>>>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
>>>>    that executes git run-job as a subcommand.
>>>
>>> What does this have to do with "git", though?  IOW, why does this
>>> have to be part of Git, so that those who would benefit from having
>>> a mechanism that makes it easy to run regular maintenance tasks but
>>> are not Git users (or those that want to do such maintenance tasks
>>> that are not necessarily tied to "git") must use "git" to do so?
> 
> I also agree with Junio,
> I don't think Git should be responsible to be a scheduler.
> It's the job of either tranditional crontab, at on *nix, or scheduler
> on Windows.
> 
>> That's a reasonable reaction. The short version of my reasoning is that
>> many many people _use_ Git but are not Git experts. While a Git expert
>> could find the right set of commands to run and at what frequency to
>> keep their repo clean, most users do not want to spend time learning
>> these commands. It's also worth our time as contributors to select what
> 
> And now, people will need to learn _both_ Git existing maintainance
> command, and new scheduler (Do I understand it right?, I haven't go
> through all patches)

The point is that they would not need to learn the existing commands.
They could accept the community's "best practices" by running the
simple command to start background maintenance.

In an "enterprise" environment, the users would not even need to learn
the command in the first place, because the engineering tools team
could configure the maintenance tools using the necessary setup scripts
to built the repo.

> Yes, it could be a setup it once and forget, but,
> if there's a problem with their local repo, they will scratch
> their head to understand what wrong with them.
> 
> It's easier to destroy their repo, and it's harder to know what
> operation is running in their computer.

That's why we need to be careful. Luckily, these steps have been
tested in the wild for over a year with great success (as part of
VFS for Git).

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-07 10:54           ` Derrick Stolee
@ 2020-04-07 14:16             ` Danh Doan
  2020-04-07 14:30               ` Johannes Schindelin
  0 siblings, 1 reply; 57+ messages in thread
From: Danh Doan @ 2020-04-07 14:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Phillip Wood, Derrick Stolee via GitGitGadget,
	git, peff, jrnieder, Derrick Stolee

On 2020-04-07 06:54:33-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/6/2020 8:58 PM, Danh Doan wrote:
> > On 2020-04-06 10:42:23-0400, Derrick Stolee <stolee@gmail.com> wrote:
> >> Of course, not every platform has "cron" but that just means we need a
> >> cross-platform way to launch Git processes on some schedule. That could
> >> be a command that creates a cron job on platforms that have it, and on
> > 
> > There's Unix system that doesn't have cron.
> > People could use other scheduler mechanism.
> > 
> > A lot of systemd users uses systemd-timer.
> > I'm using snooze.
> 
> Thanks for listing some alternatives. I'll look into these.

I didn't mean to list those alternatives as only possible
alternatives.

The point is people have their own preference to choose a scheduler
that suites their need.

Someone could use their own supervisor system with things like:

	#/bin/sh

	sleep 3600 # 1 hour
	exec git cmd

When "git cmd" exit, the supervisor will start the job again (because
it's down and it needs to be run).

> > Each of those set of utilities have different grammar and
> > configuration.
> > 
> >> Windows it could create a scheduled task instead.
> 
> >> 2. "run-on-repos" uses command-line arguments or config to launch "git
> >>    -C <dir> maintenance run" for all configured directories. The
> >>    intention is that this is launched on some schedule by a platform-
> >>    specific scheduling mechanism (i.e. cron).
> > 
> > So, IIUC, Git will have a _hard_ dependencies on cron on *nix?
> > Else, we're gonna received a bug-report that some tools doesn't work?
> 
> No. Such a dependency would be unacceptable. I'm just using cron
> as an example when available.

That will be too many possible solutions out there,
I'm still not convinced on adding a scheduler to Git.

> > I've seen some bug report in our distro that "git add -p" doesn't work
> > like documented, because it's in "git-perl" packages.
> > When we merge "git-perl" back to git, other people (who never use
> > "git add -p" and git-sendemail) complain why does we add a hard dependencies
> > on perl to git.
> 
> Good news: "git add -p" is becoming a builtin with a lot of work by
> some determined contributors.

Yeah, I knew it. t3701.{44,46} is also fixed with the builtin.
But, it will be some version into the future to be enabled by default.

The point is there're people that don't want to see a new hard
dependencies for Git.

-- 
Danh

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-07 10:59       ` Derrick Stolee
@ 2020-04-07 14:26         ` Danh Doan
  2020-04-07 14:43           ` Johannes Schindelin
  0 siblings, 1 reply; 57+ messages in thread
From: Danh Doan @ 2020-04-07 14:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

On 2020-04-07 06:59:08-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/6/2020 8:50 PM, Danh Doan wrote:
> > On 2020-04-03 20:16:21-0400, Derrick Stolee <stolee@gmail.com> wrote:
> >> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> >>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>>
> >>>>  * git run-job <job-name>: This builtin will run a single instance of a
> >>>>    maintenance job.
> >>>>    
> >>>>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> >>>>    that executes git run-job as a subcommand.
> >>>
> >>> What does this have to do with "git", though?  IOW, why does this
> >>> have to be part of Git, so that those who would benefit from having
> >>> a mechanism that makes it easy to run regular maintenance tasks but
> >>> are not Git users (or those that want to do such maintenance tasks
> >>> that are not necessarily tied to "git") must use "git" to do so?
> > 
> > I also agree with Junio,
> > I don't think Git should be responsible to be a scheduler.
> > It's the job of either tranditional crontab, at on *nix, or scheduler
> > on Windows.
> > 
> >> That's a reasonable reaction. The short version of my reasoning is that
> >> many many people _use_ Git but are not Git experts. While a Git expert
> >> could find the right set of commands to run and at what frequency to
> >> keep their repo clean, most users do not want to spend time learning
> >> these commands. It's also worth our time as contributors to select what
> > 
> > And now, people will need to learn _both_ Git existing maintainance
> > command, and new scheduler (Do I understand it right?, I haven't go
> > through all patches)
> 
> The point is that they would not need to learn the existing commands.
> They could accept the community's "best practices" by running the
> simple command to start background maintenance.

We could provide some "best practices" by an FAQ.
People can refer to it for "best practices" and run their favourite
choice of scheduler.

> In an "enterprise" environment, the users would not even need to learn
> the command in the first place, because the engineering tools team
> could configure the maintenance tools using the necessary setup scripts
> to built the repo.

In that "enterprise" environment, if the engineering tools team could
configure the maintainance tools using the command that introduced
together with this series, that very engineering tools team could
configure the scheduler to run required Git command, or create their
own wrappers. In such "enterprise" environment, most of computers'
software set are configured to be installed, the engineering tools
team know which software're installed in which system, they should
know which set of scheduler should be run, it should be simple for
them to configure their system.

> > Yes, it could be a setup it once and forget, but,
> > if there's a problem with their local repo, they will scratch
> > their head to understand what wrong with them.
> > 
> > It's easier to destroy their repo, and it's harder to know what
> > operation is running in their computer.
> 
> That's why we need to be careful. Luckily, these steps have been
> tested in the wild for over a year with great success (as part of
> VFS for Git).

No offence but I find this quote could be applied:

There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. - Tony Hoare.

Adding this set of commands to Git gonna made Git over-complicated,
IMHO.

-- 
Danh

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/15] run-job: create barebones builtin
  2020-04-07 14:16             ` Danh Doan
@ 2020-04-07 14:30               ` Johannes Schindelin
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Schindelin @ 2020-04-07 14:30 UTC (permalink / raw)
  To: Danh Doan
  Cc: Derrick Stolee, Junio C Hamano, Phillip Wood,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

Hi,

On Tue, 7 Apr 2020, Danh Doan wrote:

> On 2020-04-07 06:54:33-0400, Derrick Stolee <stolee@gmail.com> wrote:
> > On 4/6/2020 8:58 PM, Danh Doan wrote:
> > > On 2020-04-06 10:42:23-0400, Derrick Stolee <stolee@gmail.com> wrote:
> > >> Of course, not every platform has "cron" but that just means we need a
> > >> cross-platform way to launch Git processes on some schedule. That could
> > >> be a command that creates a cron job on platforms that have it, and on
> > >
> > > There's Unix system that doesn't have cron.
> > > People could use other scheduler mechanism.
> > >
> > > A lot of systemd users uses systemd-timer.
> > > I'm using snooze.
> >
> > Thanks for listing some alternatives. I'll look into these.
>
> I didn't mean to list those alternatives as only possible
> alternatives.

In contrast, I think that they are _really_ alternatives, and they are
only options for people who are dedicated fans of fiddling with the
technical details.

In other words, `cron` is a very viable option for a few people who are
_not_ in the audience of this here patch series.

The audience of this patch series are software engineers who _have_ to use
Git, but do not really want to spend their time learning about the
internal details. For those developers, especially those working on
insanely large repositories, we want to provide some convenient functions
(much like `git gc --auto` tries to help developers who do not want to
bother with Git details, but _better_ because it tries very much to stay
_out_ of the way of the developer, which `git gc --auto` distinctly does
_not_) that were developed using the experience with the world's largest
repository.

> The point is people have their own preference to choose a scheduler
> that suites their need.

And they can!

But again, this here patch series is obviously for those who do not want
to tinker with Git's functionality, yet still want to have decent
performance.

Learning from the experience that led to the invention of `git gc --auto`,
there is _a large_ benefit in doing this: `git gc --auto` was invented
because some prolific Linux contributors were experiencing abysmal
performance because they did not want to spend time learning how to
keep their repositories in a good shape, but rather they wanted to spend
time developing Linux kernel code.

> Someone could use their own supervisor system with things like:
>
> 	#/bin/sh
>
> 	sleep 3600 # 1 hour
> 	exec git cmd
>
> When "git cmd" exit, the supervisor will start the job again (because
> it's down and it needs to be run).

Sure. And that would work for developers who are interested in how the
world works, and have enough time to learn about this.

In my experience the vast majority of enterprise software developers are
not really as excited about Git internals as I am. This patch series is
for them. Because they are good people, too, and deserve our care in
supporting them.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-07 14:26         ` Danh Doan
@ 2020-04-07 14:43           ` Johannes Schindelin
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Schindelin @ 2020-04-07 14:43 UTC (permalink / raw)
  To: Danh Doan
  Cc: Derrick Stolee, Junio C Hamano, Derrick Stolee via GitGitGadget,
	git, peff, jrnieder, Derrick Stolee

Hi Danh,

On Tue, 7 Apr 2020, Danh Doan wrote:

> Adding this set of commands to Git gonna made Git over-complicated,
> IMHO.

Let me counter that by The Tale of the Green Button: while it is not
completely historically accurate, it is a good illustration of focusing on
What Matters Most:

Once upon a time, Xerox built really nice copy machines that were very
good and did a lot of useful things such as making one or more copies of
one or more pages at a certain set of paper sizes with a certain set of
paper criteria (heavy paper, shiny paper, double-sided copies) etc.

And the tale goes: nobody used it. Then a usability expert came in,
interviewed the users and figured out that it was too complicated to use.
The result was the big green button with which you can make one copy using
the standard paper size and the default paper.

This set of commands that you are complaining about is intended to be that
Big Green Button.

Also, maybe we should not talk too loudly about "making" Git
over-complicated. If you care to have a look at
https://git-man-page-generator.lokaltog.net/ and find yourself being very
much reminded of the current Git User Interface's complexity, you might
agree that we should probably try to make _using_ Git less complicated.

Even if it means adding new commands. Such as `git restore` and `git
switch`. And, yes, like `git job-runner` (or whatever we end up calling
it, I do agree with Junio that `git maintenance` is a nice name for its
intended purpose).

Don't just believe me. I invite you to interview the software engineers
developing the Windows Operating System,. Or for that matter, _any_
software engineers working on projects substantially larger than git.git.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-07  1:48     ` brian m. carlson
@ 2020-04-07 20:08       ` Junio C Hamano
  2020-04-07 22:23       ` Johannes Schindelin
  1 sibling, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-04-07 20:08 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> If there are periodic tasks that should be done, even if only on large
> repos, then let's have a git gc --periodic that does them.  I'm not sure
> that fetch should be in that set, but nothing prevents users from doing
> "git fetch origin && git gc --periodic".  Let's make it as simple and
> straightforward as possible.

Yeah, thanks for bringing up "git gc" here.  Earlier I mumbled
sometrhing about "git repack", which seemed to have needlessly
muddied the discussion, but "gc" is exactly the "we already have a
wrapper the users are familiar with---why not extend it" thing.

> I'm not opposed to seeing a tool that can schedule periodic maintenance
> jobs, perhaps in contrib, depending on whether other people think it
> should go.  However, I think running periodic jobs is best handled on
> Unix with cron or anacron and not a custom tool or a command in Git.
>
> I've dealt with systems that implemented periodic tasks without using
> the existing tools for doing that, and I've found that usually that's a
> mistake.  Despite seeming straightforward, there are a lot of tricky
> edge cases to deal with and it's easy to get wrong.
>
> We also don't have to reimplement all the features in the system
> scheduler and can let expert users use a different tool of their choice
> instead if cron (or the Windows equivalent) is not to their liking.

;-)  You said it much better than I would have myself.  Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-07  1:48     ` brian m. carlson
  2020-04-07 20:08       ` Junio C Hamano
@ 2020-04-07 22:23       ` Johannes Schindelin
  2020-04-08  0:01         ` brian m. carlson
  1 sibling, 1 reply; 57+ messages in thread
From: Johannes Schindelin @ 2020-04-07 22:23 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Derrick Stolee, Junio C Hamano, Derrick Stolee via GitGitGadget,
	git, peff, jrnieder, Derrick Stolee

Hi brian,

On Tue, 7 Apr 2020, brian m. carlson wrote:

> On 2020-04-04 at 00:16:21, Derrick Stolee wrote:
> > On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> > > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >
> > >>  * git run-job <job-name>: This builtin will run a single instance of a
> > >>    maintenance job.
> > >>
> > >>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> > >>    that executes git run-job as a subcommand.
> > >
> > > What does this have to do with "git", though?  IOW, why does this
> > > have to be part of Git, so that those who would benefit from having
> > > a mechanism that makes it easy to run regular maintenance tasks but
> > > are not Git users (or those that want to do such maintenance tasks
> > > that are not necessarily tied to "git") must use "git" to do so?
> > >
> > > I'll find out later why it is so after reading thru 15 patches
> > > myself, so no need to give a quick answer to the above; it was just
> > > my knee-jerk reaction.
> >
> > That's a reasonable reaction. The short version of my reasoning is that
> > many many people _use_ Git but are not Git experts. While a Git expert
> > could find the right set of commands to run and at what frequency to
> > keep their repo clean, most users do not want to spend time learning
> > these commands. It's also worth our time as contributors to select what
> > a good set of non-intrusive maintenance tasks could be, and make them
> > easily accessible to users.
> >
> > This series gets us half of the way there: a user interested in doing
> > background maintenance could figure out how to launch "git run-job" on
> > a schedule for their platform, or to launch "git job-runner" at start-
> > up. That's a lot simpler than learning how the commit-graph,
> > multi-pack-index, prune-packed, pack-objects, and fetch builtins work
> > with the complicated sets of arguments.
>
> If there are periodic tasks that should be done, even if only on large
> repos, then let's have a git gc --periodic that does them.  I'm not sure
> that fetch should be in that set, but nothing prevents users from doing
> "git fetch origin && git gc --periodic".

Hmm. Who says that maintenance tasks are essentially only `gc`? With
_maaaaaybe_ a `fetch` thrown in?

And about "nothing prevents users from doing ...": while that is true,
_even less_ is preventing those same users from _forgetting_ to run them,
or _not even knowing_ about the need to run them.

I vividly remember how `git gc --auto` was introduced. A Linux developer
too many was complaining about their Git operations becoming slower and
slower over the course of some months.

That is a very illustrative example of what we're trying to accomplish
here: we are dealing with people whose job it is not to become Git
experts, but whose job is to develop software, and it just so happens that
they use Git as version control system. For the most part, they are not
even all that interested in the internals of Git. Strange, I know.

The solution then, and now, is to teach Git to do these things for the
users, without the need to know a specific command and when to run it.

I am not saying that the current state of the design is perfect. What I am
saying is that the goal is very clear, and important, and we must try to
get there: to teach Git to run maintenance for the user, without much in
the way of the user asking for it, but Git knowing when to do what, and
trying mostly to stay out of the users' way.

> Let's make it as simple and straightforward as possible.

I fear that we're running the danger of confusing _two_ meanings of "as
simple and straightforward as possible" here.

From my perspective, Stolee's patches aim to make using Git as simple and
straightforward as possible, and the historical record shows that Scalar
(after which they are modeled) does a pretty good job at that, or is at
least heading in the right direction.

I get the impression, however, that many reviewers here seem to favor the
goal of making the _patches_ as simple and straightforward as possible,
however, at the expense of the original goal. Like, totally sacrificing
the ease of use in return for "just use a shell script" advice.

> As for handling multiple repositories, the tool to do that could be as
> simple as a shell script which reads from ~/.config/git/repo-maintenance
> (or whatever) and runs the same command on all of the repos it finds
> there, possibly with a subcommand to add and remove repos.

Sure, that is flexible.

And it requires a ton of Git expertise to know what to put into those
scripts. And Git updates cannot deliver more value to those scripts.

Which taps right into the above-mentioned focus on _the patches'_
simplicity, as opposed to the original goal of making _using Git_ simpler
and less painful with big repositories.

> > The second half would be to create a command such as
> >
> > 	git please-run-maintenance-on-this-repo
> >
> > that initializes the background jobs and enables them on the repo they
> > are using. This allows the most casual of Git user to work efficiently
> > on very large repositories.
>
> I'm not opposed to seeing a tool that can schedule periodic maintenance
> jobs, perhaps in contrib, depending on whether other people think it
> should go.  However, I think running periodic jobs is best handled on
> Unix with cron or anacron and not a custom tool or a command in Git.

Okay, here is a challenge for you: design this such that the Windows
experience does _not_ feel like a 3rd-class citizen. Go ahead. Yes, there
is a scheduler. Yep, it does not do cron-like things. Precisely: you have
to feed it an XML to make use of the "advanced" features. Yeah, I also
cannot remember what the semantics are regarding missed jobs due to
shutdown cycles. Nope, you cannot rely on the XML being an option, that
would require Windows 10. The list goes on.

Do you see where I am getting at?

It's really asking the user to take on _all_ the burden of the complexity.
The complexity has to live _somewhere_, after all, and if you want to
prevent Git from taking care of it, there are only the users left to hold
the candle.

I'd rather Git take up that responsibility, and lift that burden from the
users' shoulders. It's not like we have a stellar record of usability, so
we might just as well welcome Stolee's effort to improve Git in that
respect.

> I've dealt with systems that implemented periodic tasks without using
> the existing tools for doing that, and I've found that usually that's a
> mistake.  Despite seeming straightforward, there are a lot of tricky
> edge cases to deal with and it's easy to get wrong.

That might be true for general-purpose periodic taks managers.

But that's not what we need in Git, do we? The scope is a _lot_ more
reduced.

But maybe you found one of those issues in Stolee's patches? If so, please
do contribute your experience there to point out those issues, so that
they can be addressed.

> We also don't have to reimplement all the features in the system
> scheduler and can let expert users use a different tool of their choice
> instead if cron (or the Windows equivalent) is not to their liking.

Do we really want to start relying on `cron`, when the major platform used
by the target audience (enterprise software engineers who deal with rather
larger repositories than git.git or linux.git) quite obviously _lacks_
support for that?

That would be a decision that would not make sense to me. I mean, why go
to all the lengths of making it easier and less involved to manage large
repositories, and then just slam the door shut into the faces of all of
those developers working on Windows, an operating system with a rather
large market share?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-07 22:23       ` Johannes Schindelin
@ 2020-04-08  0:01         ` brian m. carlson
  2020-05-27 22:39           ` Josh Steadmon
  0 siblings, 1 reply; 57+ messages in thread
From: brian m. carlson @ 2020-04-08  0:01 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Derrick Stolee, Junio C Hamano, Derrick Stolee via GitGitGadget,
	git, peff, jrnieder, Derrick Stolee

[-- Attachment #1: Type: text/plain, Size: 5733 bytes --]

Hey,

On 2020-04-07 at 22:23:43, Johannes Schindelin wrote:
> > If there are periodic tasks that should be done, even if only on large
> > repos, then let's have a git gc --periodic that does them.  I'm not sure
> > that fetch should be in that set, but nothing prevents users from doing
> > "git fetch origin && git gc --periodic".
> 
> Hmm. Who says that maintenance tasks are essentially only `gc`? With
> _maaaaaybe_ a `fetch` thrown in?

What I'm saying is that we have a tool to run maintenance tasks on the
repository.  If we need to perform additional maintenance tasks, let's
put them in the same place as the ones we have now.  I realize "gc" may
become a less accurate name, but oh, well.

> > Let's make it as simple and straightforward as possible.
> 
> I get the impression, however, that many reviewers here seem to favor the
> goal of making the _patches_ as simple and straightforward as possible,
> however, at the expense of the original goal. Like, totally sacrificing
> the ease of use in return for "just use a shell script" advice.

I think we can have both.  They are not mutually exclusive, and I've
proposed a suggestion for both.

> > As for handling multiple repositories, the tool to do that could be as
> > simple as a shell script which reads from ~/.config/git/repo-maintenance
> > (or whatever) and runs the same command on all of the repos it finds
> > there, possibly with a subcommand to add and remove repos.
> 
> Sure, that is flexible.
> 
> And it requires a ton of Git expertise to know what to put into those
> scripts. And Git updates cannot deliver more value to those scripts.

Perhaps I was unclear what I thought could be the design of this.  My
proposal is something like the following:

  git schedule-gc add [--period=TIME] [--fetch=REMOTE | --fetch-all] REPO
  git schedule-gc remove REPO

The actual command invoked by the system scheduler would be something
like the following:

  git schedule-gc run

It would work as I proposed under the hood, but it would be relatively
straightforward to use.

> > I'm not opposed to seeing a tool that can schedule periodic maintenance
> > jobs, perhaps in contrib, depending on whether other people think it
> > should go.  However, I think running periodic jobs is best handled on
> > Unix with cron or anacron and not a custom tool or a command in Git.
> 
> Okay, here is a challenge for you: design this such that the Windows
> experience does _not_ feel like a 3rd-class citizen. Go ahead. Yes, there
> is a scheduler. Yep, it does not do cron-like things. Precisely: you have
> to feed it an XML to make use of the "advanced" features. Yeah, I also
> cannot remember what the semantics are regarding missed jobs due to
> shutdown cycles. Nope, you cannot rely on the XML being an option, that
> would require Windows 10. The list goes on.

I will freely admit that I know next to nothing about Windows.  I have
used it only incidentally, if at all, for at least two decades.  It is
not a platform I generally have an interest in developing for, although
I try to make it work as well as possible when I am working on a project
which supports it.

It is, in general, my assumption, based on its wide usage, that it is a
powerful and robust operating system with many features, but I have
little actual knowledge about how it functions or the exact features it
provides.

I want a solution that builds on the existing Unix tools for Unix,
because that is least surprising to users and it is how Unix tools are
supposed to work.  I think we can agree that Git was designed with the
Unix philosophy in mind.

I also want a solution that works on Windows.  Ideally that solution
would build on existing components that are part of Windows, because it
reduces the maintenance burden on all of us.  But unfortunately, I know
next to nothing about how to build such a solution.

> > I've dealt with systems that implemented periodic tasks without using
> > the existing tools for doing that, and I've found that usually that's a
> > mistake.  Despite seeming straightforward, there are a lot of tricky
> > edge cases to deal with and it's easy to get wrong.
> 
> But maybe you found one of those issues in Stolee's patches? If so, please
> do contribute your experience there to point out those issues, so that
> they can be addressed.

One of the benefits of using anacron on Unix is that it can skip running
tasks when the user is on battery.  This is not anything we can portably
do across systems, nor is it something that Git should need to know
about.

> > We also don't have to reimplement all the features in the system
> > scheduler and can let expert users use a different tool of their choice
> > instead if cron (or the Windows equivalent) is not to their liking.
> 
> Do we really want to start relying on `cron`, when the major platform used
> by the target audience (enterprise software engineers who deal with rather
> larger repositories than git.git or linux.git) quite obviously _lacks_
> support for that?

Unix users will be unhappy with us if we use our own scheduling system
when cron is available.  They will expect us to reimplement those
features and they will complain if we do not.  While I cannot name
names, there are a nontrivial number of large, enterprise monorepos that
run only on macOS and Linux.

That doesn't prevent us from building tooling that does the scheduling
on Windows if we can't use the system scheduler, but it would be nice to
try to present a relatively unified interface across the two platforms.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02/15] run-job: implement commit-graph job
  2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
@ 2020-05-20 19:08   ` Josh Steadmon
  0 siblings, 0 replies; 57+ messages in thread
From: Josh Steadmon @ 2020-05-20 19:08 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

Thanks for this series, and sorry for the delayed (and partial) review.
I hope comments here are still useful. I don't have any comments on the
larger design that other people haven't already made, but I do have a
few small questions and suggestions, inline below.


On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The first job to implement in the 'git run-job' command is the
> 'commit-graph' job. It is based on the sequence of events in the
> 'commit-graph' job in Scalar [1]. This sequence is as follows:
> 
> 1. git commit-graph write --reachable --split
> 2. git commit-graph verify --shallow
> 3. If the verify succeeds, stop.
> 4. Delete the commit-graph-chain file.
> 5. git commit-graph write --reachable --split
> 
> By writing an incremental commit-graph file using the "--split"
> option we minimize the disruption from this operation. The default
> behavior is to merge layers until the new "top" layer is less than
> half the size of the layer below. This provides quick writes "most"
> of the time, with the longer writes following a power law
> distribution.
> 
> Most importantly, concurrent Git processes only look at the
> commit-graph-chain file for a very short amount of time, so they
> will verly likely not be holding a handle to the file when we try
> to replace it. (This only matters on Windows.)
> 
> If a concurrent process reads the old commit-graph-chain file, but
> our job expires some of the .graph files before they can be read,
> then those processes will see a warning message (but not fail).
> This could be avoided by a future update to use the --expire-time
> argument when writing the commit-graph.
> 
> By using 'git commit-graph verify --shallow' we can ensure that
> the file we just wrote is valid. This is an extra safety precaution
> that is faster than our 'write' subcommand. In the rare situation
> that the newest layer of the commit-graph is corrupt, we can "fix"
> the corruption by deleting the commit-graph-chain file and rewrite
> the full commit-graph as a new one-layer commit graph. This does
> not completely prevent _that_ file from being corrupt, but it does
> recompute the commit-graph by parsing commits from the object
> database. In our use of this step in Scalar and VFS for Git, we
> have only seen this issue arise because our microsoft/git fork
> reverted 43d3561 ("commit-graph write: don't die if the existing
> graph is corrupt" 2019-03-25) for a while to keep commit-graph
> writes very fast. We dropped the revert when updating to v2.23.0.
> The verify still has potential for catching corrupt data across
> the layer boundary: if the new file has commit X with parent Y
> in an old file but the commit ID for Y in the old file had a
> bitswap, then we will notice that in the 'verify' command.
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs
> 
> RFC QUESTIONS:
> 
> 1. Are links like [1] helpful?

Yes, I appreciate being able to reference these. However, given that
these will show up in commit logs for ~forever, and ongoing work might
happen on Scalar, perhaps the links should be pinned to a specific
revision?


> 
> 2. Can anyone think of a way to test the rewrite fallback?
>    It requires corrupting the latest file between two steps of
>    this one command, so that is a tricky spot to inject a fault.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 21 ++++++++++--
>  builtin/run-job.c             | 60 ++++++++++++++++++++++++++++++++++-
>  commit-graph.c                |  2 +-
>  commit-graph.h                |  1 +
>  t/t7900-run-job.sh            | 32 +++++++++++++++++++
>  5 files changed, 112 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 0627b3ed259..8bf2762d650 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job <job-name>'
> +'git run-job commit-graph'
>  
>  
>  DESCRIPTION
> @@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE.
>  JOBS
>  ----
>  
> -TBD
> +'commit-graph'::
> +
> +The `commit-graph` job updates the `commit-graph` files incrementally,
> +then verifies that the written data is correct. If the new layer has an
> +issue, then the chain file is removed and the `commit-graph` is
> +rewritten from scratch.
> ++
> +The verification only checks the top layer of the `commit-graph` chain.
> +If the incremental write merged the new commits with at least one
> +existing layer, then there is potential for on-disk corruption being
> +carried forward into the new file. This will be noticed and the new
> +commit-graph file will be clean as Git reparses the commit data from
> +the object database.
> ++
> +The incremental write is safe to run alongside concurrent Git processes
> +since it will not expire `.graph` files that were in the previous
> +`commit-graph-chain` file. They will be deleted by a later run based on
> +the expiration delay.
>  
>  
>  GIT
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index 2c78d053aa4..dd7709952d3 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,12 +1,65 @@
>  #include "builtin.h"
>  #include "config.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
>  #include "parse-options.h"
> +#include "repository.h"
> +#include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job"),
> +	N_("git run-job commit-graph"),
>  	NULL
>  };
>  
> +static int run_write_commit_graph(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "write",
> +			 "--split", "--reachable", "--no-progress",
> +			 NULL);
> +
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_verify_commit_graph(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "verify",
> +			 "--shallow", "--no-progress", NULL);
> +
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_commit_graph_job(void)
> +{
> +	char *chain_path;
> +
> +	if (run_write_commit_graph()) {
> +		error(_("failed to write commit-graph"));
> +		return 1;
> +	}
> +
> +	if (!run_verify_commit_graph())
> +		return 0;
> +
> +	warning(_("commit-graph verify caught error, rewriting"));
> +
> +	chain_path = get_chain_filename(the_repository->objects->odb);

Should we avoid new uses of `the_repository` and take a repo pointer
argument here instead?


> +	if (unlink(chain_path)) {
> +		UNLEAK(chain_path);
> +		die(_("failed to remove commit-graph at %s"), chain_path);
> +	}
> +	free(chain_path);
> +
> +	if (!run_write_commit_graph())
> +		return 0;

Is there a reason why we don't verify the second write attempt?


> +
> +	error(_("failed to rewrite commit-graph"));
> +	return 1;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			     builtin_run_job_usage,
>  			     PARSE_OPT_KEEP_UNKNOWN);
>  
> +	if (argc > 0) {
> +		if (!strcmp(argv[0], "commit-graph"))
> +			return run_commit_graph_job();
> +	}
> +
>  	usage_with_options(builtin_run_job_usage,
>  			   builtin_run_job_options);
>  }
> diff --git a/commit-graph.c b/commit-graph.c
> index f013a84e294..6867f92d04a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
>  		       oid_hex);
>  }
>  
> -static char *get_chain_filename(struct object_directory *odb)
> +char *get_chain_filename(struct object_directory *odb)
>  {
>  	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
>  }
> diff --git a/commit-graph.h b/commit-graph.h
> index e87a6f63600..8b7bd5a6cb1 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -13,6 +13,7 @@
>  struct commit;
>  
>  char *get_commit_graph_filename(struct object_directory *odb);
> +char *get_chain_filename(struct object_directory *odb);
>  int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
>  
>  /*
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 1eac80b7ed3..18b9bd26b3a 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -5,6 +5,8 @@ test_description='git run-job
>  Testing the background jobs, in the foreground
>  '
>  
> +GIT_TEST_COMMIT_GRAPH=0
> +
>  . ./test-lib.sh
>  
>  test_expect_success 'help text' '
> @@ -12,4 +14,34 @@ test_expect_success 'help text' '
>  	test_i18ngrep "usage: git run-job" err
>  '
>  
> +test_expect_success 'commit-graph job' '
> +	git init server &&
> +	(
> +		cd server &&
> +		chain=.git/objects/info/commit-graphs/commit-graph-chain &&
> +		git checkout -b master &&
> +		for i in $(test_seq 1 15)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 1 $chain &&
> +		for i in $(test_seq 16 19)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 2 $chain &&
> +		for i in $(test_seq 20 23)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 1 $chain
> +	)
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
  2020-04-05 15:14   ` Phillip Wood
  2020-04-05 20:28   ` Junio C Hamano
@ 2020-05-20 19:08   ` Josh Steadmon
  2 siblings, 0 replies; 57+ messages in thread
From: Josh Steadmon @ 2020-05-20 19:08 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.

It might be useful here to clarify the interaction between background
fetching & the expectations around --force-with-lease.


> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>    the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>    refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>    longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> RFC QUESTIONS:
> 
> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations? Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?
> 
> 2. This feature is designed for a desktop machine or equivalent
>    that has a permanent wired network connection, and the machine
>    stays on while the user is not present. For things like laptops
>    with inconsistent WiFi connections (that may be metered) the
>    feature can use the less stable connection more than the user
>    wants. Of course, this feature is opt-in for Git, but in Scalar
>    we have a "scalar pause" command [2] that pauses all maintenance
>    for some amount of time. We should consider a similar mechanism
>    for Git, but for the point of this series the user needs to set
>    up the "background" part of these jobs manually.
> 
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 13 ++++-
>  builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
>  t/t7900-run-job.sh            | 22 +++++++++
>  3 files changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>  
>  
>  DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
>  
>  GIT
>  ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
>  #include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job commit-graph"),
> +	N_("git run-job (commit-graph|fetch)"),
>  	NULL
>  };
>  
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
>  	return 1;
>  }
>  
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)
> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));

Shouldn't remote_proc just go on the stack here rather than getting
malloced?

> +
> +	child_process_init(remote_proc);
> +
> +	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> +	remote_proc->out = -1;
> +
> +	if (start_command(remote_proc)) {
> +		error(_("failed to start 'git remote' process"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(remote_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	while (!strbuf_getline(&line, proc_out))
> +		string_list_append(remotes, line.buf);
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	if (finish_command(remote_proc)) {
> +		error(_("failed to finish 'git remote' process"));
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(remote_proc);
> +	return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (fill_remotes(&remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  	if (argc > 0) {
>  		if (!strcmp(argv[0], "commit-graph"))
>  			return run_commit_graph_job();
> +		if (!strcmp(argv[0], "fetch"))
> +			return run_fetch_job();
>  	}
>  
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
>  	)
>  '
>  
> +test_expect_success 'fetch job' '
> +	git clone "file://$(pwd)/server" client &&
> +
> +	# Before fetching, build a client commit-graph
> +	git -C client run-job commit-graph &&
> +	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> +	test_line_count = 1 $chain &&
> +
> +	git -C client branch -v --remotes >before-refs &&
> +	test_commit -C server 24 &&
> +
> +	git -C client run-job fetch &&
> +	git -C client branch -v --remotes >after-refs &&
> +	test_cmp before-refs after-refs &&
> +	test_cmp server/.git/refs/heads/master \
> +		 client/.git/refs/hidden/origin/master &&
> +
> +	# the hidden ref should trigger a new layer in the commit-graph
> +	git -C client run-job commit-graph &&
> +	test_line_count = 2 $chain
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-28  0:47             ` Junio C Hamano
@ 2020-05-27 21:52               ` Johannes Schindelin
  2020-05-28 14:48                 ` Junio C Hamano
  2020-05-28 14:50                 ` Jonathan Nieder
  0 siblings, 2 replies; 57+ messages in thread
From: Johannes Schindelin @ 2020-05-27 21:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Josh Steadmon, brian m. carlson, Derrick Stolee,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

Hi Junio,

On Wed, 27 May 2020, Junio C Hamano wrote:

> Josh Steadmon <steadmon@google.com> writes:
>
> > Regardless of what happens with the job-runner, I would like to see a
> > top-level command that performs a single iteration of all the
> > recommended maintenance steps, with zero configuration required, on a
> > single repo. This gives an entry point for users who want to manage
> > their own maintenance schedule without running a background process.
> > ...
> >> Unix users will be unhappy with us if we use our own scheduling system
> >> when cron is available.  They will expect us to reimplement those
> >> features and they will complain if we do not.  While I cannot name
> >> names, there are a nontrivial number of large, enterprise monorepos that
> >> run only on macOS and Linux.
> >
> > Speaking purely as a user, I agree with this point. This is why I want a
> > single-iteration top-level maintenance command.
>
> Yes, well said.
>
> It exactly is what "git gc" was meant to be.  To put it differently,
> if you asked any non-novice end-user if there is one single command
> that s/he would use to keep a repository healthy, it is very likely
> that the answer would be "git gc".

The biggest problem with bringing up `git gc` in this context (and it is
actually a quite big problem) is that "gc" stands for "garbage
collection", and these maintenance tasks are not even close to being about
collecting garbage.

So while `git gc` looks like a good candidate on its technical merits, the
usability/discoverability point of view paints a very different picture.

What Scalar does is conceptually a _very_ different thing from letting
`git gc --auto` (which I guess is what you _actually_ meant, as I have yet
to meet even a single Git user outside of this mailing list who knows `git
gc`, let alone who runs it manually) determine whether loose objects
should be cleaned up and packs should be consolidated.

Like, pre-fetching a daily pack in preparation for the user fetching
updates. You could argue that this is, in a way, _accumulating_ "garbage".

The entire idea of those maintenance tasks is that they are _not_
triggered by the user, not even as a side effect (`git gc --auto` is very
much a side effect, and on Windows, where it does not `daemonize()`
because that concept does not translate well into the Win32 API, having it
run in the foreground is very much felt by the users).

Those maintenance tasks should stay out of the users' way as much as
possible.

> And having such a single point of entry would be a good thing.

I guess I would argue for the introduction of a new command, like `git
maintenance`, which could potentially trigger a `git gc --auto`, but is
primarily intended to run _outside_ of the users' work hours.

Once we have that, we can always figure out whether there is a convenient
way to register this via `crontab` or Windows Task Scheduler, without
asking the users to do all of these tedious steps manually. I, for one,
have to spend the extra time looking up what those positional numbers in
the `crontab` thing _mean_, _every_ _single_ _time_ I touch the `crontab`.
_Every_ _single_ _time_.

Therefore, I would like very much to have a `git maintenance --schedule`
(or something like that), even if only on Windows, on the grounds alone
that it is even more tedious to work with the Windows Task Scheduler. But
I would prefer to have that also in my Linux setup. The convenience for
the users (myself included) is just too compelling.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/15] run-job: implement pack-files job
  2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
@ 2020-05-27 22:17   ` Josh Steadmon
  0 siblings, 0 replies; 57+ messages in thread
From: Josh Steadmon @ 2020-05-27 22:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The previous change cleaned up loose objects using the
> 'loose-objects' that can be run safely in the background. Add a
> similar job that performs similar cleanups for pack-files.
> 
> One issue with running 'git repack' is that it is designed to
> repack all pack-files into a single pack-file. While this is the
> most space-efficient way to store object data, it is not time or
> memory efficient. This becomes extremely important if the repo is
> so large that a user struggles to store two copies of the pack on
> their disk.
> 
> Instead, perform an "incremental" repack by collecting a few small
> pack-files into a new pack-file. The multi-pack-index facilitates
> this process ever since 'git multi-pack-index expire' was added in
> 19575c7 (multi-pack-index: implement 'expire' subcommand,
> 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
> (midx: implement midx_repack(), 2019-06-10).
> 
> The 'pack-files' job runs the following steps:
> 
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
> 
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."
> 
> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updaes the multi-pack-index to drop those pack-files from the

Typo: updaes -> updates


>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git

Is it still safe for concurrent processes if the repo did not have a
multi-pack-index when the first process started?


>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)
> 
> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set

Typo: bacth-size -> batch-size


>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those objects. The set of pack-
>    files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

Typo: approximats -> approximates


>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.
> 
>    The next run of the pack-files job will delete these repacked
>    pack-files during the 'expire' step.
> 
>    In this version, the batch size is set to "0" which ignores the
>    size restrictions when selecting the pack-files. It instead
>    selects all pack-files and repacks all packed objects into a
>    single pack-file. This will be updated in the next change, but
>    it requires doing some calculations that are better isolated to
>    a separate change.
> 
> Each of the above steps update the multi-pack-index file. After
> each step, we verify the new multi-pack-index. If the new
> multi-pack-index is corrupt, then delete the multi-pack-index,
> rewrite it from scratch, and stop doing the later steps of the
> job. This is intended to be an extra-safe check without leaving
> a repo with many pack-files without a multi-pack-index.
> 
> These steps are based on a similar background maintenance step in
> Scalar (and VFS for Git) [1]. This was incredibly effective for
> users of the Windows OS repository. After using the same VFS for Git
> repository for over a year, some users had _thousands_ of pack-files
> that combined to up to 250 GB of data. We noticed a few users were
> running into the open file descriptor limits (due in part to a bug
> in the multi-pack-index fixed by af96fe3392 (midx: add packs to
> packed_git linked list, 2019-04-29).
> 
> These pack-files were mostly small since they contained the commits
> and trees that were pushed to the origin in a given hour. The GVFS
> protocol includes a "prefetch" step that asks for pre-computed pack-
> files containing commits and trees by timestamp. These pack-files
> were grouped into "daily" pack-files once a day for up to 30 days.
> If a user did not request prefetch packs for over 30 days, then they
> would get the entire history of commits and trees in a new, large
> pack-file. This led to a large number of pack-files that had poor
> delta compression.
> 
> By running this pack-file maintenance step once per day, these repos
> with thousands of packs spanning 200+ GB dropped to dozens of pack-
> files spanning 30-50 GB. This was done all without removing objects
> from the system and using a constant batch size of two gigabytes.
> Once the work was done to reduce the pack-files to small sizes, the
> batch size of two gigabytes means that not every run triggers a
> repack operation, so the following run will not expire a pack-file.
> This has kept these repos in a "clean" state.
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 18 ++++++-
>  builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
>  midx.c                        |  2 +-
>  midx.h                        |  1 +
>  t/t7900-run-job.sh            | 39 +++++++++++++++
>  5 files changed, 147 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 43ca1160b5a..108ed25b8bd 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job (commit-graph|fetch|loose-objects)'
> +'git run-job (commit-graph|fetch|loose-objects|pack-files)'
>  
>  
>  DESCRIPTION
> @@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
>  objects to prevent the job from taking too long on a repository with
>  many loose objects.
>  
> +'pack-files'::
> +
> +The `pack-files` job incrementally repacks the object directory using
> +the `multi-pack-index` feature. In order to prevent race conditions with
> +concurrent Git commands, it follows a two-step process. First, it
> +deletes any pack-files included in the `multi-pack-index` where none of
> +the objects in the `multi-pack-index` reference those pack-files; this
> +only happens if all objects in the pack-file are also stored in a newer
> +pack-file. Second, it selects a group of pack-files whose "expected
> +size" is below the batch size until the group has total expected size at
> +least the batch size; see the `--batch-size` option for the `repack`
> +subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
> +zero, which is a special case that attempts to repack all pack-files
> +into a single pack-file.
> +
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,13 +1,14 @@
>  #include "builtin.h"
>  #include "config.h"
>  #include "commit-graph.h"
> +#include "midx.h"
>  #include "object-store.h"
>  #include "parse-options.h"
>  #include "repository.h"
>  #include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job (commit-graph|fetch|loose-objects)"),
> +	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
>  	NULL
>  };
>  
> @@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
>  	return prune_packed() || pack_loose();
>  }
>  
> +static int multi_pack_index_write(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "write",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int rewrite_multi_pack_index(void)
> +{
> +	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
> +
> +	unlink(midx_name);
> +	free(midx_name);
> +
> +	if (multi_pack_index_write()) {
> +		error(_("failed to rewrite multi-pack-index"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int multi_pack_index_verify(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "verify",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_expire(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "expire",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_repack(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "repack",
> +			 "--no-progress", "--batch-size=0", NULL);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	if (result && multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after repack"));
> +		result = rewrite_multi_pack_index();
> +	}
> +
> +	return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			return run_fetch_job();
>  		if (!strcmp(argv[0], "loose-objects"))
>  			return run_loose_objects_job();
> +		if (!strcmp(argv[0], "pack-files"))
> +			return run_pack_files_job();
>  	}
>  
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/midx.c b/midx.c
> index 1527e464a7b..0f0d0a38812 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -36,7 +36,7 @@
>  
>  #define PACK_EXPIRED UINT_MAX
>  
> -static char *get_midx_filename(const char *object_dir)
> +char *get_midx_filename(const char *object_dir)
>  {
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
> diff --git a/midx.h b/midx.h
> index e6fa356b5ca..cf2c09dffc2 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -39,6 +39,7 @@ struct multi_pack_index {
>  
>  #define MIDX_PROGRESS     (1 << 0)
>  
> +char *get_midx_filename(const char *object_dir);
>  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
>  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 41da083257b..416ba04989d 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
>  '
>  
>  GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
>  
>  . ./test-lib.sh
>  
> @@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
>  	test_cmp packs-between packs-after
>  '
>  
> +test_expect_success 'pack-files job' '
> +	packDir=.git/objects/pack &&
> +
> +	# Create three disjoint pack-files with size BIG, small, small.
> +
> +	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
> +	HEAD~1
> +	^HEAD~2
> +	EOF
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
> +	HEAD
> +	^HEAD~1
> +	EOF
> +
> +	rm -f client/$packDir/pack-* &&
> +	rm -f client/$packDir/loose-* &&
> +
> +	ls client/$packDir/*.pack >packs-before &&
> +	test_line_count = 3 packs-before &&
> +
> +	# the job repacks the two into a new pack, but does not
> +	# delete the old ones.
> +	git -C client run-job pack-files &&
> +	ls client/$packDir/*.pack >packs-between &&
> +	test_line_count = 4 packs-between &&
> +
> +	# the job deletes the two old packs, and does not write
> +	# a new one because only one pack remains.
> +	git -C client run-job pack-files &&
> +	ls client/.git/objects/pack/*.pack >packs-after &&
> +	test_line_count = 1 packs-after
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-04-08  0:01         ` brian m. carlson
@ 2020-05-27 22:39           ` Josh Steadmon
  2020-05-28  0:47             ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Josh Steadmon @ 2020-05-27 22:39 UTC (permalink / raw)
  To: brian m. carlson, Johannes Schindelin, Derrick Stolee,
	Junio C Hamano, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

I'm late to the discussion, but I'd like to chime in here too.


On 2020.04.08 00:01, brian m. carlson wrote:
> Hey,
> 
> On 2020-04-07 at 22:23:43, Johannes Schindelin wrote:
> > > If there are periodic tasks that should be done, even if only on large
> > > repos, then let's have a git gc --periodic that does them.  I'm not sure
> > > that fetch should be in that set, but nothing prevents users from doing
> > > "git fetch origin && git gc --periodic".
> > 
> > Hmm. Who says that maintenance tasks are essentially only `gc`? With
> > _maaaaaybe_ a `fetch` thrown in?
> 
> What I'm saying is that we have a tool to run maintenance tasks on the
> repository.  If we need to perform additional maintenance tasks, let's
> put them in the same place as the ones we have now.  I realize "gc" may
> become a less accurate name, but oh, well.
> 
> > > Let's make it as simple and straightforward as possible.
> > 
> > I get the impression, however, that many reviewers here seem to favor the
> > goal of making the _patches_ as simple and straightforward as possible,
> > however, at the expense of the original goal. Like, totally sacrificing
> > the ease of use in return for "just use a shell script" advice.
> 
> I think we can have both.  They are not mutually exclusive, and I've
> proposed a suggestion for both.
> 
> > > As for handling multiple repositories, the tool to do that could be as
> > > simple as a shell script which reads from ~/.config/git/repo-maintenance
> > > (or whatever) and runs the same command on all of the repos it finds
> > > there, possibly with a subcommand to add and remove repos.
> > 
> > Sure, that is flexible.
> > 
> > And it requires a ton of Git expertise to know what to put into those
> > scripts. And Git updates cannot deliver more value to those scripts.
> 
> Perhaps I was unclear what I thought could be the design of this.  My
> proposal is something like the following:
> 
>   git schedule-gc add [--period=TIME] [--fetch=REMOTE | --fetch-all] REPO
>   git schedule-gc remove REPO
> 
> The actual command invoked by the system scheduler would be something
> like the following:
> 
>   git schedule-gc run
> 
> It would work as I proposed under the hood, but it would be relatively
> straightforward to use.

Regardless of what happens with the job-runner, I would like to see a
top-level command that performs a single iteration of all the
recommended maintenance steps, with zero configuration required, on a
single repo. This gives an entry point for users who want to manage
their own maintenance schedule without running a background process.


> > > I'm not opposed to seeing a tool that can schedule periodic maintenance
> > > jobs, perhaps in contrib, depending on whether other people think it
> > > should go.  However, I think running periodic jobs is best handled on
> > > Unix with cron or anacron and not a custom tool or a command in Git.
> > 
> > Okay, here is a challenge for you: design this such that the Windows
> > experience does _not_ feel like a 3rd-class citizen. Go ahead. Yes, there
> > is a scheduler. Yep, it does not do cron-like things. Precisely: you have
> > to feed it an XML to make use of the "advanced" features. Yeah, I also
> > cannot remember what the semantics are regarding missed jobs due to
> > shutdown cycles. Nope, you cannot rely on the XML being an option, that
> > would require Windows 10. The list goes on.
> 
> I will freely admit that I know next to nothing about Windows.  I have
> used it only incidentally, if at all, for at least two decades.  It is
> not a platform I generally have an interest in developing for, although
> I try to make it work as well as possible when I am working on a project
> which supports it.
> 
> It is, in general, my assumption, based on its wide usage, that it is a
> powerful and robust operating system with many features, but I have
> little actual knowledge about how it functions or the exact features it
> provides.
> 
> I want a solution that builds on the existing Unix tools for Unix,
> because that is least surprising to users and it is how Unix tools are
> supposed to work.  I think we can agree that Git was designed with the
> Unix philosophy in mind.
> 
> I also want a solution that works on Windows.  Ideally that solution
> would build on existing components that are part of Windows, because it
> reduces the maintenance burden on all of us.  But unfortunately, I know
> next to nothing about how to build such a solution.
> 
> > > I've dealt with systems that implemented periodic tasks without using
> > > the existing tools for doing that, and I've found that usually that's a
> > > mistake.  Despite seeming straightforward, there are a lot of tricky
> > > edge cases to deal with and it's easy to get wrong.
> > 
> > But maybe you found one of those issues in Stolee's patches? If so, please
> > do contribute your experience there to point out those issues, so that
> > they can be addressed.
> 
> One of the benefits of using anacron on Unix is that it can skip running
> tasks when the user is on battery.  This is not anything we can portably
> do across systems, nor is it something that Git should need to know
> about.
> 
> > > We also don't have to reimplement all the features in the system
> > > scheduler and can let expert users use a different tool of their choice
> > > instead if cron (or the Windows equivalent) is not to their liking.
> > 
> > Do we really want to start relying on `cron`, when the major platform used
> > by the target audience (enterprise software engineers who deal with rather
> > larger repositories than git.git or linux.git) quite obviously _lacks_
> > support for that?
> 
> Unix users will be unhappy with us if we use our own scheduling system
> when cron is available.  They will expect us to reimplement those
> features and they will complain if we do not.  While I cannot name
> names, there are a nontrivial number of large, enterprise monorepos that
> run only on macOS and Linux.

Speaking purely as a user, I agree with this point. This is why I want a
single-iteration top-level maintenance command.

Once we have that, we can provide recommended configs for existing
scheduling solutions (cron, launchd, systemd, etc.) in contrib/. If the
Windows scheduler is cumbersome enough that users don't want to use it,
then I think it's perfectly reasonable to provide our own limited
job-runner in contrib/ as well, so long as we don't require people to
use it.

> That doesn't prevent us from building tooling that does the scheduling
> on Windows if we can't use the system scheduler, but it would be nice to
> try to present a relatively unified interface across the two platforms.
> -- 
> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-27 22:39           ` Josh Steadmon
@ 2020-05-28  0:47             ` Junio C Hamano
  2020-05-27 21:52               ` Johannes Schindelin
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-05-28  0:47 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: brian m. carlson, Johannes Schindelin, Derrick Stolee,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

Josh Steadmon <steadmon@google.com> writes:

> Regardless of what happens with the job-runner, I would like to see a
> top-level command that performs a single iteration of all the
> recommended maintenance steps, with zero configuration required, on a
> single repo. This gives an entry point for users who want to manage
> their own maintenance schedule without running a background process.
> ...
>> Unix users will be unhappy with us if we use our own scheduling system
>> when cron is available.  They will expect us to reimplement those
>> features and they will complain if we do not.  While I cannot name
>> names, there are a nontrivial number of large, enterprise monorepos that
>> run only on macOS and Linux.
>
> Speaking purely as a user, I agree with this point. This is why I want a
> single-iteration top-level maintenance command.

Yes, well said.  

It exactly is what "git gc" was meant to be.  To put it differently,
if you asked any non-novice end-user if there is one single command
that s/he would use to keep a repository healthy, it is very likely
that the answer would be "git gc".

And having such a single point of entry would be a good thing.

Thanks.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-28 15:30                       ` Derrick Stolee
@ 2020-05-28  4:39                         ` Johannes Schindelin
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Schindelin @ 2020-05-28  4:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jonathan Nieder, Junio C Hamano, Josh Steadmon, brian m. carlson,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

Hi Stolee,

On Thu, 28 May 2020, Derrick Stolee wrote:

> On 5/28/2020 11:03 AM, Jonathan Nieder wrote:
> > Junio C Hamano wrote:
> >> Jonathan Nieder <jrnieder@gmail.com> writes:
> >
> >>>                              The real question is, do we consider the
> >>> existing "git gc" infrastructure such a lost cause that we should
> >>> touch it as little as possible?
> >>
> >> I am fine with that, as long as the "new" thing will take over what
> >> "git gc" currently does.
> >
> > Good reminder, thank you.
> >
> > Yes, as long as we end up replacing the old thing, making a parallel
> > new thing (e.g. with a config option for switching between during a
> > transition period) can be a fine approach.
>
> Thanks for the discussion, everyone. I'm sorry that I'm very late in
> providing a new RFC that takes this approach, but yes I intend to create
> the "single entry point" for maintenance activities, and incorporate
> auto-GC as a default option there.

Great! I look forward to reviewing your next patch series iteration,
whenever you're ready.

Ciao,
Dscho

>
> Something that is a good long-term goal is to have the new maintenance
> entry point replace the "git gc --auto" calls, so we have better
> customization of post-command "automatic" maintenance. This can be done
> without any of the "background" part of my original RFC.
>
> I've just been to busy with other tasks to devote the deep focus that
> this feature deserves. Thanks for your patience.
>
> -Stolee
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-27 21:52               ` Johannes Schindelin
@ 2020-05-28 14:48                 ` Junio C Hamano
  2020-05-28 14:50                 ` Jonathan Nieder
  1 sibling, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-05-28 14:48 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Josh Steadmon, brian m. carlson, Derrick Stolee,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> It exactly is what "git gc" was meant to be.  To put it differently,
>> if you asked any non-novice end-user if there is one single command
>> that s/he would use to keep a repository healthy, it is very likely
>> that the answer would be "git gc".
>
> The biggest problem with bringing up `git gc` in this context (and it is
> actually a quite big problem) is that "gc" stands for ...

Ah, no, I wasn't saying "the single command has to be called 'gc'".
I do not care too much about the name, as it will be hidden scripted
away.

I was responding to Josh's "We want a single point of entry for all
the housekeeping tasks" with "yes---it is not any controversial new
idea---we have had such a single entry point for housekeeping tasks
in 'gc' for a long time and we are receptive of having such a single
command".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-27 21:52               ` Johannes Schindelin
  2020-05-28 14:48                 ` Junio C Hamano
@ 2020-05-28 14:50                 ` Jonathan Nieder
  2020-05-28 14:57                   ` Junio C Hamano
  1 sibling, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2020-05-28 14:50 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Josh Steadmon, brian m. carlson, Derrick Stolee,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

Hi,

Johannes Schindelin wrote:

> So while `git gc` looks like a good candidate on its technical merits, the
> usability/discoverability point of view paints a very different picture.

Here is what I take away from that comment:

    It was a mistake for "git gc" to take responsibility for operations
    like "git pack-refs".  We should migrate to a new "git maintenance
    run" command and make "git gc" print a signpost to help people find
    the new name.

That seems like a reasonable direction to me, but as you're hinting,
from a technical design perspective it doesn't change where e.g. midx
maintenance operations would go.  I would expect the routine
maintenance command, whatever it is called, to perform such
operations.  I would not expect to have to choose between two subtly
different maintenance commands, or in other words to have to pay as a
user for an unrelated naming choice.

I think it's worth discussing naming, but it's kind of a distraction
from what Brian brought up.  The real question is, do we consider the
existing "git gc" infrastructure such a lost cause that we should
touch it as little as possible?

I don't think that was Stolee's intent --- instead, I supsect that it
was just the discoverability issue you mention that led to this series
forgetting about the possibility of incorporating this functionality
into "git gc".

[...]
> I guess I would argue for the introduction of a new command, like `git
> maintenance`, which could potentially trigger a `git gc --auto`, but is
> primarily intended to run _outside_ of the users' work hours.
>
> Once we have that, we can always figure out whether there is a convenient
> way to register this via `crontab` or Windows Task Scheduler, without
> asking the users to do all of these tedious steps manually.

Wonderful, it sounds like we're pretty much in agreement.

[...]
> Therefore, I would like very much to have a `git maintenance --schedule`
> (or something like that), even if only on Windows, on the grounds alone
> that it is even more tedious to work with the Windows Task Scheduler. But
> I would prefer to have that also in my Linux setup. The convenience for
> the users (myself included) is just too compelling.

I think Josh was arguing for deferring that part to a separate series.
Otherwise we could end up spending so much time focusing on scheduling
and service supervision that the series gets delayed and no one
benefits from Stolee's work to improve how routine packing works.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-28 14:50                 ` Jonathan Nieder
@ 2020-05-28 14:57                   ` Junio C Hamano
  2020-05-28 15:03                     ` Jonathan Nieder
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-05-28 14:57 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Johannes Schindelin, Josh Steadmon, brian m. carlson,
	Derrick Stolee, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

Jonathan Nieder <jrnieder@gmail.com> writes:

> I think it's worth discussing naming, but it's kind of a distraction
> from what Brian brought up.  The real question is, do we consider the
> existing "git gc" infrastructure such a lost cause that we should
> touch it as little as possible?

I am fine with that, as long as the "new" thing will take over what
"git gc" currently does.  It has always been bothering me that the
maintenance pieces for features added in the past few years like
midx were deliberately let outside the scope of "one entry point for
housekeeping".

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-28 14:57                   ` Junio C Hamano
@ 2020-05-28 15:03                     ` Jonathan Nieder
  2020-05-28 15:30                       ` Derrick Stolee
  0 siblings, 1 reply; 57+ messages in thread
From: Jonathan Nieder @ 2020-05-28 15:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Josh Steadmon, brian m. carlson,
	Derrick Stolee, Derrick Stolee via GitGitGadget, git, peff,
	jrnieder, Derrick Stolee

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

>>                              The real question is, do we consider the
>> existing "git gc" infrastructure such a lost cause that we should
>> touch it as little as possible?
>
> I am fine with that, as long as the "new" thing will take over what
> "git gc" currently does.

Good reminder, thank you.

Yes, as long as we end up replacing the old thing, making a parallel
new thing (e.g. with a config option for switching between during a
transition period) can be a fine approach.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/15] [RFC] Maintenance jobs and job runner
  2020-05-28 15:03                     ` Jonathan Nieder
@ 2020-05-28 15:30                       ` Derrick Stolee
  2020-05-28  4:39                         ` Johannes Schindelin
  0 siblings, 1 reply; 57+ messages in thread
From: Derrick Stolee @ 2020-05-28 15:30 UTC (permalink / raw)
  To: Jonathan Nieder, Junio C Hamano
  Cc: Johannes Schindelin, Josh Steadmon, brian m. carlson,
	Derrick Stolee via GitGitGadget, git, peff, jrnieder,
	Derrick Stolee

On 5/28/2020 11:03 AM, Jonathan Nieder wrote:
> Junio C Hamano wrote:
>> Jonathan Nieder <jrnieder@gmail.com> writes:
> 
>>>                              The real question is, do we consider the
>>> existing "git gc" infrastructure such a lost cause that we should
>>> touch it as little as possible?
>>
>> I am fine with that, as long as the "new" thing will take over what
>> "git gc" currently does.
> 
> Good reminder, thank you.
> 
> Yes, as long as we end up replacing the old thing, making a parallel
> new thing (e.g. with a config option for switching between during a
> transition period) can be a fine approach.

Thanks for the discussion, everyone. I'm sorry that I'm very late in
providing a new RFC that takes this approach, but yes I intend to create
the "single entry point" for maintenance activities, and incorporate
auto-GC as a default option there.

Something that is a good long-term goal is to have the new maintenance
entry point replace the "git gc --auto" calls, so we have better
customization of post-command "automatic" maintenance. This can be done
without any of the "background" part of my original RFC.

I've just been to busy with other tasks to devote the deep focus that
this feature deserves. Thanks for your patience.

-Stolee

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-13 13:15 [PATCH 03/15] run-job: implement fetch job Son Luong Ngoc
@ 2020-04-13 14:14 ` Derrick Stolee
  0 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-13 14:14 UTC (permalink / raw)
  To: Son Luong Ngoc, gitgitgadget; +Cc: dstolee, git, jrnieder, peff

On 4/13/2020 9:15 AM, Son Luong Ngoc wrote:
> Hi Derrick,
> 
> First of all, thanks a ton for upstreaming this.
> Despite multiple complaints about re-implementing cron in git,
> I see this as a huge improvement to git UX and it is very much welcome change.
> 
>> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>>    we can ensure that we actually load the new values somewhere in
>>    our refspace while not updating refs/heads or refs/remotes. By
>>    storing these refs here, the commit-graph job will update the
>>    commit-graph with the commits from these hidden refs.
> Ideally I think we want to let user configure which refs they want to
> prefetch with the default behavior being prefecting all HEADS
> available from remote.
> Using Facebook's Mercurial extension
> [RemoteFileLog](https://www.mercurial-scm.org/repo/hg/file/tip/hgext/remotefilelog/__init__.py#l31)
> as a UX reference,
> users should only prefetch the refs that they actually care about.

I will be sure to review this prior art before
submitting an update to this series.

>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>    decorate commits with twice as many refs if they appear at a
>>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>    there an easy way to exclude a refspace from decorations? Should
>>    we make refs/hidden/* a "special" refspace that is excluded from
>>    decorations?
> In git-log, there is
> [--decorate-refs-exclude](https://git-scm.com/docs/git-log#Documentation/git-log.txt---decorate-refs-excludeltpatterngt)
> which I think we can move into git-config as
> `log.decorate-refs-exclude`?
> If you let the `prefetch refs` be configurable as I suggested above, I
> think it make sense to have the git-log exclusions being configurable
> as well.

I was literally just working on exactly this when
your message arrived. I was going to name the config
option "log.decorateExclude".

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
@ 2020-04-13 13:15 Son Luong Ngoc
  2020-04-13 14:14 ` Derrick Stolee
  0 siblings, 1 reply; 57+ messages in thread
From: Son Luong Ngoc @ 2020-04-13 13:15 UTC (permalink / raw)
  To: gitgitgadget; +Cc: dstolee, git, jrnieder, peff, stolee

Hi Derrick,

First of all, thanks a ton for upstreaming this.
Despite multiple complaints about re-implementing cron in git,
I see this as a huge improvement to git UX and it is very much welcome change.

> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
Ideally I think we want to let user configure which refs they want to
prefetch with the default behavior being prefecting all HEADS
available from remote.
Using Facebook's Mercurial extension
[RemoteFileLog](https://www.mercurial-scm.org/repo/hg/file/tip/hgext/remotefilelog/__init__.py#l31)
as a UX reference,
users should only prefetch the refs that they actually care about.

> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations? Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?
In git-log, there is
[--decorate-refs-exclude](https://git-scm.com/docs/git-log#Documentation/git-log.txt---decorate-refs-excludeltpatterngt)
which I think we can move into git-config as
`log.decorate-refs-exclude`?
If you let the `prefetch refs` be configurable as I suggested above, I
think it make sense to have the git-log exclusions being configurable
as well.

Cheers,
Son Luong.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, back to index

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
2020-04-05 15:10   ` Phillip Wood
2020-04-05 19:21     ` Junio C Hamano
2020-04-06 14:42       ` Derrick Stolee
2020-04-07  0:58         ` Danh Doan
2020-04-07 10:54           ` Derrick Stolee
2020-04-07 14:16             ` Danh Doan
2020-04-07 14:30               ` Johannes Schindelin
2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
2020-05-20 19:08   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
2020-04-05 15:14   ` Phillip Wood
2020-04-06 12:48     ` Derrick Stolee
2020-04-05 20:28   ` Junio C Hamano
2020-04-06 12:46     ` Derrick Stolee
2020-05-20 19:08   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
2020-04-05 20:33   ` Junio C Hamano
2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
2020-05-27 22:17   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
2020-04-05 15:18   ` Phillip Wood
2020-04-06 12:49     ` Derrick Stolee
2020-04-05 15:41   ` Phillip Wood
2020-04-06 12:57     ` Derrick Stolee
2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
2020-04-05 15:24   ` Phillip Wood
2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
2020-04-04  0:16   ` Derrick Stolee
2020-04-07  0:50     ` Danh Doan
2020-04-07 10:59       ` Derrick Stolee
2020-04-07 14:26         ` Danh Doan
2020-04-07 14:43           ` Johannes Schindelin
2020-04-07  1:48     ` brian m. carlson
2020-04-07 20:08       ` Junio C Hamano
2020-04-07 22:23       ` Johannes Schindelin
2020-04-08  0:01         ` brian m. carlson
2020-05-27 22:39           ` Josh Steadmon
2020-05-28  0:47             ` Junio C Hamano
2020-05-27 21:52               ` Johannes Schindelin
2020-05-28 14:48                 ` Junio C Hamano
2020-05-28 14:50                 ` Jonathan Nieder
2020-05-28 14:57                   ` Junio C Hamano
2020-05-28 15:03                     ` Jonathan Nieder
2020-05-28 15:30                       ` Derrick Stolee
2020-05-28  4:39                         ` Johannes Schindelin
2020-04-13 13:15 [PATCH 03/15] run-job: implement fetch job Son Luong Ngoc
2020-04-13 14:14 ` Derrick Stolee

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git