git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
@ 2020-07-07 14:21 Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 01/21] gc: use the_repository less often Derrick Stolee via GitGitGadget
                   ` (22 more replies)
  0 siblings, 23 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

This is a second attempt at redesigning Git's repository maintenance
patterns. The first attempt [1] included a way to run jobs in the background
using a long-lived process; that idea was rejected and is not included in
this series. A future series will use the OS to handle scheduling tasks.

[1] https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/

As mentioned before, git gc already plays the role of maintaining Git
repositories. It has accumulated several smaller pieces in its long history,
including:

 1. Repacking all reachable objects into one pack-file (and deleting
    unreachable objects).
 2. Packing refs.
 3. Expiring reflogs.
 4. Clearing rerere logs.
 5. Updating the commit-graph file.

While expiring reflogs, clearing rerere logs, and deleting unreachable
objects are suitable under the guise of "garbage collection", packing refs
and updating the commit-graph file are not as obviously fitting. Further,
these operations are "all or nothing" in that they rewrite almost all
repository data, which does not perform well at extremely large scales.
These operations can also be disruptive to foreground Git commands when git
gc --auto triggers during routine use.

This series does not intend to change what git gc does, but instead creates
new choices for automatic maintenance activities, of which git gc remains
the only one enabled by default.

The new maintenance tasks are:

 * 'commit-graph' : write and verify a single layer of an incremental
   commit-graph.
 * 'loose-objects' : prune packed loose objects, then create a new pack from
   a batch of loose objects.
 * 'pack-files' : expire redundant packs from the multi-pack-index, then
   repack using the multi-pack-index's incremental repack strategy.
 * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.

These tasks are all disabled by default, but can be enabled with config
options or run explicitly using "git maintenance run --task=". There are
additional config options to customize the conditions under which the
tasks run with the '--auto' option. ('fetch' will never run with the
'--auto' option.)
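
The enable/auto behavior described above can be pictured as a small table
of tasks. What follows is only a sketch under assumptions: the struct
layout, the should_run() and count_runnable() helpers, and the condition
callbacks are hypothetical names chosen for illustration and are not taken
from the patches in this series.

```c
#include <stddef.h>

/* Hypothetical task descriptor; every name here is illustrative only. */
struct task {
	const char *name;
	int enabled;                 /* set from config or by an explicit request */
	int (*auto_condition)(void); /* NULL: never runs under --auto */
};

/* Trivial example conditions for use in a task table. */
static int always(void) { return 1; }
static int never(void) { return 0; }

/* Decide whether one task runs, given whether --auto was passed. */
static int should_run(const struct task *t, int auto_flag)
{
	if (!t->enabled)
		return 0;
	if (!auto_flag)
		return 1;
	return t->auto_condition && t->auto_condition();
}

/* Count the tasks in a table that would run for this invocation. */
static size_t count_runnable(const struct task *tasks, size_t nr, int auto_flag)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		n += should_run(&tasks[i], auto_flag);
	return n;
}
```

In this sketch a task such as 'fetch' would carry a NULL auto_condition, so
it is skipped whenever auto_flag is set, while a disabled task is skipped in
both modes.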

Because 'gc' is implemented as a maintenance task, the most dramatic change
of this series is to convert the 'git gc --auto' calls into 'git maintenance
run --auto' calls at the end of some Git commands. By default, the only
change is that 'git gc --auto' will be run below an additional 'git
maintenance' process.

The 'git maintenance' builtin has a 'run' subcommand so it can be extended
later with subcommands that manage background maintenance, such as 'start',
'stop', 'pause', or 'schedule'. These are not the subject of this series, as
it is important to focus on the maintenance activities themselves.

An expert user could set up scheduled background maintenance themselves with
the current series. I have the following crontab data set up to run
maintenance on an hourly basis:

0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

My config includes all tasks except the 'gc' task. The hourly run is
over-aggressive, but is sufficient for testing. I'll replace it with a daily
run when I feel satisfied.

Hopefully this direction is seen as a positive one. My goal was to add more
options for expert users, along with the flexibility to create background
maintenance via the OS in a later series.

OUTLINE
=======

Patches 1-4 remove some references to the_repository in builtin/gc.c before
we start depending on code in that builtin.

Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
simple shim over 'git gc' and replace calls to 'git gc --auto' from other
commands.

Patches 8-15 create new maintenance tasks. These are the same tasks sent in
the previous RFC.

Patches 16-21 create more customization through config and perform other
polish items.

FUTURE WORK
===========

 * Add 'start', 'stop', and 'schedule' subcommands to initialize the
   commands run in the background.

 * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
   default, but might have different '--auto' conditions and more config
   options.

 * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
   with use of the 'commit-graph' task.

Thanks, -Stolee

Derrick Stolee (21):
  gc: use the_repository less often
  gc: use repository in too_many_loose_objects()
  gc: use repo config
  gc: drop the_repository in log location
  maintenance: create basic maintenance runner
  maintenance: add --quiet option
  maintenance: replace run_auto_gc()
  maintenance: initialize task array and hashmap
  maintenance: add commit-graph task
  maintenance: add --task option
  maintenance: take a lock on the objects directory
  maintenance: add fetch task
  maintenance: add loose-objects task
  maintenance: add pack-files task
  maintenance: auto-size pack-files batch
  maintenance: create maintenance.<task>.enabled config
  maintenance: use pointers to check --auto
  maintenance: add auto condition for commit-graph task
  maintenance: create auto condition for loose-objects
  maintenance: add pack-files auto condition
  midx: use start_delayed_progress()

 .gitignore                           |   1 +
 Documentation/config.txt             |   2 +
 Documentation/config/maintenance.txt |  32 +
 Documentation/fetch-options.txt      |   5 +-
 Documentation/git-clone.txt          |   7 +-
 Documentation/git-maintenance.txt    | 124 ++++
 builtin.h                            |   1 +
 builtin/am.c                         |   2 +-
 builtin/commit.c                     |   2 +-
 builtin/fetch.c                      |   6 +-
 builtin/gc.c                         | 881 +++++++++++++++++++++++++--
 builtin/merge.c                      |   2 +-
 builtin/rebase.c                     |   4 +-
 commit-graph.c                       |   8 +-
 commit-graph.h                       |   1 +
 config.c                             |  24 +-
 config.h                             |   2 +
 git.c                                |   1 +
 midx.c                               |  12 +-
 midx.h                               |   1 +
 object.h                             |   1 +
 run-command.c                        |   7 +-
 run-command.h                        |   2 +-
 t/t5319-multi-pack-index.sh          |  14 +-
 t/t5510-fetch.sh                     |   2 +-
 t/t5514-fetch-multiple.sh            |   2 +-
 t/t7900-maintenance.sh               | 211 +++++++
 27 files changed, 1265 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/config/maintenance.txt
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh


base-commit: 4a0fcf9f760c9774be77f51e1e88a7499b53d2e2
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/671
-- 
gitgitgadget


* [PATCH 01/21] gc: use the_repository less often
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 02/21] gc: use repository in too_many_loose_objects() Derrick Stolee via GitGitGadget
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In a later change, we will consume several static methods in
builtin/gc.c for another builtin. Before doing so, let's clean up some
uses of the_repository. These specifically are centered around accesses
to the packed_git list.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
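
As an illustration of the pattern in this patch, here is a minimal
standalone sketch of a pack-counting helper that takes the repository as an
explicit argument instead of reaching for a global. The struct pack and
struct repo types are simplified stand-ins, not Git's real definitions, and
the counting and final comparison are assumptions, since the hunk below
shows only part of the loop in too_many_packs().

```c
#include <stddef.h>

/* Simplified stand-ins for Git's types; only the fields used here. */
struct pack {
	struct pack *next;
	int pack_local;
	int pack_keep;
};

struct repo {
	struct pack *packs;
};

/*
 * Count local packs that are not marked "keep" and compare the count
 * against a limit. The repository is a parameter, not a global.
 */
static int too_many_packs_sketch(const struct repo *r, int limit)
{
	const struct pack *p;
	int cnt = 0;

	if (limit <= 0)
		return 0;

	for (p = r->packs; p; p = p->next) {
		if (!p->pack_local || p->pack_keep)
			continue;
		cnt++;
	}
	return limit < cnt;
}
```

Because the helper depends only on its arguments, it can be exercised
against any repository-like value a caller constructs.
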
 builtin/gc.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 8e0b9cf41b..5c5e0df5bf 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -192,12 +192,13 @@ static int too_many_loose_objects(void)
 	return needed;
 }
 
-static struct packed_git *find_base_packs(struct string_list *packs,
+static struct packed_git *find_base_packs(struct repository *r,
+					  struct string_list *packs,
 					  unsigned long limit)
 {
 	struct packed_git *p, *base = NULL;
 
-	for (p = get_all_packs(the_repository); p; p = p->next) {
+	for (p = get_all_packs(r); p; p = p->next) {
 		if (!p->pack_local)
 			continue;
 		if (limit) {
@@ -214,7 +215,7 @@ static struct packed_git *find_base_packs(struct string_list *packs,
 	return base;
 }
 
-static int too_many_packs(void)
+static int too_many_packs(struct repository *r)
 {
 	struct packed_git *p;
 	int cnt;
@@ -222,7 +223,7 @@ static int too_many_packs(void)
 	if (gc_auto_pack_limit <= 0)
 		return 0;
 
-	for (cnt = 0, p = get_all_packs(the_repository); p; p = p->next) {
+	for (cnt = 0, p = get_all_packs(r); p; p = p->next) {
 		if (!p->pack_local)
 			continue;
 		if (p->pack_keep)
@@ -334,7 +335,7 @@ static void add_repack_incremental_option(void)
 	argv_array_push(&repack, "--no-write-bitmap-index");
 }
 
-static int need_to_gc(void)
+static int need_to_gc(struct repository *r)
 {
 	/*
 	 * Setting gc.auto to 0 or negative can disable the
@@ -349,18 +350,18 @@ static int need_to_gc(void)
 	 * we run "repack -A -d -l".  Otherwise we tell the caller
 	 * there is no need.
 	 */
-	if (too_many_packs()) {
+	if (too_many_packs(r)) {
 		struct string_list keep_pack = STRING_LIST_INIT_NODUP;
 
 		if (big_pack_threshold) {
-			find_base_packs(&keep_pack, big_pack_threshold);
+			find_base_packs(r, &keep_pack, big_pack_threshold);
 			if (keep_pack.nr >= gc_auto_pack_limit) {
 				big_pack_threshold = 0;
 				string_list_clear(&keep_pack, 0);
-				find_base_packs(&keep_pack, 0);
+				find_base_packs(r, &keep_pack, 0);
 			}
 		} else {
-			struct packed_git *p = find_base_packs(&keep_pack, 0);
+			struct packed_git *p = find_base_packs(r, &keep_pack, 0);
 			uint64_t mem_have, mem_want;
 
 			mem_have = total_ram();
@@ -523,6 +524,7 @@ static void gc_before_repack(void)
 
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
+	struct repository *r = the_repository;
 	int aggressive = 0;
 	int auto_gc = 0;
 	int quiet = 0;
@@ -589,7 +591,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		/*
 		 * Auto-gc should be least intrusive as possible.
 		 */
-		if (!need_to_gc())
+		if (!need_to_gc(r))
 			return 0;
 		if (!quiet) {
 			if (detach_auto)
@@ -623,9 +625,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 		if (keep_base_pack != -1) {
 			if (keep_base_pack)
-				find_base_packs(&keep_pack, 0);
+				find_base_packs(r, &keep_pack, 0);
 		} else if (big_pack_threshold) {
-			find_base_packs(&keep_pack, big_pack_threshold);
+			find_base_packs(r, &keep_pack, big_pack_threshold);
 		}
 
 		add_repack_all_option(&keep_pack);
@@ -652,7 +654,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	gc_before_repack();
 
 	if (!repository_format_precious_objects) {
-		close_object_store(the_repository->objects);
+		close_object_store(r->objects);
 		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
 			die(FAILED_RUN, repack.argv[0]);
 
@@ -678,15 +680,15 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		die(FAILED_RUN, rerere.argv[0]);
 
 	report_garbage = report_pack_garbage;
-	reprepare_packed_git(the_repository);
+	reprepare_packed_git(r);
 	if (pack_garbage.nr > 0) {
-		close_object_store(the_repository->objects);
+		close_object_store(r->objects);
 		clean_pack_garbage();
 	}
 
-	prepare_repo_settings(the_repository);
-	if (the_repository->settings.gc_write_commit_graph == 1)
-		write_commit_graph_reachable(the_repository->objects->odb,
+	prepare_repo_settings(r);
+	if (r->settings.gc_write_commit_graph == 1)
+		write_commit_graph_reachable(r->objects->odb,
 					     !quiet && !daemonized ? COMMIT_GRAPH_WRITE_PROGRESS : 0,
 					     NULL);
 
-- 
gitgitgadget



* [PATCH 02/21] gc: use repository in too_many_loose_objects()
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 01/21] gc: use the_repository less often Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 03/21] gc: use repo config Derrick Stolee via GitGitGadget
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change mechanically swapped the_repository for a struct
repository pointer wherever the use of the_repository was obvious.
However, the too_many_loose_objects() method uses git_path() instead of
repo_git_path(), which implies a hidden dependence on the_repository.
Switch it to repo_git_path() and pass the repository pointer in from its
callers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 5c5e0df5bf..6d8267cecb 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -159,7 +159,7 @@ static void gc_config(void)
 	git_config(git_default_config, NULL);
 }
 
-static int too_many_loose_objects(void)
+static int too_many_loose_objects(struct repository *r)
 {
 	/*
 	 * Quickly check if a "gc" is needed, by estimating how
@@ -174,7 +174,7 @@ static int too_many_loose_objects(void)
 	int needed = 0;
 	const unsigned hexsz_loose = the_hash_algo->hexsz - 2;
 
-	dir = opendir(git_path("objects/17"));
+	dir = opendir(repo_git_path(r, "objects/17"));
 	if (!dir)
 		return 0;
 
@@ -378,7 +378,7 @@ static int need_to_gc(struct repository *r)
 
 		add_repack_all_option(&keep_pack);
 		string_list_clear(&keep_pack, 0);
-	} else if (too_many_loose_objects())
+	} else if (too_many_loose_objects(r))
 		add_repack_incremental_option();
 	else
 		return 0;
@@ -692,7 +692,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 					     !quiet && !daemonized ? COMMIT_GRAPH_WRITE_PROGRESS : 0,
 					     NULL);
 
-	if (auto_gc && too_many_loose_objects())
+	if (auto_gc && too_many_loose_objects(r))
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
 
-- 
gitgitgadget



* [PATCH 03/21] gc: use repo config
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 01/21] gc: use the_repository less often Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 02/21] gc: use repository in too_many_loose_objects() Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 04/21] gc: drop the_repository in log location Derrick Stolee via GitGitGadget
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The gc builtin is working towards using a repository pointer instead of
relying on the_repository everywhere. Some of the instances of
the_repository are hidden underneath git_config...() methods. Expose
them by using repo_config...() instead.

One method did not previously have a repo equivalent:
git_config_get_expiry().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
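
The shape of the config.c change can be sketched in isolation: the
repository-aware function holds the logic, and the legacy entry point
forwards to it with a default repository. Everything below is a simplified
model under assumptions. struct repo, default_repo, and the *_sketch names
are hypothetical, and expiry_is_acceptable() takes already-parsed
timestamps rather than calling Git's approxidate().

```c
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a repository with one config value. */
struct repo {
	const char *expiry_value; /* pretend config value, may be NULL */
};

static struct repo default_repo = { "now" };

/* Mirrors the check in the patch: "now" is fine, otherwise must be past. */
static int expiry_is_acceptable(const char *value, time_t parsed, time_t now)
{
	if (!strcmp(value, "now"))
		return 1;
	return parsed < now;
}

/* Returns 0 and sets *out when the value is set, 1 when it is missing. */
static int repo_get_expiry_sketch(struct repo *r, const char **out)
{
	if (!r->expiry_value)
		return 1;
	*out = r->expiry_value;
	return 0;
}

/* Legacy entry point: forwards to the repo-aware version. */
static int get_expiry_sketch(const char **out)
{
	return repo_get_expiry_sketch(&default_repo, out);
}
```

Keeping the old name as a one-line wrapper means existing callers do not
need to change at the same time as the new repo-aware function is added.
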
 builtin/gc.c | 37 +++++++++++++++++++------------------
 config.c     | 24 +++++++++++++++---------
 config.h     |  2 ++
 3 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 6d8267cecb..888b6444d6 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -116,12 +116,13 @@ static void process_log_file_on_signal(int signo)
 	raise(signo);
 }
 
-static int gc_config_is_timestamp_never(const char *var)
+static int gc_config_is_timestamp_never(struct repository *r,
+					const char *var)
 {
 	const char *value;
 	timestamp_t expire;
 
-	if (!git_config_get_value(var, &value) && value) {
+	if (!repo_config_get_value(r, var, &value) && value) {
 		if (parse_expiry_date(value, &expire))
 			die(_("failed to parse '%s' value '%s'"), var, value);
 		return expire == 0;
@@ -129,34 +130,34 @@ static int gc_config_is_timestamp_never(const char *var)
 	return 0;
 }
 
-static void gc_config(void)
+static void gc_config(struct repository *r)
 {
 	const char *value;
 
-	if (!git_config_get_value("gc.packrefs", &value)) {
+	if (!repo_config_get_value(r, "gc.packrefs", &value)) {
 		if (value && !strcmp(value, "notbare"))
 			pack_refs = -1;
 		else
 			pack_refs = git_config_bool("gc.packrefs", value);
 	}
 
-	if (gc_config_is_timestamp_never("gc.reflogexpire") &&
-	    gc_config_is_timestamp_never("gc.reflogexpireunreachable"))
+	if (gc_config_is_timestamp_never(r, "gc.reflogexpire") &&
+	    gc_config_is_timestamp_never(r, "gc.reflogexpireunreachable"))
 		prune_reflogs = 0;
 
-	git_config_get_int("gc.aggressivewindow", &aggressive_window);
-	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
-	git_config_get_int("gc.auto", &gc_auto_threshold);
-	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
-	git_config_get_bool("gc.autodetach", &detach_auto);
-	git_config_get_expiry("gc.pruneexpire", &prune_expire);
-	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
-	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
+	repo_config_get_int(r, "gc.aggressivewindow", &aggressive_window);
+	repo_config_get_int(r, "gc.aggressivedepth", &aggressive_depth);
+	repo_config_get_int(r, "gc.auto", &gc_auto_threshold);
+	repo_config_get_int(r, "gc.autopacklimit", &gc_auto_pack_limit);
+	repo_config_get_bool(r, "gc.autodetach", &detach_auto);
+	repo_config_get_expiry(r, "gc.pruneexpire", &prune_expire);
+	repo_config_get_expiry(r, "gc.worktreepruneexpire", &prune_worktrees_expire);
+	repo_config_get_expiry(r, "gc.logexpiry", &gc_log_expire);
 
-	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
-	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
+	repo_config_get_ulong(r, "gc.bigpackthreshold", &big_pack_threshold);
+	repo_config_get_ulong(r, "pack.deltacachesize", &max_delta_cache_size);
 
-	git_config(git_default_config, NULL);
+	repo_config(r, git_default_config, NULL);
 }
 
 static int too_many_loose_objects(struct repository *r)
@@ -562,7 +563,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	argv_array_pushl(&rerere, "rerere", "gc", NULL);
 
 	/* default expiry time, overwritten in gc_config */
-	gc_config();
+	gc_config(r);
 	if (parse_expiry_date(gc_log_expire, &gc_log_expire_time))
 		die(_("failed to parse gc.logexpiry value %s"), gc_log_expire);
 
diff --git a/config.c b/config.c
index 8db9c77098..fe83f8b67c 100644
--- a/config.c
+++ b/config.c
@@ -2211,6 +2211,20 @@ int repo_config_get_pathname(struct repository *repo,
 	return ret;
 }
 
+int repo_config_get_expiry(struct repository *repo,
+			   const char *key, const char **output)
+{
+	int ret = repo_config_get_string_const(repo, key, output);
+	if (ret)
+		return ret;
+	if (strcmp(*output, "now")) {
+		timestamp_t now = approxidate("now");
+		if (approxidate(*output) >= now)
+			git_die_config(key, _("Invalid %s: '%s'"), key, *output);
+	}
+	return ret;
+}
+
 /* Functions used historically to read configuration from 'the_repository' */
 void git_config(config_fn_t fn, void *data)
 {
@@ -2274,15 +2288,7 @@ int git_config_get_pathname(const char *key, const char **dest)
 
 int git_config_get_expiry(const char *key, const char **output)
 {
-	int ret = git_config_get_string_const(key, output);
-	if (ret)
-		return ret;
-	if (strcmp(*output, "now")) {
-		timestamp_t now = approxidate("now");
-		if (approxidate(*output) >= now)
-			git_die_config(key, _("Invalid %s: '%s'"), key, *output);
-	}
-	return ret;
+	return repo_config_get_expiry(the_repository, key, output);
 }
 
 int git_config_get_expiry_in_days(const char *key, timestamp_t *expiry, timestamp_t now)
diff --git a/config.h b/config.h
index 060874488f..085bcf6917 100644
--- a/config.h
+++ b/config.h
@@ -490,6 +490,8 @@ int repo_config_get_maybe_bool(struct repository *repo,
 			       const char *key, int *dest);
 int repo_config_get_pathname(struct repository *repo,
 			     const char *key, const char **dest);
+int repo_config_get_expiry(struct repository *repo,
+			   const char *key, const char **output);
 
 /**
  * Querying For Specific Variables
-- 
gitgitgadget



* [PATCH 04/21] gc: drop the_repository in log location
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 03/21] gc: use repo config Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-09  2:22   ` Jonathan Tan
  2020-07-07 14:21 ` [PATCH 05/21] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The report_last_gc_error() method uses git_pathdup(), which implicitly
uses the_repository. Replace this with strbuf_repo_git_path() to get a
path buffer we control that uses a given repository pointer.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
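
The ownership pattern in this patch (build the path from an explicit
repository, then release it at a single exit label) can be sketched without
Git's strbuf API. This is an assumption-laden illustration: it swaps strbuf
for plain malloc()/snprintf(), and struct repo, repo_path_sketch(), and
log_has_content() are hypothetical names, not functions from Git.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a repository that knows its git directory. */
struct repo {
	const char *gitdir;
};

/*
 * Build "<gitdir>/<name>" in a freshly allocated buffer that the caller
 * owns and must free(). Returns NULL on allocation failure.
 */
static char *repo_path_sketch(const struct repo *r, const char *name)
{
	size_t len = strlen(r->gitdir) + 1 + strlen(name) + 1;
	char *buf = malloc(len);

	if (buf)
		snprintf(buf, len, "%s/%s", r->gitdir, name);
	return buf;
}

/* Returns 1 if the log can be opened and is non-empty, 0 if not, -1 on OOM. */
static int log_has_content(const struct repo *r)
{
	int ret = 0;
	FILE *f;
	char *path = repo_path_sketch(r, "gc.log");

	if (!path)
		return -1;
	f = fopen(path, "r");
	if (!f)
		goto done; /* this sketch treats an unopenable log as empty */
	if (fgetc(f) != EOF)
		ret = 1;
	fclose(f);
done:
	free(path); /* single release point for the path buffer */
	return ret;
}
```

Routing every exit through one label keeps the release of the path buffer
in one place, which is the same structure the hunk below preserves.
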
 builtin/gc.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 888b6444d6..58a00be5af 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -462,28 +462,30 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
  * gc should not proceed due to an error in the last run. Prints a
  * message and returns -1 if an error occurred while reading gc.log
  */
-static int report_last_gc_error(void)
+static int report_last_gc_error(struct repository *r)
 {
 	struct strbuf sb = STRBUF_INIT;
 	int ret = 0;
 	ssize_t len;
 	struct stat st;
-	char *gc_log_path = git_pathdup("gc.log");
+	struct strbuf gc_log_path = STRBUF_INIT;
 
-	if (stat(gc_log_path, &st)) {
+	strbuf_repo_git_path(&gc_log_path, r, "gc.log");
+
+	if (stat(gc_log_path.buf, &st)) {
 		if (errno == ENOENT)
 			goto done;
 
-		ret = error_errno(_("cannot stat '%s'"), gc_log_path);
+		ret = error_errno(_("cannot stat '%s'"), gc_log_path.buf);
 		goto done;
 	}
 
 	if (st.st_mtime < gc_log_expire_time)
 		goto done;
 
-	len = strbuf_read_file(&sb, gc_log_path, 0);
+	len = strbuf_read_file(&sb, gc_log_path.buf, 0);
 	if (len < 0)
-		ret = error_errno(_("cannot read '%s'"), gc_log_path);
+		ret = error_errno(_("cannot read '%s'"), gc_log_path.buf);
 	else if (len > 0) {
 		/*
 		 * A previous gc failed.  Report the error, and don't
@@ -496,12 +498,12 @@ static int report_last_gc_error(void)
 			       "Automatic cleanup will not be performed "
 			       "until the file is removed.\n\n"
 			       "%s"),
-			    gc_log_path, sb.buf);
+			    gc_log_path.buf, sb.buf);
 		ret = 1;
 	}
 	strbuf_release(&sb);
 done:
-	free(gc_log_path);
+	strbuf_release(&gc_log_path);
 	return ret;
 }
 
@@ -602,7 +604,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
 		if (detach_auto) {
-			int ret = report_last_gc_error();
+			int ret = report_last_gc_error(r);
 			if (ret < 0)
 				/* an I/O error occurred, already reported */
 				exit(128);
-- 
gitgitgadget



* [PATCH 05/21] maintenance: create basic maintenance runner
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 04/21] gc: drop the_repository in log location Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 06/21] maintenance: add --quiet option Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'gc' builtin is our current entrypoint for automatically maintaining
a repository. This one tool does many operations, such as repacking the
repository, packing refs, and rewriting the commit-graph file. The name
implies it performs "garbage collection" which means several different
things, and some users may not want to use this operation that rewrites
the entire object database.

Create a new 'maintenance' builtin that will become a more general-
purpose command. To start, it will only support the 'run' subcommand,
but will later expand to add subcommands for scheduling maintenance in
the background.

For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin.
In fact, the only option is the '--auto' toggle, which is handed
directly to the 'gc' builtin. The current change is isolated to this
simple operation to prevent more interesting logic from being lost in
all of the boilerplate of adding a new builtin.

Use the existing builtin/gc.c file because we want to share code between the
two builtins. It is possible that we will have 'maintenance' replace the
'gc' builtin entirely at some point, leaving 'git gc' as an alias for
some specific arguments to 'git maintenance run'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
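
The two decisions this shim makes (which arguments to hand to the child
'gc' process, and whether the remaining argument selects 'run') can be
shown as a standalone sketch. It does not use Git's argv_array or
parse_options APIs. build_gc_argv(), is_run_subcommand(), and MAX_ARGS are
hypothetical names introduced only for this illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 4

/*
 * Fill out[] with the child command line: "gc", plus "--auto" when
 * requested, NULL-terminated. Returns the number of arguments.
 */
static size_t build_gc_argv(int auto_flag, const char *out[MAX_ARGS])
{
	size_t n = 0;

	out[n++] = "gc";
	if (auto_flag)
		out[n++] = "--auto";
	out[n] = NULL;
	return n;
}

/* Returns 1 if the remaining arguments select the "run" subcommand. */
static int is_run_subcommand(int nr, const char **args)
{
	return nr == 1 && !strcmp(args[0], "run");
}
```

Anything other than exactly one remaining argument equal to "run" falls
through to the usage message, matching the dispatch at the end of
cmd_maintenance() in the diff below.
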
 .gitignore                        |  1 +
 Documentation/git-maintenance.txt | 57 +++++++++++++++++++++++++++++
 builtin.h                         |  1 +
 builtin/gc.c                      | 61 +++++++++++++++++++++++++++++++
 git.c                             |  1 +
 t/t7900-maintenance.sh            | 22 +++++++++++
 6 files changed, 143 insertions(+)
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh

diff --git a/.gitignore b/.gitignore
index ee509a2ad2..a5808fa30d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -90,6 +90,7 @@
 /git-ls-tree
 /git-mailinfo
 /git-mailsplit
+/git-maintenance
 /git-merge
 /git-merge-base
 /git-merge-index
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
new file mode 100644
index 0000000000..34cd2b4417
--- /dev/null
+++ b/Documentation/git-maintenance.txt
@@ -0,0 +1,57 @@
+git-maintenance(1)
+==================
+
+NAME
+----
+git-maintenance - Run tasks to optimize Git repository data
+
+
+SYNOPSIS
+--------
+[verse]
+'git maintenance' run [<options>]
+
+
+DESCRIPTION
+-----------
+Run tasks to optimize Git repository data, speeding up other Git commands
+and reducing storage requirements for the repository.
++
+Git commands that add repository data, such as `git add` or `git fetch`,
+are optimized for a responsive user experience. These commands do not take
+time to optimize the Git data, since such optimizations scale with the full
+size of the repository while these user commands each perform a relatively
+small action.
++
+The `git maintenance` command provides flexibility for how to optimize the
+Git repository.
+
+SUBCOMMANDS
+-----------
+
+run::
+	Run one or more maintenance tasks.
+
+TASKS
+-----
+
+gc::
+	Cleanup unnecessary files and optimize the local repository. "GC"
+	stands for "garbage collection," but this task performs many
+	smaller tasks. This task can be rather expensive for large
+	repositories, as it repacks all Git objects into a single pack-file.
+	It can also be disruptive in some situations, as it deletes stale
+	data.
+
+OPTIONS
+-------
+--auto::
+	When combined with the `run` subcommand, run maintenance tasks
+	only if certain thresholds are met. For example, the `gc` task
+	runs when the number of loose objects exceeds the number stored
+	in the `gc.auto` config setting, or when the number of pack-files
+	exceeds the `gc.autoPackLimit` config setting.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/builtin.h b/builtin.h
index a5ae15bfe5..17c1c0ce49 100644
--- a/builtin.h
+++ b/builtin.h
@@ -167,6 +167,7 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix);
 int cmd_ls_remote(int argc, const char **argv, const char *prefix);
 int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 int cmd_mailsplit(int argc, const char **argv, const char *prefix);
+int cmd_maintenance(int argc, const char **argv, const char *prefix);
 int cmd_merge(int argc, const char **argv, const char *prefix);
 int cmd_merge_base(int argc, const char **argv, const char *prefix);
 int cmd_merge_index(int argc, const char **argv, const char *prefix);
diff --git a/builtin/gc.c b/builtin/gc.c
index 58a00be5af..07cc5f46ae 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -704,3 +704,64 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	return 0;
 }
+
+static const char * const builtin_maintenance_usage[] = {
+	N_("git maintenance run [<options>]"),
+	NULL
+};
+
+struct maintenance_opts {
+	int auto_flag;
+} opts;
+
+static int maintenance_task_gc(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "gc", NULL);
+
+	if (opts.auto_flag)
+		argv_array_pushl(&cmd, "--auto", NULL);
+
+	close_object_store(r->objects);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int maintenance_run(struct repository *r)
+{
+	return maintenance_task_gc(r);
+}
+
+int cmd_maintenance(int argc, const char **argv, const char *prefix)
+{
+	struct repository *r = the_repository;
+
+	static struct option builtin_maintenance_options[] = {
+		OPT_BOOL(0, "auto", &opts.auto_flag,
+			 N_("run tasks based on the state of the repository")),
+		OPT_END()
+	};
+
+	memset(&opts, 0, sizeof(opts));
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_maintenance_usage,
+				   builtin_maintenance_options);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_maintenance_options,
+			     builtin_maintenance_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+
+	if (argc == 1) {
+		if (!strcmp(argv[0], "run"))
+			return maintenance_run(r);
+	}
+
+	usage_with_options(builtin_maintenance_usage,
+			   builtin_maintenance_options);
+}
diff --git a/git.c b/git.c
index 2f021b97f3..ff56d1df24 100644
--- a/git.c
+++ b/git.c
@@ -527,6 +527,7 @@ static struct cmd_struct commands[] = {
 	{ "ls-tree", cmd_ls_tree, RUN_SETUP },
 	{ "mailinfo", cmd_mailinfo, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "mailsplit", cmd_mailsplit, NO_PARSEOPT },
+	{ "maintenance", cmd_maintenance, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "merge", cmd_merge, RUN_SETUP | NEED_WORK_TREE },
 	{ "merge-base", cmd_merge_base, RUN_SETUP },
 	{ "merge-file", cmd_merge_file, RUN_SETUP_GENTLY },
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
new file mode 100755
index 0000000000..d00641c4dd
--- /dev/null
+++ b/t/t7900-maintenance.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+test_description='git maintenance builtin'
+
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
+
+. ./test-lib.sh
+
+test_expect_success 'help text' '
+	test_must_fail git maintenance -h 2>err &&
+	test_i18ngrep "usage: git maintenance run" err
+'
+
+test_expect_success 'gc [--auto]' '
+	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
+	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
+	grep ",\"gc\"]" run-no-auto.txt  &&
+	grep ",\"gc\",\"--auto\"]" run-auto.txt
+'
+
+test_done
-- 
gitgitgadget



* [PATCH 06/21] maintenance: add --quiet option
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 05/21] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 07/21] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Maintenance activities are commonly used as steps in larger scripts.
Providing a '--quiet' option allows those scripts to be less noisy when
run on a terminal window. Turn this mode on by default when stderr is
not a terminal.

Pass the option through to the 'git gc' child process.
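
As a minimal sketch of the default described above (this is not the code
in the patch), the helper below seeds the flag from whether stderr is a
terminal and lets an explicit command-line choice override it. The names
resolve_quiet(), resolve_quiet_for_stderr() and the tri-state cli_choice
are made up for illustration; the patch itself sets opts.quiet before
parse_options() and lets the parser overwrite it.

```c
#include <unistd.h>	/* isatty */

/*
 * Hypothetical helper, not part of the patch.
 * cli_choice: -1 = option not given, 0 = --no-quiet, 1 = --quiet.
 * stderr_is_tty: nonzero when stderr is a terminal; passed in so the
 * decision can be exercised without a real terminal.
 */
int resolve_quiet(int cli_choice, int stderr_is_tty)
{
	if (cli_choice >= 0)
		return cli_choice;	/* an explicit flag always wins */
	return !stderr_is_tty;		/* default: quiet unless on a terminal */
}

/* Convenience wrapper that asks about the real stderr (fd 2). */
int resolve_quiet_for_stderr(int cli_choice)
{
	return resolve_quiet(cli_choice, isatty(2) == 1);
}
```

Keeping the terminal check as a parameter is only a testing convenience;
the decision rule is the same either way.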

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 3 +++
 builtin/gc.c                      | 7 +++++++
 t/t7900-maintenance.sh            | 8 +++++---
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 34cd2b4417..089fa4cedc 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -52,6 +52,9 @@ OPTIONS
 	in the `gc.auto` config setting, or when the number of pack-files
 	exceeds the `gc.autoPackLimit` config setting.
 
+--quiet::
+	Do not report progress or other information over `stderr`.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/gc.c b/builtin/gc.c
index 07cc5f46ae..3881a99e9d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -712,6 +712,7 @@ static const char * const builtin_maintenance_usage[] = {
 
 struct maintenance_opts {
 	int auto_flag;
+	int quiet;
 } opts;
 
 static int maintenance_task_gc(struct repository *r)
@@ -723,6 +724,8 @@ static int maintenance_task_gc(struct repository *r)
 
 	if (opts.auto_flag)
 		argv_array_pushl(&cmd, "--auto", NULL);
+	if (opts.quiet)
+		argv_array_pushl(&cmd, "--quiet", NULL);
 
 	close_object_store(r->objects);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
@@ -743,6 +746,8 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 	static struct option builtin_maintenance_options[] = {
 		OPT_BOOL(0, "auto", &opts.auto_flag,
 			 N_("run tasks based on the state of the repository")),
+		OPT_BOOL(0, "quiet", &opts.quiet,
+			 N_("do not report progress or other information over stderr")),
 		OPT_END()
 	};
 
@@ -752,6 +757,8 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_maintenance_usage,
 				   builtin_maintenance_options);
 
+	opts.quiet = !isatty(2);
+
 	argc = parse_options(argc, argv, prefix,
 			     builtin_maintenance_options,
 			     builtin_maintenance_usage,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index d00641c4dd..e4e4036e50 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -12,11 +12,13 @@ test_expect_success 'help text' '
 	test_i18ngrep "usage: git maintenance run" err
 '
 
-test_expect_success 'gc [--auto]' '
-	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
+test_expect_success 'gc [--auto|--quiet]' '
+	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
+	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
 	grep ",\"gc\"]" run-no-auto.txt  &&
-	grep ",\"gc\",\"--auto\"]" run-auto.txt
+	grep ",\"gc\",\"--auto\"" run-auto.txt &&
+	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH 07/21] maintenance: replace run_auto_gc()
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 06/21] maintenance: add --quiet option Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 08/21] maintenance: initialize task array and hashmap Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The run_auto_gc() method is used in several places to trigger a check
for repo maintenance after some Git commands, such as 'git commit' or
'git fetch'.

To allow for extra customization of this maintenance activity, replace
the 'git gc --auto [--quiet]' call with one to 'git maintenance run
--auto [--quiet]'. As we extend the maintenance builtin with other
steps, users will be able to select different maintenance activities.

Rename run_auto_gc() to run_auto_maintenance() to make it clearer what
this call does, and to expose all callers in the current diff.

Since 'git fetch' already allows disabling the 'git gc --auto'
subprocess, add an equivalent option with a different name to be more
descriptive of the new behavior: '--[no-]maintenance'. Update the
documentation to include these options at the same time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/fetch-options.txt | 5 +++--
 Documentation/git-clone.txt     | 7 ++++---
 builtin/am.c                    | 2 +-
 builtin/commit.c                | 2 +-
 builtin/fetch.c                 | 6 ++++--
 builtin/merge.c                 | 2 +-
 builtin/rebase.c                | 4 ++--
 run-command.c                   | 7 +++++--
 run-command.h                   | 2 +-
 t/t5510-fetch.sh                | 2 +-
 10 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 6e2a160a47..d73224844e 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -86,9 +86,10 @@ ifndef::git-pull[]
 	Allow several <repository> and <group> arguments to be
 	specified. No <refspec>s may be specified.
 
+--[no-]maintenance::
 --[no-]auto-gc::
-	Run `git gc --auto` at the end to perform garbage collection
-	if needed. This is enabled by default.
+	Run `git maintenance run --auto` at the end to perform garbage
+	collection if needed. This is enabled by default.
 
 --[no-]write-commit-graph::
 	Write a commit-graph after fetching. This overrides the config
diff --git a/Documentation/git-clone.txt b/Documentation/git-clone.txt
index c898310099..aa25aba7d9 100644
--- a/Documentation/git-clone.txt
+++ b/Documentation/git-clone.txt
@@ -78,9 +78,10 @@ repository using this option and then delete branches (or use any
 other Git command that makes any existing commit unreferenced) in the
 source repository, some objects may become unreferenced (or dangling).
 These objects may be removed by normal Git operations (such as `git commit`)
-which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
-If these objects are removed and were referenced by the cloned repository,
-then the cloned repository will become corrupt.
+which automatically call `git maintenance run --auto` and `git gc --auto`.
+(See linkgit:git-maintenance[1] and linkgit:git-gc[1].) If these objects
+are removed and were referenced by the cloned repository, then the cloned
+repository will become corrupt.
 +
 Note that running `git repack` without the `--local` option in a repository
 cloned with `--shared` will copy objects from the source repository into a pack
diff --git a/builtin/am.c b/builtin/am.c
index 69e50de018..ff895125f6 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -1795,7 +1795,7 @@ static void am_run(struct am_state *state, int resume)
 	if (!state->rebasing) {
 		am_destroy(state);
 		close_object_store(the_repository->objects);
-		run_auto_gc(state->quiet);
+		run_auto_maintenance(state->quiet);
 	}
 }
 
diff --git a/builtin/commit.c b/builtin/commit.c
index d1b7396052..658b158659 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1702,7 +1702,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 	git_test_write_commit_graph_or_die();
 
 	repo_rerere(the_repository, 0);
-	run_auto_gc(quiet);
+	run_auto_maintenance(quiet);
 	run_commit_hook(use_editor, get_index_file(), "post-commit", NULL);
 	if (amend && !no_post_rewrite) {
 		commit_post_rewrite(the_repository, current_head, &oid);
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 82ac4be8a5..49a4d727d4 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
 	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
 			N_("report that we have only objects reachable from this object")),
 	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+	OPT_BOOL(0, "maintenance", &enable_auto_gc,
+		 N_("run 'maintenance --auto' after fetching")),
 	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
-		 N_("run 'gc --auto' after fetching")),
+		 N_("run 'maintenance --auto' after fetching")),
 	OPT_BOOL(0, "show-forced-updates", &fetch_show_forced_updates,
 		 N_("check for forced-updates on all updated branches")),
 	OPT_BOOL(0, "write-commit-graph", &fetch_write_commit_graph,
@@ -1882,7 +1884,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	close_object_store(the_repository->objects);
 
 	if (enable_auto_gc)
-		run_auto_gc(verbosity < 0);
+		run_auto_maintenance(verbosity < 0);
 
 	return result;
 }
diff --git a/builtin/merge.c b/builtin/merge.c
index 7da707bf55..c068e73037 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -457,7 +457,7 @@ static void finish(struct commit *head_commit,
 			 * user should see them.
 			 */
 			close_object_store(the_repository->objects);
-			run_auto_gc(verbosity < 0);
+			run_auto_maintenance(verbosity < 0);
 		}
 	}
 	if (new_head && show_diffstat) {
diff --git a/builtin/rebase.c b/builtin/rebase.c
index 37ba76ac3d..0c4ee98f08 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -728,10 +728,10 @@ static int finish_rebase(struct rebase_options *opts)
 	apply_autostash(state_dir_path("autostash", opts));
 	close_object_store(the_repository->objects);
 	/*
-	 * We ignore errors in 'gc --auto', since the
+	 * We ignore errors in 'git maintenance run --auto', since the
 	 * user should see them.
 	 */
-	run_auto_gc(!(opts->flags & (REBASE_NO_QUIET|REBASE_VERBOSE)));
+	run_auto_maintenance(!(opts->flags & (REBASE_NO_QUIET|REBASE_VERBOSE)));
 	if (opts->type == REBASE_MERGE) {
 		struct replay_opts replay = REPLAY_OPTS_INIT;
 
diff --git a/run-command.c b/run-command.c
index 9b3a57d1e3..82ad241638 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
 	return result;
 }
 
-int run_auto_gc(int quiet)
+int run_auto_maintenance(int quiet)
 {
 	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
 	int status;
 
-	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
+	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
 	if (quiet)
 		argv_array_push(&argv_gc_auto, "--quiet");
+	else
+		argv_array_push(&argv_gc_auto, "--no-quiet");
+
 	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
 	argv_array_clear(&argv_gc_auto);
 	return status;
diff --git a/run-command.h b/run-command.h
index 191dfcdafe..d9a800e700 100644
--- a/run-command.h
+++ b/run-command.h
@@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
 /*
  * Trigger an auto-gc
  */
-int run_auto_gc(int quiet);
+int run_auto_maintenance(int quiet);
 
 #define RUN_COMMAND_NO_STDIN 1
 #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index a66dbe0bde..9850ecde5d 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
 		git config fetch.unpackLimit 1 &&
 		git config gc.autoPackLimit 1 &&
 		git config gc.autoDetach false &&
-		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
+		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
 		test_i18ngrep "Auto packing the repository" fetch.out &&
 		! grep "Should I try again" fetch.out
 	)
-- 
gitgitgadget



* [PATCH 08/21] maintenance: initialize task array and hashmap
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 07/21] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-09  2:25   ` Jonathan Tan
  2020-07-07 14:21 ` [PATCH 09/21] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of implementing multiple maintenance tasks inside the
'maintenance' builtin, use a list and hashmap of structs to describe the
work to be done.

The struct maintenance_task stores the name of the task (as given by a
future command-line argument) along with a function pointer to its
implementation and a boolean for whether the step is enabled.

A list of pointers to these structs is initialized with the full list
of implemented tasks along with a default order. For now, this list only
contains the "gc" task. This task is also the only task enabled by
default.

This list is also inserted into a hashmap. This allows command-line
arguments to quickly find tasks by name, ignoring case. To
ensure this list and hashmap work well together, the list only contains
pointers to the struct information. This will allow a sort on the list
while preserving the hashmap data.
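
The pointer-list idea can be sketched in isolation as follows. This is
an illustration under simplifying assumptions, not the patch's code: a
linear case-insensitive scan stands in for Git's hashmap, and the names
storage, order, init_order(), find_task(), swap_order() and task_at()
are hypothetical. The point it shows is that reordering the run list
moves only pointers, so anything that remembered a struct's address by
name still refers to the same struct.

```c
#include <stddef.h>
#include <strings.h>	/* strcasecmp (POSIX) */

struct task {
	const char *name;
	int enabled;
};

/* The structs live here and never move. */
static struct task storage[] = {
	{ "gc", 1 },
	{ "commit-graph", 0 },
};
#define NUM_TASKS (sizeof(storage) / sizeof(storage[0]))

/* Run order: pointers only, so reordering leaves storage[] untouched. */
static struct task *order[NUM_TASKS];

void init_order(void)
{
	size_t i;

	for (i = 0; i < NUM_TASKS; i++)
		order[i] = &storage[i];
}

/* Stand-in for the hashmap lookup: case-insensitive match by name. */
struct task *find_task(const char *name)
{
	size_t i;

	for (i = 0; i < NUM_TASKS; i++)
		if (!strcasecmp(storage[i].name, name))
			return &storage[i];
	return NULL;
}

/* Reorder the run list by swapping two pointers. */
void swap_order(size_t a, size_t b)
{
	struct task *tmp = order[a];

	order[a] = order[b];
	order[b] = tmp;
}

/* Read back the task at a given position in the run order. */
struct task *task_at(size_t i)
{
	return order[i];
}
```

After init_order(), calling swap_order(0, 1) changes what task_at(0)
returns, while find_task("GC") keeps returning the same address as
before the swap.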

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 3881a99e9d..c143bf50df 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -705,6 +705,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
+#define MAX_NUM_TASKS 1
+
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
 	NULL
@@ -734,9 +736,67 @@ static int maintenance_task_gc(struct repository *r)
 	return result;
 }
 
+typedef int maintenance_task_fn(struct repository *r);
+
+struct maintenance_task {
+	struct hashmap_entry ent;
+	const char *name;
+	maintenance_task_fn *fn;
+	unsigned enabled:1;
+};
+
+static int task_entry_cmp(const void *unused_cmp_data,
+			  const struct hashmap_entry *eptr,
+			  const struct hashmap_entry *entry_or_key,
+			  const void *keydata)
+{
+	const struct maintenance_task *e1, *e2;
+	const char *name = keydata;
+
+	e1 = container_of(eptr, const struct maintenance_task, ent);
+	e2 = container_of(entry_or_key, const struct maintenance_task, ent);
+
+	return strcasecmp(e1->name, name ? name : e2->name);
+}
+
+struct maintenance_task *tasks[MAX_NUM_TASKS];
+int num_tasks;
+struct hashmap task_map;
+
 static int maintenance_run(struct repository *r)
 {
-	return maintenance_task_gc(r);
+	int i;
+	int result = 0;
+
+	for (i = 0; !result && i < num_tasks; i++) {
+		if (!tasks[i]->enabled)
+			continue;
+		result = tasks[i]->fn(r);
+	}
+
+	return result;
+}
+
+static void initialize_tasks(void)
+{
+	int i;
+	num_tasks = 0;
+
+	for (i = 0; i < MAX_NUM_TASKS; i++)
+		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
+
+	tasks[num_tasks]->name = "gc";
+	tasks[num_tasks]->fn = maintenance_task_gc;
+	tasks[num_tasks]->enabled = 1;
+	num_tasks++;
+
+	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
+
+	for (i = 0; i < num_tasks; i++) {
+		hashmap_entry_init(&tasks[i]->ent,
+				   strihash(tasks[i]->name));
+		hashmap_add(&task_map, &tasks[i]->ent);
+	}
 }
 
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
@@ -758,6 +818,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 				   builtin_maintenance_options);
 
 	opts.quiet = !isatty(2);
+	initialize_tasks();
 
 	argc = parse_options(argc, argv, prefix,
 			     builtin_maintenance_options,
-- 
gitgitgadget



* [PATCH 09/21] maintenance: add commit-graph task
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 08/21] maintenance: initialize task array and hashmap Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-09  2:29   ` Jonathan Tan
  2020-07-07 14:21 ` [PATCH 10/21] maintenance: add --task option Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The first new task in the 'git maintenance' builtin is the
'commit-graph' job. It is based on the sequence of events in the
'commit-graph' job in Scalar [1]. This sequence is as follows:

1. git commit-graph write --reachable --split
2. git commit-graph verify --shallow
3. If the verify succeeds, stop.
4. Delete the commit-graph-chain file.
5. git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes most
of the time, with the longer writes following a power law
distribution.
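
The merge rule in the previous paragraph can be written down as a small
loop. This is a sketch derived only from that sentence, assuming a size
multiple of 2 (the "half" above); the function name merge_layers() and
its signature are hypothetical, and the real commit-graph code applies
further limits that are not modeled here.

```c
#include <stddef.h>

/*
 * layers[0..n-1] hold the commit counts of the existing layers,
 * bottom layer first. new_commits is the size of the layer about to
 * be written. Layers are folded into the new top layer until the top
 * is less than half the size of the layer below it.
 *
 * Returns the number of existing layers left untouched and stores the
 * final size of the new top layer in *top_size.
 */
size_t merge_layers(const size_t *layers, size_t n,
		    size_t new_commits, size_t *top_size)
{
	size_t top = new_commits;

	/* "top < below / 2" fails exactly when "below <= 2 * top". */
	while (n > 0 && layers[n - 1] <= 2 * top) {
		top += layers[n - 1];
		n--;
	}

	*top_size = top;
	return n;
}
```

For example, with existing layers of 1000, 100 and 10 commits, adding 6
commits absorbs only the 10-commit layer (the top becomes 16, which is
less than half of 100), so most writes touch a small amount of data.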

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will very likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

By using 'git commit-graph verify --shallow' we can ensure that
the file we just wrote is valid. This is an extra safety precaution
that is faster than our 'write' subcommand. In the rare situation
that the newest layer of the commit-graph is corrupt, we can "fix"
the corruption by deleting the commit-graph-chain file and rewriting
the full commit-graph as a new one-layer commit graph. This does
not completely prevent _that_ file from being corrupt, but it does
recompute the commit-graph by parsing commits from the object
database. In our use of this step in Scalar and VFS for Git, we
have only seen this issue arise because our microsoft/git fork
reverted 43d3561 ("commit-graph write: don't die if the existing
graph is corrupt" 2019-03-25) for a while to keep commit-graph
writes very fast. We dropped the revert when updating to v2.23.0.
The verify still has potential for catching corrupt data across
the layer boundary: if the new file has commit X with parent Y
in an old file but the commit ID for Y in the old file had a
bitswap, then we will notice that in the 'verify' command.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 18 ++++++++
 builtin/gc.c                      | 76 ++++++++++++++++++++++++++++++-
 commit-graph.c                    |  8 ++--
 commit-graph.h                    |  1 +
 t/t7900-maintenance.sh            |  2 +-
 5 files changed, 99 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 089fa4cedc..35b0be7d40 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -35,6 +35,24 @@ run::
 TASKS
 -----
 
+commit-graph::
+	The `commit-graph` job updates the `commit-graph` files incrementally,
+	then verifies that the written data is correct. If the new layer has an
+	issue, then the chain file is removed and the `commit-graph` is
+	rewritten from scratch.
++
+The verification only checks the top layer of the `commit-graph` chain.
+If the incremental write merged the new commits with at least one
+existing layer, then there is potential for on-disk corruption being
+carried forward into the new file. This will be noticed and the new
+commit-graph file will be clean as Git reparses the commit data from
+the object database.
++
+The incremental write is safe to run alongside concurrent Git processes
+since it will not expire `.graph` files that were in the previous
+`commit-graph-chain` file. They will be deleted by a later run based on
+the expiration delay.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index c143bf50df..a6b080627f 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -705,7 +705,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 1
+#define MAX_NUM_TASKS 2
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -717,6 +717,76 @@ struct maintenance_opts {
 	int quiet;
 } opts;
 
+static int run_write_commit_graph(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "commit-graph", "write",
+			 "--split", "--reachable",
+			 NULL);
+
+	if (opts.quiet)
+		argv_array_pushl(&cmd, "--no-progress", NULL);
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int run_verify_commit_graph(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "commit-graph", "verify",
+			 "--shallow", NULL);
+
+	if (opts.quiet)
+		argv_array_pushl(&cmd, "--no-progress", NULL);
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int maintenance_task_commit_graph(struct repository *r)
+{
+	char *chain_path;
+
+	/* Skip commit-graph when --auto is specified. */
+	if (opts.auto_flag)
+		return 0;
+
+	close_object_store(r->objects);
+	if (run_write_commit_graph(r)) {
+		error(_("failed to write commit-graph"));
+		return 1;
+	}
+
+	if (!run_verify_commit_graph(r))
+		return 0;
+
+	warning(_("commit-graph verify caught error, rewriting"));
+
+	chain_path = get_commit_graph_chain_filename(r->objects->odb);
+	if (unlink(chain_path)) {
+		UNLEAK(chain_path);
+		die(_("failed to remove commit-graph at %s"), chain_path);
+	}
+	free(chain_path);
+
+	if (!run_write_commit_graph(r))
+		return 0;
+
+	error(_("failed to rewrite commit-graph"));
+	return 1;
+}
+
 static int maintenance_task_gc(struct repository *r)
 {
 	int result;
@@ -790,6 +860,10 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->enabled = 1;
 	num_tasks++;
 
+	tasks[num_tasks]->name = "commit-graph";
+	tasks[num_tasks]->fn = maintenance_task_commit_graph;
+	num_tasks++;
+
 	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
 
 	for (i = 0; i < num_tasks; i++) {
diff --git a/commit-graph.c b/commit-graph.c
index fdd1c4fa7c..57278a9ab5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -172,7 +172,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
 		       oid_hex);
 }
 
-static char *get_chain_filename(struct object_directory *odb)
+char *get_commit_graph_chain_filename(struct object_directory *odb)
 {
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
@@ -520,7 +520,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	struct stat st;
 	struct object_id *oids;
 	int i = 0, valid = 1, count;
-	char *chain_name = get_chain_filename(odb);
+	char *chain_name = get_commit_graph_chain_filename(odb);
 	FILE *fp;
 	int stat_res;
 
@@ -1635,7 +1635,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	if (ctx->split) {
-		char *lock_name = get_chain_filename(ctx->odb);
+		char *lock_name = get_commit_graph_chain_filename(ctx->odb);
 
 		hold_lock_file_for_update_mode(&lk, lock_name,
 					       LOCK_DIE_ON_ERROR, 0444);
@@ -2012,7 +2012,7 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	if (ctx->split_opts && ctx->split_opts->expire_time)
 		expire_time = ctx->split_opts->expire_time;
 	if (!ctx->split) {
-		char *chain_file_name = get_chain_filename(ctx->odb);
+		char *chain_file_name = get_commit_graph_chain_filename(ctx->odb);
 		unlink(chain_file_name);
 		free(chain_file_name);
 		ctx->num_commit_graphs_after = 0;
diff --git a/commit-graph.h b/commit-graph.h
index 28f89cdf3e..3c202748c3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -25,6 +25,7 @@ struct commit;
 struct bloom_filter_settings;
 
 char *get_commit_graph_filename(struct object_directory *odb);
+char *get_commit_graph_chain_filename(struct object_directory *odb);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
 
 /*
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index e4e4036e50..216ac0b19e 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -12,7 +12,7 @@ test_expect_success 'help text' '
 	test_i18ngrep "usage: git maintenance run" err
 '
 
-test_expect_success 'gc [--auto|--quiet]' '
+test_expect_success 'run [--auto|--quiet]' '
 	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
 	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
-- 
gitgitgadget



* [PATCH 10/21] maintenance: add --task option
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 09/21] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 11/21] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A user may want to only run certain maintenance tasks in a certain
order. Add the --task=<task> option, which allows a user to specify an
ordered list of tasks to run. Each task can be selected at most once,
however.

Here is where our array of maintenance_task pointers becomes critical.
We can sort the array of pointers based on the task order, but we do not
want to move the struct data itself in order to preserve the hashmap
references. We use the hashmap to map each --task=<task> argument to
its task struct data.
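
Sorting an array of pointers with the standard qsort() has one detail
worth spelling out: the comparator is handed pointers to the array
elements, and here each element is itself a pointer, so it has to be
dereferenced once before the struct can be read. The sketch below
assumes plain qsort() rather than Git's QSORT() macro; struct task,
by_task_order() and sort_tasks() are illustrative names. Unselected
tasks keep task_order 0 and sort to the front, where the run loop is
assumed to skip them.

```c
#include <stdlib.h>

struct task {
	const char *name;
	int task_order;	/* 0 = not selected, else 1-based position */
};

/*
 * qsort() passes pointers to the array elements. The elements are
 * "struct task *", so a_ and b_ are really "struct task **".
 */
static int by_task_order(const void *a_, const void *b_)
{
	const struct task *a = *(const struct task *const *)a_;
	const struct task *b = *(const struct task *const *)b_;

	/* Ascending: first selected task runs first. */
	return a->task_order - b->task_order;
}

/* Sort the pointer array in place; the structs themselves do not move. */
void sort_tasks(struct task **tasks, size_t n)
{
	qsort(tasks, n, sizeof(*tasks), by_task_order);
}
```

Given "--task=commit-graph --task=gc", commit-graph would get
task_order 1 and gc would get 2, so after sort_tasks() the commit-graph
pointer precedes the gc pointer.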

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  4 ++
 builtin/gc.c                      | 62 ++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            | 23 ++++++++++++
 3 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 35b0be7d40..9204762e21 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -73,6 +73,10 @@ OPTIONS
 --quiet::
 	Do not report progress or other information over `stderr`.
 
+--task=<task>::
+	If this option is specified one or more times, then only run the
+	specified tasks in the specified order.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/gc.c b/builtin/gc.c
index a6b080627f..8f2143862c 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -715,6 +715,7 @@ static const char * const builtin_maintenance_usage[] = {
 struct maintenance_opts {
 	int auto_flag;
 	int quiet;
+	int tasks_selected;
 } opts;
 
 static int run_write_commit_graph(struct repository *r)
@@ -812,7 +813,9 @@ struct maintenance_task {
 	struct hashmap_entry ent;
 	const char *name;
 	maintenance_task_fn *fn;
-	unsigned enabled:1;
+	int task_order;
+	unsigned enabled:1,
+		 selected:1;
 };
 
 static int task_entry_cmp(const void *unused_cmp_data,
@@ -833,14 +836,30 @@ struct maintenance_task *tasks[MAX_NUM_TASKS];
 int num_tasks;
 struct hashmap task_map;
 
+static int compare_tasks_by_selection(const void *a_, const void *b_)
+{
+	const struct maintenance_task *a, *b;
+	a = *(const struct maintenance_task * const *)a_;
+	b = *(const struct maintenance_task * const *)b_;
+
+	return a->task_order - b->task_order;
+}
+
 static int maintenance_run(struct repository *r)
 {
 	int i;
 	int result = 0;
 
+	if (opts.tasks_selected)
+		QSORT(tasks, num_tasks, compare_tasks_by_selection);
+
 	for (i = 0; !result && i < num_tasks; i++) {
-		if (!tasks[i]->enabled)
+		if (opts.tasks_selected && !tasks[i]->selected)
+			continue;
+
+		if (!opts.tasks_selected && !tasks[i]->enabled)
 			continue;
+
 		result = tasks[i]->fn(r);
 	}
 
@@ -873,6 +892,42 @@ static void initialize_tasks(void)
 	}
 }
 
+static int task_option_parse(const struct option *opt,
+			     const char *arg, int unset)
+{
+	struct maintenance_task *task;
+	struct maintenance_task key;
+
+	BUG_ON_OPT_NEG(unset);
+
+	if (!arg || !strlen(arg)) {
+		error(_("--task requires a value"));
+		return 1;
+	}
+
+	opts.tasks_selected++;
+
+	key.name = arg;
+	hashmap_entry_init(&key.ent, strihash(key.name));
+
+	task = hashmap_get_entry(&task_map, &key, ent, NULL);
+
+	if (!task) {
+		error(_("'%s' is not a valid task"), arg);
+		return 1;
+	}
+
+	if (task->selected) {
+		error(_("task '%s' cannot be selected multiple times"), arg);
+		return 1;
+	}
+
+	task->selected = 1;
+	task->task_order = opts.tasks_selected;
+
+	return 0;
+}
+
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
 {
 	struct repository *r = the_repository;
@@ -882,6 +937,9 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 			 N_("run tasks based on the state of the repository")),
 		OPT_BOOL(0, "quiet", &opts.quiet,
 			 N_("do not report progress or other information over stderr")),
+		OPT_CALLBACK_F(0, "task", NULL, N_("task"),
+			N_("run a specific task"),
+			PARSE_OPT_NONEG, task_option_parse),
 		OPT_END()
 	};
 
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 216ac0b19e..c09a9eb90b 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -21,4 +21,27 @@ test_expect_success 'run [--auto|--quiet]' '
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
+test_expect_success 'run --task=<task>' '
+	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
+	GIT_TRACE2_EVENT="$(pwd)/run-gc.txt" git maintenance run --task=gc &&
+	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
+	GIT_TRACE2_EVENT="$(pwd)/run-both.txt" git maintenance run --task=commit-graph --task=gc &&
+	! grep ",\"gc\"" run-commit-graph.txt  &&
+	grep ",\"gc\"" run-gc.txt  &&
+	grep ",\"gc\"" run-both.txt  &&
+	grep ",\"commit-graph\",\"write\"" run-commit-graph.txt  &&
+	! grep ",\"commit-graph\",\"write\"" run-gc.txt  &&
+	grep ",\"commit-graph\",\"write\"" run-both.txt
+'
+
+test_expect_success 'run --task=bogus' '
+	test_must_fail git maintenance run --task=bogus 2>err &&
+	test_i18ngrep "is not a valid task" err
+'
+
+test_expect_success 'run --task duplicate' '
+	test_must_fail git maintenance run --task=gc --task=gc 2>err &&
+	test_i18ngrep "cannot be selected multiple times" err
+'
+
 test_done
-- 
gitgitgadget
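
The "specified order" behavior documented for --task can be sketched in
isolation. This is a minimal illustration, not the code from builtin/gc.c:
the demo_* names and the trimmed-down struct are hypothetical, and plain
qsort() stands in for Git's QSORT() macro. It assumes, as in the patch, that
the task table is an array of pointers, so the comparator receives pointers
to those pointers.

```c
#include <stdlib.h>

/* Hypothetical, trimmed-down task record; not the struct from builtin/gc.c. */
struct demo_task {
	const char *name;
	int task_order;      /* 1-based position of --task=<name>, 0 if absent */
	unsigned selected:1;
};

/*
 * qsort() hands the comparator pointers to array elements. The array
 * holds pointers to tasks, so each argument is a pointer to a pointer.
 */
static int demo_compare_by_order(const void *a_, const void *b_)
{
	const struct demo_task *a = *(const struct demo_task *const *)a_;
	const struct demo_task *b = *(const struct demo_task *const *)b_;

	/* Ascending: the task named first on the command line sorts first. */
	return a->task_order - b->task_order;
}

/* Sort the table so that selected tasks appear in command-line order. */
static void demo_sort_tasks(struct demo_task **tasks, size_t nr)
{
	qsort(tasks, nr, sizeof(*tasks), demo_compare_by_order);
}
```

Unselected tasks keep task_order 0 and therefore sort to the front, where a
run loop that checks the selected bit would skip them.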


* [PATCH 11/21] maintenance: take a lock on the objects directory
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 10/21] maintenance: add --task option Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 12/21] maintenance: add fetch task Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Performing maintenance on a Git repository involves writing data to the
.git directory, which is not safe to do with multiple writers attempting
the same operation. Ensure that only one 'git maintenance' process is
running at a time by holding a file-based lock. The mere presence of
the .git/objects/maintenance.lock file will prevent future maintenance. This
lock is never committed, since it does not represent meaningful data.
Instead, it is only a placeholder.

If the lock file already exists, then skip maintenance and exit
successfully, reporting the skip only when neither --auto nor --quiet
is given. This will become very important later when we implement the
'fetch' task, as this is our safeguard against creating a recursive
process loop between 'git fetch' and 'git maintenance run'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
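
For readers unfamiliar with file-based locks, here is a minimal sketch of
the idea using plain POSIX calls. It is not Git's lockfile API (the patch
uses hold_lock_file_for_update() and rollback_lock_file()); the demo_*
helpers are hypothetical and only illustrate "exclusive create, then
delete without committing".

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to create <dir>/maintenance.lock exclusively.
 * Returns an open file descriptor on success, or -1 if the file
 * already exists or cannot be created.
 */
static int demo_take_lock(const char *dir, char *path, size_t pathlen)
{
	int n = snprintf(path, pathlen, "%s/maintenance.lock", dir);

	if (n < 0 || (size_t)n >= pathlen)
		return -1;
	/* O_EXCL makes creation fail when the file already exists. */
	return open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
}

/* Drop the lock without committing it: close and delete the file. */
static void demo_release_lock(int fd, const char *path)
{
	close(fd);
	unlink(path);
}
```

A second caller that finds the file present gets -1 and can skip its work,
which is the behavior the recursion guard relies on.
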
 builtin/gc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index 8f2143862c..e3c634fc3b 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -849,6 +849,24 @@ static int maintenance_run(struct repository *r)
 {
 	int i;
 	int result = 0;
+	struct lock_file lk;
+	char *lock_path = xstrfmt("%s/maintenance", r->objects->odb->path);
+
+	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
+		/*
+		 * Another maintenance command is running.
+		 *
+		 * If --auto was provided, then it is likely due to a
+		 * recursive process stack. Do not report an error in
+		 * that case.
+		 */
+		if (!opts.auto_flag && !opts.quiet)
+			error(_("lock file '%s' exists, skipping maintenance"),
+			      lock_path);
+		free(lock_path);
+		return 0;
+	}
+	free(lock_path);
 
 	if (opts.tasks_selected)
 		QSORT(tasks, num_tasks, compare_tasks_by_selection);
@@ -863,6 +881,7 @@ static int maintenance_run(struct repository *r)
 		result = tasks[i]->fn(r);
 	}
 
+	rollback_lock_file(&lk);
 	return result;
 }
 
-- 
gitgitgadget


* [PATCH 12/21] maintenance: add fetch task
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 11/21] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-09  2:35   ` Jonathan Tan
  2020-07-07 14:21 ` [PATCH 13/21] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.

3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/hidden/<remote> refs that no
   longer appear on the remote.

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paired with --no-show-forced-updates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
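
To make the four options above concrete, here is a small sketch of how the
per-remote arguments could be assembled. The demo_* helpers are
hypothetical and use only the C standard library; the patch itself builds
the list with argv_array and strbuf.

```c
#include <stdio.h>

/*
 * Format the refspec that stores a remote's branches under
 * refs/hidden/<remote>/. Returns 0 on success, -1 if buf is too small.
 */
static int demo_hidden_refspec(char *buf, size_t len, const char *remote)
{
	int n = snprintf(buf, len, "+refs/heads/*:refs/hidden/%s/*", remote);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Fill argv with the background-fetch arguments in the order the
 * commit message lists them. argv must have room for 7 entries.
 */
static void demo_fetch_args(const char **argv, const char *remote,
			    const char *refspec)
{
	int i = 0;

	argv[i++] = "fetch";
	argv[i++] = remote;
	argv[i++] = "--prune";
	argv[i++] = "--no-tags";
	argv[i++] = "--refmap=";   /* empty value drops the configured refspec */
	argv[i++] = refspec;
	argv[i] = NULL;
}
```

For a remote named "origin" this corresponds to running
'git fetch origin --prune --no-tags --refmap= "+refs/heads/*:refs/hidden/origin/*"'.
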
 Documentation/git-maintenance.txt | 12 ++++++
 builtin/gc.c                      | 65 ++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            | 24 ++++++++++++
 3 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 9204762e21..e0be3f520f 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
 `commit-graph-chain` file. They will be deleted by a later run based on
 the expiration delay.
 
+fetch::
+	The `fetch` job updates the object directory with the latest objects
+	from all registered remotes. For each remote, a `git fetch` command
+	is run. The refmap is custom to avoid updating local or remote
+	branches (those in `refs/heads` or `refs/remotes`). Instead, the
+	remote refs are stored in `refs/hidden/<remote>/`. Also, no tags are
+	updated.
++
+This means that foreground fetches are still required to update the
+remote refs, but the user is notified when the branches and tags are
+updated on the remote.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index e3c634fc3b..2d30ae758c 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -28,6 +28,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "promisor-remote.h"
+#include "remote.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -705,7 +706,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 2
+#define MAX_NUM_TASKS 3
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -788,6 +789,64 @@ static int maintenance_task_commit_graph(struct repository *r)
 	return 1;
 }
 
+static int fetch_remote(struct repository *r, const char *remote)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf refmap = STRBUF_INIT;
+
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "fetch", remote, "--prune",
+			 "--no-tags", "--refmap=", NULL);
+
+	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
+	argv_array_push(&cmd, refmap.buf);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--quiet");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	strbuf_release(&refmap);
+	return result;
+}
+
+static int fill_each_remote(struct remote *remote, void *cbdata)
+{
+	struct string_list *remotes = (struct string_list *)cbdata;
+
+	string_list_append(remotes, remote->name);
+	return 0;
+}
+
+static int maintenance_task_fetch(struct repository *r)
+{
+	int result = 0;
+	struct string_list_item *item;
+	struct string_list remotes = STRING_LIST_INIT_DUP;
+
+	if (for_each_remote(fill_each_remote, &remotes)) {
+		error(_("failed to fill remotes"));
+		result = 1;
+		goto cleanup;
+	}
+
+	/*
+	 * Do not modify the result based on the success of the 'fetch'
+	 * operation, as a loss of network could cause 'fetch' to fail
+	 * quickly. We do not want that to stop the rest of our
+	 * background operations.
+	 */
+	for (item = remotes.items;
+	     item && item < remotes.items + remotes.nr;
+	     item++)
+		fetch_remote(r, item->string);
+
+cleanup:
+	string_list_clear(&remotes, 0);
+	return result;
+}
+
 static int maintenance_task_gc(struct repository *r)
 {
 	int result;
@@ -893,6 +952,10 @@ static void initialize_tasks(void)
 	for (i = 0; i < MAX_NUM_TASKS; i++)
 		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
 
+	tasks[num_tasks]->name = "fetch";
+	tasks[num_tasks]->fn = maintenance_task_fetch;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index c09a9eb90b..0abfc4a9da 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -44,4 +44,28 @@ test_expect_success 'run --task duplicate' '
 	test_i18ngrep "cannot be selected multiple times" err
 '
 
+test_expect_success 'run --task=fetch with no remotes' '
+	git maintenance run --task=fetch 2>err &&
+	test_must_be_empty err
+'
+
+test_expect_success 'fetch multiple remotes' '
+	git clone . clone1 &&
+	git clone . clone2 &&
+	git remote add remote1 "file://$(pwd)/clone1" &&
+	git remote add remote2 "file://$(pwd)/clone2" &&
+	git -C clone1 switch -c one &&
+	git -C clone2 switch -c two &&
+	test_commit -C clone1 one &&
+	test_commit -C clone2 two &&
+	GIT_TRACE2_EVENT="$(pwd)/run-fetch.txt" git maintenance run --task=fetch &&
+	grep ",\"fetch\",\"remote1\"" run-fetch.txt &&
+	grep ",\"fetch\",\"remote2\"" run-fetch.txt &&
+	test_path_is_missing .git/refs/remotes &&
+	test_cmp clone1/.git/refs/heads/one .git/refs/hidden/remote1/one &&
+	test_cmp clone2/.git/refs/heads/two .git/refs/hidden/remote2/two &&
+	git log hidden/remote1/one &&
+	git log hidden/remote2/two
+'
+
 test_done
-- 
gitgitgadget


* [PATCH 13/21] maintenance: add loose-objects task
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 12/21] maintenance: add fetch task Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 14/21] maintenance: add pack-files task Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file. This has potential of
   happening in partial clones if someone runs 'git grep' or
   otherwise evades the batch-download feature for requesting
   promisor objects.

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
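
The batching in step 2 relies on an iterator that stops as soon as its
callback returns nonzero. A minimal sketch of that pattern follows; the
demo_* names are hypothetical stand-ins for the loose-object walk, and the
sketch assumes the batch should stop at exactly batch_size entries.

```c
#include <stddef.h>

/* Hypothetical callback type: return nonzero to stop the walk early. */
typedef int (*demo_each_fn)(const char *name, void *data);

struct demo_batch {
	int count;
	int batch_size;
};

/* Count one object and ask the walk to stop once the batch is full. */
static int demo_add_to_batch(const char *name, void *data)
{
	struct demo_batch *b = data;

	(void)name; /* a real callback would write the object name out */
	return ++b->count >= b->batch_size;
}

/* Visit names in order until the callback returns nonzero. */
static int demo_for_each(const char **names, size_t nr,
			 demo_each_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(names[i], data);

		if (ret)
			return ret;
	}
	return 0;
}
```

A nonzero return from the walk also works as an "is there anything at
all?" probe when the callback returns 1 unconditionally, which is how the
patch avoids starting pack-objects on an empty object directory.
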
 Documentation/git-maintenance.txt |  11 +++
 builtin/gc.c                      | 107 +++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            |  35 ++++++++++
 3 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index e0be3f520f..bf792d446f 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -73,6 +73,17 @@ gc::
 	It can also be disruptive in some situations, as it deletes stale
 	data.
 
+loose-objects::
+	The `loose-objects` job cleans up loose objects and places them into
+	pack-files. In order to prevent race conditions with concurrent Git
+	commands, it follows a two-step process. First, it deletes any loose
+	objects that already exist in a pack-file; concurrent Git processes
+	will examine the pack-file for the object data instead of the loose
+	object. Second, it creates a new pack-file (starting with "loose-")
+	containing a batch of loose objects. The batch size is limited to 50
+	thousand objects to prevent the job from taking too long on a
+	repository with many loose objects.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 2d30ae758c..dda71fe39c 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -706,7 +706,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 3
+#define MAX_NUM_TASKS 4
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -866,6 +866,107 @@ static int maintenance_task_gc(struct repository *r)
 	return result;
 }
 
+
+static int prune_packed(struct repository *r)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "-C", r->worktree, "prune-packed", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--quiet");
+
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+struct write_loose_object_data {
+	FILE *in;
+	int count;
+	int batch_size;
+};
+
+static int loose_object_exists(const struct object_id *oid,
+			       const char *path,
+			       void *data)
+{
+	return 1;
+}
+
+static int write_loose_object_to_stdin(const struct object_id *oid,
+				       const char *path,
+				       void *data)
+{
+	struct write_loose_object_data *d = (struct write_loose_object_data *)data;
+
+	fprintf(d->in, "%s\n", oid_to_hex(oid));
+
+	return ++(d->count) >= d->batch_size;
+}
+
+static int pack_loose(struct repository *r)
+{
+	int result = 0;
+	struct write_loose_object_data data;
+	struct strbuf prefix = STRBUF_INIT;
+	struct child_process *pack_proc;
+
+	/*
+	 * Do not start pack-objects process
+	 * if there are no loose objects.
+	 */
+	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
+					   loose_object_exists,
+					   NULL, NULL, NULL))
+		return 0;
+
+	pack_proc = xmalloc(sizeof(*pack_proc));
+
+	child_process_init(pack_proc);
+
+	strbuf_addstr(&prefix, r->objects->odb->path);
+	strbuf_addstr(&prefix, "/pack/loose");
+
+	argv_array_pushl(&pack_proc->args, "git", "-C", r->worktree,
+			 "pack-objects", NULL);
+	if (opts.quiet)
+		argv_array_push(&pack_proc->args, "--quiet");
+	argv_array_push(&pack_proc->args, prefix.buf);
+
+	pack_proc->in = -1;
+
+	if (start_command(pack_proc)) {
+		error(_("failed to start 'git pack-objects' process"));
+		result = 1;
+		goto cleanup;
+	}
+
+	data.in = xfdopen(pack_proc->in, "w");
+	data.count = 0;
+	data.batch_size = 50000;
+
+	for_each_loose_file_in_objdir(r->objects->odb->path,
+				      write_loose_object_to_stdin,
+				      NULL,
+				      NULL,
+				      &data);
+
+	fclose(data.in);
+
+	if (finish_command(pack_proc)) {
+		error(_("failed to finish 'git pack-objects' process"));
+		result = 1;
+	}
+
+cleanup:
+	strbuf_release(&prefix);
+	free(pack_proc);
+	return result;
+}
+
+static int maintenance_task_loose_objects(struct repository *r)
+{
+	return prune_packed(r) || pack_loose(r);
+}
+
 typedef int maintenance_task_fn(struct repository *r);
 
 struct maintenance_task {
@@ -956,6 +1057,10 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->fn = maintenance_task_fetch;
 	num_tasks++;
 
+	tasks[num_tasks]->name = "loose-objects";
+	tasks[num_tasks]->fn = maintenance_task_loose_objects;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 0abfc4a9da..8cb33624c6 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -68,4 +68,39 @@ test_expect_success 'fetch multiple remotes' '
 	git log hidden/remote2/two
 '
 
+test_expect_success 'loose-objects task' '
+	# Repack everything so we know the state of the object dir
+	git repack -adk &&
+
+	# Hack to stop maintenance from running during "git commit"
+	echo in use >.git/objects/maintenance.lock &&
+	test_commit create-loose-object &&
+	rm .git/objects/maintenance.lock &&
+
+	ls .git/objects >obj-dir-before &&
+	test_file_not_empty obj-dir-before &&
+	ls .git/objects/pack/*.pack >packs-before &&
+	test_line_count = 1 packs-before &&
+
+	# The first run creates a pack-file
+	# but does not delete loose objects.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-between &&
+	test_cmp obj-dir-before obj-dir-between &&
+	ls .git/objects/pack/*.pack >packs-between &&
+	test_line_count = 2 packs-between &&
+
+	# The second run deletes loose objects
+	# but does not create a pack-file.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-after &&
+	cat >expect <<-\EOF &&
+	info
+	pack
+	EOF
+	test_cmp expect obj-dir-after &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_cmp packs-between packs-after
+'
+
 test_done
-- 
gitgitgadget


* [PATCH 14/21] maintenance: add pack-files task
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 13/21] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 15/21] maintenance: auto-size pack-files batch Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' task, which can be run safely in the background. Add a
similar job that performs the equivalent cleanup for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updates the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those pack-files. The set of
   pack-files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximates how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the pack-files job will delete these repacked
   pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

Each of the above steps updates the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined for up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
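
The "expected size" arithmetic and the greedy selection from step 3 can be
sketched as follows. This is an illustration of the description above, not
the code in midx.c: the demo_* names and struct fields are hypothetical,
the input is assumed to be already sorted by modified time, and the
special case of a batch size of zero is not modeled. The sketch multiplies
before dividing to limit integer truncation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pack summary; field names are illustrative. */
struct demo_pack {
	uint64_t pack_size;          /* bytes on disk */
	uint32_t num_objects;        /* objects stored in the pack */
	uint32_t referenced_objects; /* objects the multi-pack-index points at */
	int selected;
};

/* Pack size scaled by the fraction of its objects that are referenced. */
static uint64_t demo_expected_size(const struct demo_pack *p)
{
	if (!p->num_objects)
		return 0;
	return p->pack_size * p->referenced_objects / p->num_objects;
}

/*
 * Mark packs whose expected size is below batch_size until the running
 * total reaches batch_size. Returns the total expected size selected.
 */
static uint64_t demo_select_batch(struct demo_pack *packs, size_t nr,
				  uint64_t batch_size)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr && total < batch_size; i++) {
		uint64_t expected = demo_expected_size(&packs[i]);

		if (expected >= batch_size)
			continue; /* too large to be worth rewriting */
		packs[i].selected = 1;
		total += expected;
	}
	return total;
}
```

For example, a 1000-byte pack with 10 objects of which 5 are still
referenced has an expected size of 500 bytes, so it counts for only half
of its on-disk size toward the batch.
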
 Documentation/git-maintenance.txt |  15 ++++
 builtin/gc.c                      | 124 +++++++++++++++++++++++++++++-
 midx.c                            |   2 +-
 midx.h                            |   1 +
 t/t7900-maintenance.sh            |  37 +++++++++
 5 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index bf792d446f..945fda368b 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -84,6 +84,21 @@ loose-objects::
 	thousand objects to prevent the job from taking too long on a
 	repository with many loose objects.
 
+pack-files::
+	The `pack-files` job incrementally repacks the object directory
+	using the `multi-pack-index` feature. In order to prevent race
+	conditions with concurrent Git commands, it follows a two-step
+	process. First, it deletes any pack-files included in the
+	`multi-pack-index` where none of the objects in the
+	`multi-pack-index` reference those pack-files; this only happens
+	if all objects in the pack-file are also stored in a newer
+	pack-file. Second, it selects a group of pack-files whose "expected
+	size" is below the batch size until the group has total expected
+	size at least the batch size; see the `--batch-size` option for
+	the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
+	default batch-size is zero, which is a special case that attempts
+	to repack all pack-files into a single pack-file.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index dda71fe39c..259b0475c0 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -29,6 +29,7 @@
 #include "tree.h"
 #include "promisor-remote.h"
 #include "remote.h"
+#include "midx.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -706,7 +707,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 4
+#define MAX_NUM_TASKS 5
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -967,6 +968,123 @@ static int maintenance_task_loose_objects(struct repository *r)
 	return prune_packed(r) || pack_loose(r);
 }
 
+static int multi_pack_index_write(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "multi-pack-index", "write", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int rewrite_multi_pack_index(struct repository *r)
+{
+	char *midx_name = get_midx_filename(r->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	if (multi_pack_index_write(r)) {
+		error(_("failed to rewrite multi-pack-index"));
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_verify(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "multi-pack-index", "verify", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int multi_pack_index_expire(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "multi-pack-index", "expire", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	close_object_store(r->objects);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int multi_pack_index_repack(struct repository *r)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "-C", r->worktree,
+			 "multi-pack-index", "repack", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	argv_array_push(&cmd, "--batch-size=0");
+
+	close_object_store(r->objects);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	if (result && multi_pack_index_verify(r)) {
+		warning(_("multi-pack-index verify failed after repack"));
+		result = rewrite_multi_pack_index(r);
+	}
+
+	return result;
+}
+
+static int maintenance_task_pack_files(struct repository *r)
+{
+	if (multi_pack_index_write(r)) {
+		error(_("failed to write multi-pack-index"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify(r)) {
+		warning(_("multi-pack-index verify failed after initial write"));
+		return rewrite_multi_pack_index(r);
+	}
+
+	if (multi_pack_index_expire(r)) {
+		error(_("multi-pack-index expire failed"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify(r)) {
+		warning(_("multi-pack-index verify failed after expire"));
+		return rewrite_multi_pack_index(r);
+	}
+
+	if (multi_pack_index_repack(r)) {
+		error(_("multi-pack-index repack failed"));
+		return 1;
+	}
+
+	return 0;
+}
+
 typedef int maintenance_task_fn(struct repository *r);
 
 struct maintenance_task {
@@ -1061,6 +1179,10 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->fn = maintenance_task_loose_objects;
 	num_tasks++;
 
+	tasks[num_tasks]->name = "pack-files";
+	tasks[num_tasks]->fn = maintenance_task_pack_files;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/midx.c b/midx.c
index 6d1584ca51..57a8a00082 100644
--- a/midx.c
+++ b/midx.c
@@ -36,7 +36,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index b18cf53bc4..baeecc70c9 100644
--- a/midx.h
+++ b/midx.h
@@ -37,6 +37,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 8cb33624c6..a6be8456aa 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -103,4 +103,41 @@ test_expect_success 'loose-objects task' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'pack-files task' '
+	packDir=.git/objects/pack &&
+	for i in $(test_seq 1 5)
+	do
+		test_commit $i || return 1
+	done &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
+	test_tick &&
+	git pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+	test_tick &&
+	git pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+	rm -f $packDir/pack-* &&
+	rm -f $packDir/loose-* &&
+	ls $packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git maintenance run --task=pack-files &&
+	ls $packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git maintenance run --task=pack-files &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 15/21] maintenance: auto-size pack-files batch
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 14/21] maintenance: add pack-files task Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 16/21] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When repacking during the 'pack-files' job, we use the --batch-size
option in 'git multi-pack-index repack'. The initial setting used
--batch-size=0 to repack everything into a single pack-file. This is not
sustainable for a large repository. The amount of work required is also
likely to use too many system resources for a background job.

Update the 'pack-files' maintenance task by dynamically computing a
--batch-size option based on the current pack-file structure.

The dynamic default size is computed with this idea in mind for a client
repository that was cloned from a very large remote: there is likely one
"big" pack-file that was created at clone time. Thus, do not try
repacking it as it is likely packed efficiently by the server.

Instead, we select the second-largest pack-file, and create a batch size
that is one larger than that pack-file. If there are three or more
pack-files, then this guarantees that at least two will be combined into
a new pack-file.

Of course, this means that the second-largest pack-file size is likely
to grow over time and may eventually surpass the initially-cloned
pack-file. Recall that the pack-file batch is selected in a greedy
manner: the packs are considered from oldest to newest and are selected
if they have size smaller than the batch size until the total selected
size is larger than the batch size. Thus, that oldest "clone" pack will
be first to repack after the new data creates a pack larger than that.
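
The greedy selection described above can be sketched as follows. This
is a simplified illustration and not the code in midx.c: the function
name select_batch() and the plain array of sizes (ordered oldest
first) are hypothetical, and stopping as soon as the running total
reaches the batch size is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Walk the packs from oldest to newest. A pack is selected when it is
 * smaller than batch_size; selection stops once the running total has
 * reached batch_size. Returns the number of selected packs and records
 * the choice for each pack in selected[].
 */
static size_t select_batch(const int64_t *sizes, size_t n,
			   int64_t batch_size, unsigned char *selected)
{
	int64_t total = 0;
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		selected[i] = 0;
		if (total >= batch_size)
			continue;	/* batch is full; leave the rest alone */
		if (sizes[i] >= batch_size)
			continue;	/* too large for this batch */
		selected[i] = 1;
		total += sizes[i];
		count++;
	}

	return count;
}
```

With sizes {1000, 40, 25} and a batch size of 41, the first pack is
skipped and the two small packs are selected.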

We also want to place some limits on how large these pack-files become,
in order to bound the amount of time spent repacking. A maximum
batch-size of two gigabytes means that large repositories will never be
packed into a single pack-file using this job, but also that repack is
rather expensive. This is a trade-off that is valuable to have if the
maintenance is being run automatically or in the background. Users who
truly want to optimize for space and performance (and are willing to pay
the upfront cost of a full repack) can use the 'gc' task to do so.
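
As a standalone illustration of the sizing rule (one more than the
second-largest pack, capped at two gigabytes), here is a minimal
sketch that works on a plain array of sizes instead of the packed_git
list. The name auto_batch_size() and the BATCH_SIZE_CAP macro are
hypothetical; the cap value matches TWO_GIGABYTES in the patch below.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE_CAP INT64_C(2147483647)

/*
 * Return one more than the second-largest entry of sizes[0..n),
 * capped at BATCH_SIZE_CAP. With fewer than two packs the
 * second-largest size is treated as zero, so the result is 1.
 */
static int64_t auto_batch_size(const int64_t *sizes, size_t n)
{
	int64_t largest = 0, second = 0, result;
	size_t i;

	for (i = 0; i < n; i++) {
		if (sizes[i] > largest) {
			second = largest;
			largest = sizes[i];
		} else if (sizes[i] > second) {
			second = sizes[i];
		}
	}

	result = second + 1;
	return result > BATCH_SIZE_CAP ? BATCH_SIZE_CAP : result;
}
```

For sizes {1000, 40, 25} this yields 41, which is large enough to
batch the two small packs but not the big one.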

Reported-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c           | 47 +++++++++++++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh |  5 +++--
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 259b0475c0..582219156a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1032,20 +1032,65 @@ static int multi_pack_index_expire(struct repository *r)
 	return result;
 }
 
+#define TWO_GIGABYTES (2147483647)
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+
+static off_t get_auto_pack_size(struct repository *r)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly.
+	 *
+	 * The strategy we select here is to select a
+	 * size that is one more than the second largest
+	 * pack-file. This ensures that we will repack
+	 * at least two packs if there are three or more
+	 * packs.
+	 */
+	off_t max_size = 0;
+	off_t second_largest_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+
+	reprepare_packed_git(r);
+	for (p = get_all_packs(r); p; p = p->next) {
+		if (p->pack_size > max_size) {
+			second_largest_size = max_size;
+			max_size = p->pack_size;
+		} else if (p->pack_size > second_largest_size)
+			second_largest_size = p->pack_size;
+	}
+
+	result_size = second_largest_size + 1;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
 static int multi_pack_index_repack(struct repository *r)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+
 	argv_array_pushl(&cmd, "-C", r->worktree,
 			 "multi-pack-index", "repack", NULL);
 
 	if (opts.quiet)
 		argv_array_push(&cmd, "--no-progress");
 
-	argv_array_push(&cmd, "--batch-size=0");
+	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
+			    (uintmax_t)get_auto_pack_size(r));
+	argv_array_push(&cmd, batch_arg.buf);
 
 	close_object_store(r->objects);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	strbuf_release(&batch_arg);
 
 	if (result && multi_pack_index_verify(r)) {
 		warning(_("multi-pack-index verify failed after repack"));
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index a6be8456aa..43d32c131b 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -134,10 +134,11 @@ test_expect_success 'pack-files task' '
 	test_line_count = 4 packs-between &&
 
 	# the job deletes the two old packs, and does not write
-	# a new one because only one pack remains.
+	# a new one because the batch size is not high enough to
+	# pack the largest pack-file.
 	git maintenance run --task=pack-files &&
 	ls .git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	test_line_count = 2 packs-after
 '
 
 test_done
-- 
gitgitgadget



* [PATCH 16/21] maintenance: create maintenance.<task>.enabled config
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 15/21] maintenance: auto-size pack-files batch Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 17/21] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Currently, a normal run of "git maintenance run" will only run the 'gc'
task, as it is the only one enabled. This is mostly for
backwards-compatibility reasons, since "git maintenance run --auto"
commands replaced previous "git gc --auto" commands after some Git processes. Users could
manually run specific maintenance tasks by calling "git maintenance run
--task=<task>" directly.

Allow users to customize which steps are run automatically using config.
The 'maintenance.<task>.enabled' option then can turn on these other
tasks (or turn off the 'gc' task).
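
The override rule can be sketched in isolation as follows. This is a
toy model, not the patch itself: struct toy_task, struct toy_config
and apply_enabled_config() are hypothetical stand-ins for the task
array and for repo_config_get_bool(). Each task keeps its built-in
default unless a matching 'maintenance.<task>.enabled' key is present.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct toy_task {
	const char *name;
	int enabled;		/* built-in default, possibly overridden */
};

struct toy_config {
	const char *key;
	int value;		/* already parsed as a boolean */
};

/* Override each task's default only when its config key is present. */
static void apply_enabled_config(struct toy_task *tasks, size_t ntasks,
				 const struct toy_config *cfg, size_t ncfg)
{
	char key[128];
	size_t i, j;

	for (i = 0; i < ntasks; i++) {
		snprintf(key, sizeof(key), "maintenance.%s.enabled",
			 tasks[i].name);
		for (j = 0; j < ncfg; j++)
			if (!strcmp(cfg[j].key, key))
				tasks[i].enabled = cfg[j].value;
	}
}
```

A task with no matching key is left untouched, which is why 'gc'
stays the only enabled task when nothing is configured.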

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt             |  2 ++
 Documentation/config/maintenance.txt |  4 ++++
 Documentation/git-maintenance.txt    |  6 +++++-
 builtin/gc.c                         | 15 +++++++++++++--
 t/t7900-maintenance.sh               | 12 ++++++++++++
 5 files changed, 36 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/maintenance.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..2783b825f9 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -396,6 +396,8 @@ include::config/mailinfo.txt[]
 
 include::config/mailmap.txt[]
 
+include::config/maintenance.txt[]
+
 include::config/man.txt[]
 
 include::config/merge.txt[]
diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
new file mode 100644
index 0000000000..370cbfb42f
--- /dev/null
+++ b/Documentation/config/maintenance.txt
@@ -0,0 +1,4 @@
+maintenance.<task>.enabled::
+	This boolean config option controls whether the maintenance task
+	with name `<task>` is run when no `--task` option is specified.
+	By default, only `maintenance.gc.enabled` is true.
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 945fda368b..261d2e0ee1 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -30,7 +30,11 @@ SUBCOMMANDS
 -----------
 
 run::
-	Run one or more maintenance tasks.
+	Run one or more maintenance tasks. If one or more `--task` options
+	are specified, then those tasks are run in that order. Otherwise,
+	the tasks are determined by which `maintenance.<task>.enabled`
+	config options are true. By default, only `maintenance.gc.enabled`
+	is true.
 
 TASKS
 -----
diff --git a/builtin/gc.c b/builtin/gc.c
index 582219156a..6ffe2213b4 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1208,9 +1208,10 @@ static int maintenance_run(struct repository *r)
 	return result;
 }
 
-static void initialize_tasks(void)
+static void initialize_tasks(struct repository *r)
 {
 	int i;
+	struct strbuf config_name = STRBUF_INIT;
 	num_tasks = 0;
 
 	for (i = 0; i < MAX_NUM_TASKS; i++)
@@ -1240,10 +1241,20 @@ static void initialize_tasks(void)
 	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
 
 	for (i = 0; i < num_tasks; i++) {
+		int config_value;
+
 		hashmap_entry_init(&tasks[i]->ent,
 				   strihash(tasks[i]->name));
 		hashmap_add(&task_map, &tasks[i]->ent);
+
+		strbuf_setlen(&config_name, 0);
+		strbuf_addf(&config_name, "maintenance.%s.enabled", tasks[i]->name);
+
+		if (!repo_config_get_bool(r, config_name.buf, &config_value))
+			tasks[i]->enabled = config_value;
 	}
+
+	strbuf_release(&config_name);
 }
 
 static int task_option_parse(const struct option *opt,
@@ -1304,7 +1315,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 				   builtin_maintenance_options);
 
 	opts.quiet = !isatty(2);
-	initialize_tasks();
+	initialize_tasks(r);
 
 	argc = parse_options(argc, argv, prefix,
 			     builtin_maintenance_options,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 43d32c131b..08aa50a724 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -21,6 +21,18 @@ test_expect_success 'run [--auto|--quiet]' '
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
+test_expect_success 'maintenance.<task>.enabled' '
+	git config maintenance.gc.enabled false &&
+	git config maintenance.commit-graph.enabled true &&
+	git config maintenance.loose-objects.enabled true &&
+	GIT_TRACE2_EVENT="$(pwd)/run-config.txt" git maintenance run &&
+	! grep ",\"fetch\"" run-config.txt &&
+	! grep ",\"gc\"" run-config.txt &&
+	! grep ",\"multi-pack-index\"" run-config.txt &&
+	grep ",\"commit-graph\"" run-config.txt &&
+	grep ",\"prune-packed\"" run-config.txt
+'
+
 test_expect_success 'run --task=<task>' '
 	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
 	GIT_TRACE2_EVENT="$(pwd)/run-gc.txt" git maintenance run --task=gc &&
-- 
gitgitgadget



* [PATCH 17/21] maintenance: use pointers to check --auto
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 16/21] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 18/21] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git maintenance run' command has an '--auto' option. This is used
by other Git commands such as 'git commit' or 'git fetch' to check if
maintenance should be run after adding data to the repository.

Previously, this --auto option was only used to add the argument to the
'git gc' command as part of the 'gc' task. We will be expanding the
other tasks to perform a check to see if they should do work as part of
the --auto flag, when they are enabled by config.

First, update the 'gc' task to perform the auto check inside the
maintenance process. This prevents running an extra 'git gc --auto'
command when not needed. It also shows a model for other tasks.

Second, use the 'auto_condition' function pointer as a signal for
whether we enable the maintenance task under '--auto'. For instance, we
do not want to enable the 'fetch' task in '--auto' mode, so that
function pointer will remain NULL.
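
A minimal sketch of that rule, separate from the real task table:
struct toy_task and runs_under() are hypothetical names, and the
condition functions take no arguments here to keep the example
self-contained.

```c
#include <stddef.h>

typedef int auto_fn(void);

struct toy_task {
	const char *name;
	auto_fn *auto_condition;	/* NULL: never run under --auto */
};

/*
 * Without --auto every candidate task runs. With --auto a task runs
 * only if it has a condition function and that function returns
 * nonzero.
 */
static int runs_under(const struct toy_task *t, int auto_flag)
{
	if (!auto_flag)
		return 1;
	return t->auto_condition && t->auto_condition();
}

/* Two trivial conditions for experimenting with the rule. */
static int always_needed(void) { return 1; }
static int never_needed(void) { return 0; }
```

A task such as 'fetch' would leave auto_condition as NULL, so
runs_under() returns 0 for it whenever auto_flag is set.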

Now that we are not automatically calling 'git gc', a test in
t5514-fetch-multiple.sh must be changed to watch for 'git maintenance'
instead.

We continue to pass the '--auto' option to the 'git gc' command when
necessary, because the gc.autoDetach config option changes behavior.
Likely, we will want to absorb the daemonizing behavior implied by
gc.autoDetach as a maintenance.autoDetach config option.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c              | 15 +++++++++++++++
 t/t5514-fetch-multiple.sh |  2 +-
 t/t7900-maintenance.sh    |  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 6ffe2213b4..dd24026b41 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1132,10 +1132,18 @@ static int maintenance_task_pack_files(struct repository *r)
 
 typedef int maintenance_task_fn(struct repository *r);
 
+/*
+ * An auto condition function returns 1 if the task should run
+ * and 0 if the task should NOT run. See need_to_gc() for an
+ * example.
+ */
+typedef int maintenance_auto_fn(struct repository *r);
+
 struct maintenance_task {
 	struct hashmap_entry ent;
 	const char *name;
 	maintenance_task_fn *fn;
+	maintenance_auto_fn *auto_condition;
 	int task_order;
 	unsigned enabled:1,
 		 selected:1;
@@ -1201,6 +1209,11 @@ static int maintenance_run(struct repository *r)
 		if (!opts.tasks_selected && !tasks[i]->enabled)
 			continue;
 
+		if (opts.auto_flag &&
+		    (!tasks[i]->auto_condition ||
+		     !tasks[i]->auto_condition(r)))
+			continue;
+
 		result = tasks[i]->fn(r);
 	}
 
@@ -1231,6 +1244,7 @@ static void initialize_tasks(struct repository *r)
 
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
+	tasks[num_tasks]->auto_condition = need_to_gc;
 	tasks[num_tasks]->enabled = 1;
 	num_tasks++;
 
@@ -1315,6 +1329,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 				   builtin_maintenance_options);
 
 	opts.quiet = !isatty(2);
+	gc_config(r);
 	initialize_tasks(r);
 
 	argc = parse_options(argc, argv, prefix,
diff --git a/t/t5514-fetch-multiple.sh b/t/t5514-fetch-multiple.sh
index de8e2f1531..bd202ec6f3 100755
--- a/t/t5514-fetch-multiple.sh
+++ b/t/t5514-fetch-multiple.sh
@@ -108,7 +108,7 @@ test_expect_success 'git fetch --multiple (two remotes)' '
 	 GIT_TRACE=1 git fetch --multiple one two 2>trace &&
 	 git branch -r > output &&
 	 test_cmp ../expect output &&
-	 grep "built-in: git gc" trace >gc &&
+	 grep "built-in: git maintenance" trace >gc &&
 	 test_line_count = 1 gc
 	)
 '
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 08aa50a724..315bba2447 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -17,7 +17,7 @@ test_expect_success 'run [--auto|--quiet]' '
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
 	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
 	grep ",\"gc\"]" run-no-auto.txt  &&
-	grep ",\"gc\",\"--auto\"" run-auto.txt &&
+	! grep ",\"gc\",\"--auto\"" run-auto.txt &&
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
-- 
gitgitgadget



* [PATCH 18/21] maintenance: add auto condition for commit-graph task
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 17/21] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 19/21] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of writing a new commit-graph in every 'git maintenance run
--auto' process (when maintenance.commit-graph.enabled is configured to
be true), only write when there are "enough" commits not in a
commit-graph file.

This count is controlled by the maintenance.commit-graph.auto config
option.

To compute the count, use a depth-first search starting at each ref,
leaving markers using the PARENT1 flag. If this count reaches the limit,
then terminate early and start the task. Otherwise, this operation will
peel every ref and parse the commit it points to. If these are all in
the commit-graph, then this is typically a very fast operation. Users
with many refs might feel a slow-down, and hence could consider updating
their limit to be very small. A negative value will force the step to
run every time.
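
The early-terminating walk can be illustrated on a toy history. This
is only a sketch under simplifying assumptions: commits live in an
array, parents are array indices, the 'seen' field plays the role of
the PARENT1 flag, and ref tips are counted too, which differs slightly
from the patch. All names here are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_PARENTS 2

struct toy_commit {
	int parents[MAX_PARENTS];	/* index into the array, -1 if absent */
	int in_graph;			/* already in the commit-graph file */
	int seen;			/* marker, like the PARENT1 flag */
};

/*
 * Depth-first walk from each tip, counting commits that are not in
 * the commit-graph. Returns 1 as soon as the count reaches 'limit'
 * (assumed positive), 0 if the walk finishes below it, and -1 if the
 * stack cannot be allocated. The caller clears 'seen' afterwards.
 */
static int count_reaches_limit(struct toy_commit *commits, size_t n,
			       const int *tips, size_t ntips, int limit)
{
	int *stack;
	size_t top = 0, t;
	int count = 0, reached = 0;

	if (!n)
		return 0;
	stack = malloc(n * sizeof(*stack));
	if (!stack)
		return -1;

	for (t = 0; t < ntips && !reached; t++) {
		int tip = tips[t];

		if (commits[tip].in_graph || commits[tip].seen)
			continue;
		commits[tip].seen = 1;
		if (++count >= limit) {
			reached = 1;
			break;
		}
		stack[top++] = tip;	/* each commit is pushed at most once */

		while (top && !reached) {
			struct toy_commit *c = &commits[stack[--top]];
			int k;

			for (k = 0; k < MAX_PARENTS; k++) {
				int p = c->parents[k];

				if (p < 0 || commits[p].in_graph ||
				    commits[p].seen)
					continue;
				commits[p].seen = 1;
				if (++count >= limit) {
					reached = 1;
					break;
				}
				stack[top++] = p;
			}
		}
	}

	free(stack);
	return reached;
}
```

The walk never descends past a commit that is already in the
commit-graph, which is why the check is cheap when the file is close
to up to date.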

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/maintenance.txt | 10 ++++
 builtin/gc.c                         | 76 ++++++++++++++++++++++++++++
 object.h                             |  1 +
 3 files changed, 87 insertions(+)

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index 370cbfb42f..9bd69b9df3 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -2,3 +2,13 @@ maintenance.<task>.enabled::
 	This boolean config option controls whether the maintenance task
 	with name `<task>` is run when no `--task` option is specified.
 	By default, only `maintenance.gc.enabled` is true.
+
+maintenance.commit-graph.auto::
+	This integer config option controls how often the `commit-graph` task
+	should be run as part of `git maintenance run --auto`. If zero, then
+	the `commit-graph` task will not run with the `--auto` option. A
+	negative value will force the task to run every time. Otherwise, a
+	positive value implies the command should run when the number of
+	reachable commits that are not in the commit-graph file is at least
+	the value of `maintenance.commit-graph.auto`. The default value is
+	100.
diff --git a/builtin/gc.c b/builtin/gc.c
index dd24026b41..81b076b012 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -30,6 +30,7 @@
 #include "promisor-remote.h"
 #include "remote.h"
 #include "midx.h"
+#include "refs.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -720,6 +721,80 @@ struct maintenance_opts {
 	int tasks_selected;
 } opts;
 
+/* Remember to update object flag allocation in object.h */
+#define PARENT1		(1u<<16)
+
+static int num_commits_not_in_graph = 0;
+static int limit_commits_not_in_graph = 100;
+
+static int dfs_on_ref(const char *refname,
+		      const struct object_id *oid, int flags,
+		      void *cb_data)
+{
+	int result = 0;
+	struct object_id peeled;
+	struct commit_list *stack = NULL;
+	struct commit *commit;
+
+	if (!peel_ref(refname, &peeled))
+		oid = &peeled;
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	commit = lookup_commit(the_repository, oid);
+	if (!commit)
+		return 0;
+	if (parse_commit(commit))
+		return 0;
+
+	commit_list_append(commit, &stack);
+
+	while (!result && stack) {
+		struct commit_list *parent;
+
+		commit = pop_commit(&stack);
+
+		for (parent = commit->parents; parent; parent = parent->next) {
+			if (parse_commit(parent->item) ||
+			    commit_graph_position(parent->item) != COMMIT_NOT_FROM_GRAPH ||
+			    parent->item->object.flags & PARENT1)
+				continue;
+
+			parent->item->object.flags |= PARENT1;
+			num_commits_not_in_graph++;
+
+			if (num_commits_not_in_graph >= limit_commits_not_in_graph) {
+				result = 1;
+				break;
+			}
+
+			commit_list_append(parent->item, &stack);
+		}
+	}
+
+	free_commit_list(stack);
+	return result;
+}
+
+static int should_write_commit_graph(struct repository *r)
+{
+	int result;
+
+	repo_config_get_int(r, "maintenance.commit-graph.auto",
+			    &limit_commits_not_in_graph);
+
+	if (!limit_commits_not_in_graph)
+		return 0;
+	if (limit_commits_not_in_graph < 0)
+		return 1;
+
+	result = for_each_ref(dfs_on_ref, NULL);
+
+	clear_commit_marks_all(PARENT1);
+
+	return result;
+}
+
 static int run_write_commit_graph(struct repository *r)
 {
 	int result;
@@ -1250,6 +1325,7 @@ static void initialize_tasks(struct repository *r)
 
 	tasks[num_tasks]->name = "commit-graph";
 	tasks[num_tasks]->fn = maintenance_task_commit_graph;
+	tasks[num_tasks]->auto_condition = should_write_commit_graph;
 	num_tasks++;
 
 	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
diff --git a/object.h b/object.h
index 38dc2d5a6c..4f886495d7 100644
--- a/object.h
+++ b/object.h
@@ -73,6 +73,7 @@ struct object_array {
  * list-objects-filter.c:                                      21
  * builtin/fsck.c:           0--3
  * builtin/index-pack.c:                                     2021
+ * builtin/maintenance.c:                           16
  * builtin/pack-objects.c:                                   20
  * builtin/reflog.c:                   10--12
  * builtin/show-branch.c:    0-------------------------------------------26
-- 
gitgitgadget



* [PATCH 19/21] maintenance: create auto condition for loose-objects
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 18/21] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 20/21] maintenance: add pack-files auto condition Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The loose-objects task deletes loose objects that already exist in a
pack-file, then places the remaining loose objects into a new pack-file.
If this step runs all the time, then we risk creating pack-files with
very few objects with every 'git commit' process. To prevent
overwhelming the packs directory with small pack-files, require a
minimum number of loose objects before running the task.

The 'maintenance.loose-objects.auto' config option specifies a minimum
number of loose objects to justify the task to run under the '--auto'
option. This defaults to 100 loose objects. Setting the value to zero
will prevent the step from running under '--auto' while a negative value
will force it to run every time.
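
The three cases (zero, negative, positive) reduce to a small
predicate. The helper below is a hypothetical sketch of that rule and
is not part of the patch, which spreads the same checks across
loose_object_auto_condition() and its counting callback.

```c
/*
 * Decide whether a task should run under --auto:
 *   limit == 0  never run
 *   limit <  0  always run
 *   limit >  0  run once 'count' has reached the limit
 */
static int auto_threshold_met(int limit, int count)
{
	if (!limit)
		return 0;
	if (limit < 0)
		return 1;
	return count >= limit;
}
```

With the default limit of 100, a repository holding 99 loose objects
is left alone and one holding 100 triggers the task.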

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/maintenance.txt |  9 +++++++++
 builtin/gc.c                         | 30 ++++++++++++++++++++++++++++
 t/t7900-maintenance.sh               | 25 +++++++++++++++++++++++
 3 files changed, 64 insertions(+)

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index 9bd69b9df3..a9442dd260 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -12,3 +12,12 @@ maintenance.commit-graph.auto::
 	reachable commits that are not in the commit-graph file is at least
 	the value of `maintenance.commit-graph.auto`. The default value is
 	100.
+
+maintenance.loose-objects.auto::
+	This integer config option controls how often the `loose-objects` task
+	should be run as part of `git maintenance run --auto`. If zero, then
+	the `loose-objects` task will not run with the `--auto` option. A
+	negative value will force the task to run every time. Otherwise, a
+	positive value implies the command should run when the number of
+	loose objects is at least the value of `maintenance.loose-objects.auto`.
+	The default value is 100.
diff --git a/builtin/gc.c b/builtin/gc.c
index 81b076b012..391e1e2121 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -960,6 +960,35 @@ struct write_loose_object_data {
 	int batch_size;
 };
 
+static int loose_object_auto_limit = 100;
+
+static int loose_object_count(const struct object_id *oid,
+			       const char *path,
+			       void *data)
+{
+	int *count = (int*)data;
+	if (++(*count) >= loose_object_auto_limit)
+		return 1;
+	return 0;
+}
+
+static int loose_object_auto_condition(struct repository *r)
+{
+	int count = 0;
+
+	repo_config_get_int(r, "maintenance.loose-objects.auto",
+			    &loose_object_auto_limit);
+
+	if (!loose_object_auto_limit)
+		return 0;
+	if (loose_object_auto_limit < 0)
+		return 1;
+
+	return for_each_loose_file_in_objdir(r->objects->odb->path,
+					     loose_object_count,
+					     NULL, NULL, &count);
+}
+
 static int loose_object_exists(const struct object_id *oid,
 			       const char *path,
 			       void *data)
@@ -1311,6 +1340,7 @@ static void initialize_tasks(struct repository *r)
 
 	tasks[num_tasks]->name = "loose-objects";
 	tasks[num_tasks]->fn = maintenance_task_loose_objects;
+	tasks[num_tasks]->auto_condition = loose_object_auto_condition;
 	num_tasks++;
 
 	tasks[num_tasks]->name = "pack-files";
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 315bba2447..a55c36d249 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -115,6 +115,31 @@ test_expect_success 'loose-objects task' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'maintenance.loose-objects.auto' '
+	git repack -adk &&
+	GIT_TRACE2_EVENT="$(pwd)/trace-lo1.txt" \
+		git -c maintenance.loose-objects.auto=1 maintenance \
+		run --auto --task=loose-objects &&
+	! grep "\"prune-packed\"" trace-lo1.txt &&
+	for i in 1 2
+	do
+		printf data-A-$i | git hash-object -t blob --stdin -w &&
+		GIT_TRACE2_EVENT="$(pwd)/trace-loA-$i" \
+			git -c maintenance.loose-objects.auto=2 \
+			maintenance run --auto --task=loose-objects &&
+		! grep "\"prune-packed\"" trace-loA-$i &&
+		printf data-B-$i | git hash-object -t blob --stdin -w &&
+		GIT_TRACE2_EVENT="$(pwd)/trace-loB-$i" \
+			git -c maintenance.loose-objects.auto=2 \
+			maintenance run --auto --task=loose-objects &&
+		grep "\"prune-packed\"" trace-loB-$i &&
+		GIT_TRACE2_EVENT="$(pwd)/trace-loC-$i" \
+			git -c maintenance.loose-objects.auto=2 \
+			maintenance run --auto --task=loose-objects &&
+		grep "\"prune-packed\"" trace-loC-$i || return 1
+	done
+'
+
 test_expect_success 'pack-files task' '
 	packDir=.git/objects/pack &&
 	for i in $(test_seq 1 5)
-- 
gitgitgadget



* [PATCH 20/21] maintenance: add pack-files auto condition
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 19/21] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-07 14:21 ` [PATCH 21/21] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The pack-files task updates the multi-pack-index by deleting pack-files
that have been replaced with new packs, then repacking a batch of small
pack-files into a larger pack-file. This incremental repack is faster
than rewriting all object data, but is slower than some other
maintenance activities.

The 'maintenance.pack-files.auto' config option specifies how many
pack-files should exist outside of the multi-pack-index before running
the step. These pack-files could be created by 'git fetch' commands or
by the loose-objects task. The default value is 10.

Setting the option to zero disables the task with the '--auto' option,
and a negative value makes the task run every time.
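
The three cases can be summarized in a small shell sketch. It mirrors
the decision made by pack_files_auto_condition() in the diff below, but
omits the core.multiPackIndex check. The function name and its
arguments are hypothetical and exist only for illustration.

```shell
# Hypothetical helper, for illustration only: decide whether the
# pack-files task should run under --auto.
#   $1: value of maintenance.pack-files.auto
#   $2: number of pack-files not covered by the multi-pack-index
# Returns 0 when the task should run, 1 when it should be skipped.
pack_files_auto_should_run () {
	limit=$1
	count=$2

	# Zero disables the task under --auto.
	if [ "$limit" -eq 0 ]
	then
		return 1
	fi

	# A negative value forces the task to run every time.
	if [ "$limit" -lt 0 ]
	then
		return 0
	fi

	# Otherwise, run once enough packs sit outside the multi-pack-index.
	[ "$count" -ge "$limit" ]
}
```

With the default limit of 10, nine stray pack-files leave the task
idle and the tenth triggers it.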

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/maintenance.txt |  9 ++++++++
 builtin/gc.c                         | 31 ++++++++++++++++++++++++++++
 t/t7900-maintenance.sh               | 30 +++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index a9442dd260..77b255318c 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -21,3 +21,12 @@ maintenance.loose-objects.auto::
 	positive value implies the command should run when the number of
 	loose objects is at least the value of `maintenance.loose-objects.auto`.
 	The default value is 100.
+
+maintenance.pack-files.auto::
+	This integer config option controls how often the `pack-files` task
+	should be run as part of `git maintenance run --auto`. If zero, then
+	the `pack-files` task will not run with the `--auto` option. A
+	negative value will force the task to run every time. Otherwise, a
+	positive value implies the command should run when the number of
+	pack-files not in the multi-pack-index is at least the value of
+	`maintenance.pack-files.auto`. The default value is 10.
diff --git a/builtin/gc.c b/builtin/gc.c
index 391e1e2121..c3531561c2 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -31,6 +31,7 @@
 #include "remote.h"
 #include "midx.h"
 #include "refs.h"
+#include "object-store.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -1072,6 +1073,35 @@ static int maintenance_task_loose_objects(struct repository *r)
 	return prune_packed(r) || pack_loose(r);
 }
 
+static int pack_files_auto_condition(struct repository *r)
+{
+	struct packed_git *p;
+	int enabled;
+	int pack_files_auto_limit = 10;
+	int count = 0;
+
+	if (repo_config_get_bool(r, "core.multiPackIndex", &enabled) ||
+	    !enabled)
+		return 0;
+
+	repo_config_get_int(r, "maintenance.pack-files.auto",
+			    &pack_files_auto_limit);
+
+	if (!pack_files_auto_limit)
+		return 0;
+	if (pack_files_auto_limit < 0)
+		return 1;
+
+	for (p = get_packed_git(r);
+	     count < pack_files_auto_limit && p;
+	     p = p->next) {
+		if (!p->multi_pack_index)
+			count++;
+	}
+
+	return count >= pack_files_auto_limit;
+}
+
 static int multi_pack_index_write(struct repository *r)
 {
 	int result;
@@ -1345,6 +1375,7 @@ static void initialize_tasks(struct repository *r)
 
 	tasks[num_tasks]->name = "pack-files";
 	tasks[num_tasks]->fn = maintenance_task_pack_files;
+	tasks[num_tasks]->auto_condition = pack_files_auto_condition;
 	num_tasks++;
 
 	tasks[num_tasks]->name = "gc";
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index a55c36d249..1714d11bd9 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -178,4 +178,34 @@ test_expect_success 'pack-files task' '
 	test_line_count = 2 packs-after
 '
 
+test_expect_success 'maintenance.pack-files.auto' '
+	git repack -adk &&
+	git config core.multiPackIndex true &&
+	git multi-pack-index write &&
+	GIT_TRACE2_EVENT=1 git -c maintenance.pack-files.auto=1 maintenance \
+		run --auto --task=pack-files 2>out &&
+	! grep "\"multi-pack-index\"" out &&
+	for i in 1 2
+	do
+		test_commit A-$i &&
+		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
+		HEAD
+		^HEAD~1
+		EOF
+		GIT_TRACE2_EVENT=$(pwd)/trace-A-$i git \
+			-c maintenance.pack-files.auto=2 \
+			maintenance run --auto --task=pack-files &&
+		! grep "\"multi-pack-index\"" trace-A-$i &&
+		test_commit B-$i &&
+		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
+		HEAD
+		^HEAD~1
+		EOF
+		GIT_TRACE2_EVENT=$(pwd)/trace-B-$i git \
+			-c maintenance.pack-files.auto=2 \
+			maintenance run --auto --task=pack-files >out &&
+		grep "\"multi-pack-index\"" trace-B-$i >/dev/null || return 1
+	done
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 21/21] midx: use start_delayed_progress()
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (19 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 20/21] maintenance: add pack-files auto condition Derrick Stolee via GitGitGadget
@ 2020-07-07 14:21 ` Derrick Stolee via GitGitGadget
  2020-07-08 23:57 ` [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Now that the multi-pack-index may be written as part of auto maintenance
at the end of a command, reduce the progress output when the operations
are quick. Use start_delayed_progress() instead of start_progress().

Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that
the progress indicators are conditional.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 10 +++++-----
 t/t5319-multi-pack-index.sh | 14 +++++++-------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/midx.c b/midx.c
index 57a8a00082..d4022e4aef 100644
--- a/midx.c
+++ b/midx.c
@@ -837,7 +837,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	packs.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_progress(_("Adding packfiles to multi-pack-index"), 0);
+		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
 		packs.progress = NULL;
 
@@ -974,7 +974,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Writing chunks to multi-pack-index"),
+		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (written != chunk_offsets[i])
@@ -1109,7 +1109,7 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag
 		return 0;
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Looking for referenced packfiles"),
+		progress = start_delayed_progress(_("Looking for referenced packfiles"),
 					  m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		if (prepare_midx_pack(r, m, i))
@@ -1230,7 +1230,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	count = xcalloc(m->num_packs, sizeof(uint32_t));
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Counting referenced objects"),
+		progress = start_delayed_progress(_("Counting referenced objects"),
 					  m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
 		int pack_int_id = nth_midxed_pack_int_id(m, i);
@@ -1240,7 +1240,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	stop_progress(&progress);
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Finding and deleting unreferenced packfiles"),
+		progress = start_delayed_progress(_("Finding and deleting unreferenced packfiles"),
 					  m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		char *pack_name;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 7214cab36c..12f41dfc18 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -172,12 +172,12 @@ test_expect_success 'write progress off for redirected stderr' '
 '
 
 test_expect_success 'write force progress on for stderr' '
-	git multi-pack-index --object-dir=$objdir --progress write 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --progress write 2>err &&
 	test_file_not_empty err
 '
 
 test_expect_success 'write with the --no-progress option' '
-	git multi-pack-index --object-dir=$objdir --no-progress write 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --no-progress write 2>err &&
 	test_line_count = 0 err
 '
 
@@ -334,17 +334,17 @@ test_expect_success 'git-fsck incorrect offset' '
 '
 
 test_expect_success 'repack progress off for redirected stderr' '
-	git multi-pack-index --object-dir=$objdir repack 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir repack 2>err &&
 	test_line_count = 0 err
 '
 
 test_expect_success 'repack force progress on for stderr' '
-	git multi-pack-index --object-dir=$objdir --progress repack 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --progress repack 2>err &&
 	test_file_not_empty err
 '
 
 test_expect_success 'repack with the --no-progress option' '
-	git multi-pack-index --object-dir=$objdir --no-progress repack 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --no-progress repack 2>err &&
 	test_line_count = 0 err
 '
 
@@ -488,7 +488,7 @@ test_expect_success 'expire progress off for redirected stderr' '
 test_expect_success 'expire force progress on for stderr' '
 	(
 		cd dup &&
-		git multi-pack-index --progress expire 2>err &&
+		GIT_PROGRESS_DELAY=0 git multi-pack-index --progress expire 2>err &&
 		test_file_not_empty err
 	)
 '
@@ -496,7 +496,7 @@ test_expect_success 'expire force progress on for stderr' '
 test_expect_success 'expire with the --no-progress option' '
 	(
 		cd dup &&
-		git multi-pack-index --no-progress expire 2>err &&
+		GIT_PROGRESS_DELAY=0 git multi-pack-index --no-progress expire 2>err &&
 		test_line_count = 0 err
 	)
 '
-- 
gitgitgadget


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (20 preceding siblings ...)
  2020-07-07 14:21 ` [PATCH 21/21] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
@ 2020-07-08 23:57 ` Emily Shaffer
  2020-07-09 11:21   ` Derrick Stolee
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
  22 siblings, 1 reply; 164+ messages in thread
From: Emily Shaffer @ 2020-07-08 23:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
> This is a second attempt at redesigning Git's repository maintenance
> patterns. The first attempt [1] included a way to run jobs in the background
> using a long-lived process; that idea was rejected and is not included in
> this series. A future series will use the OS to handle scheduling tasks.
> 
> [1] 
> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/
> 
> As mentioned before, git gc already plays the role of maintaining Git
> repositories. It has accumulated several smaller pieces in its long history,
> including:
> 
>  1. Repacking all reachable objects into one pack-file (and deleting
>     unreachable objects).
>  2. Packing refs.
>  3. Expiring reflogs.
>  4. Clearing rerere logs.
>  5. Updating the commit-graph file.
> 
> While expiring reflogs, clearing rerere logs, and deleting unreachable
> objects are suitable under the guise of "garbage collection", packing refs
> and updating the commit-graph file are not as obviously fitting. Further,
> these operations are "all or nothing" in that they rewrite almost all
> repository data, which does not perform well at extremely large scales.
> These operations can also be disruptive to foreground Git commands when git
> gc --auto triggers during routine use.
> 
> This series does not intend to change what git gc does, but instead create
> new choices for automatic maintenance activities, of which git gc remains
> the only one enabled by default.
> 
> The new maintenance tasks are:
> 
>  * 'commit-graph' : write and verify a single layer of an incremental
>    commit-graph.
>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>    a batch of loose objects.
>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>    repack using the multi-pack-index's incremental repack strategy.
>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.
> 
> These tasks are all disabled by default, but can be enabled with config
> options or run explicitly using "git maintenance run --task=". There are
> additional config options to allow customizing the conditions for which the
> tasks run during the '--auto' option. ('fetch' will never run with the
> '--auto' option.)
> 
> Because 'gc' is implemented as a maintenance task, the most dramatic change
> of this series is to convert the 'git gc --auto' calls into 'git maintenance
> run --auto' calls at the end of some Git commands. By default, the only
> change is that 'git gc --auto' will be run below an additional 'git
> maintenance' process.
> 
> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
> later with subcommands that manage background maintenance, such as 'start',
> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
> it is important to focus on the maintenance activities themselves.
> 
> An expert user could set up scheduled background maintenance themselves with
> the current series. I have the following crontab data set up to run
> maintenance on an hourly basis:
> 
> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

One thing I wonder about - now I have to go and make a new crontab
(which is easy) or Task Scheduler task (which is a pain) for every repo,
right?

Is it infeasible to ask for 'git maintenance' to learn something like
'--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
config like "maintenance.targetRepo = /<path-to-repo>"?
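
Until 'git maintenance' grows something like that, one workaround
might be a single scheduled entry that calls a wrapper looping over a
list of repositories. A minimal sketch, assuming a hypothetical list
file with one repository path per line. The function name, the list
file and the GIT override variable are inventions for illustration;
only the 'maintenance run --no-quiet' invocation comes from the cover
letter.

```shell
# Hypothetical wrapper: run maintenance in every repository listed in a file.
#   $1: path to a file containing one repository path per line
# Set GIT to substitute another executable (useful for a dry run).
maintain_all_repos () {
	list=$1
	rc=0

	while IFS= read -r repo
	do
		# Skip blank lines and comment lines.
		case "$repo" in
		''|'#'*) continue ;;
		esac

		# Redirect stdin so the command cannot consume the list.
		"${GIT:-git}" -C "$repo" maintenance run --no-quiet </dev/null ||
			rc=1
	done <"$list"

	return $rc
}
```

The loop keeps going after a failure and reports it through the exit
status, so one broken repository does not block the others.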

> 
> My config includes all tasks except the 'gc' task. The hourly run is
> over-aggressive, but is sufficient for testing. I'll replace it with daily
> when I feel satisfied.
> 
> Hopefully this direction is seen as a positive one. My goal was to add more
> options for expert users, along with the flexibility to create background
> maintenance via the OS in a later series.
> 
> OUTLINE
> =======
> 
> Patches 1-4 remove some references to the_repository in builtin/gc.c before
> we start depending on code in that builtin.
> 
> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
> simple shim over 'git gc' and replace calls to 'git gc --auto' from other
> commands.

For me, I'd prefer to see 'git maintenance run' get bigger and 'git gc
--auto' get smaller or disappear. Is there a plan towards that
direction, or is that out of scope for 'git maintenance run'? Similar
examples I can think of include 'git annotate' and 'git pickaxe'.

> 
> Patches 8-15 create new maintenance tasks. These are the same tasks sent in
> the previous RFC.
> 
> Patches 16-21 create more customization through config and perform other
> polish items.
> 
> FUTURE WORK
> ===========
> 
>  * Add 'start', 'stop', and 'schedule' subcommands to initialize the
>    commands run in the background.
>    
>    
>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>    default, but might have different '--auto' conditions and more config
>    options.

Like I mentioned above, for me, I'd rather just see the 'gc' builtin go
away :)

>    
>    
>  * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
>    with use of the 'commit-graph' task.
>    
>    
> 
> Thanks, -Stolee
> 
> Derrick Stolee (21):
>   gc: use the_repository less often
>   gc: use repository in too_many_loose_objects()
>   gc: use repo config
>   gc: drop the_repository in log location
>   maintenance: create basic maintenance runner
>   maintenance: add --quiet option
>   maintenance: replace run_auto_gc()
>   maintenance: initialize task array and hashmap
>   maintenance: add commit-graph task
>   maintenance: add --task option
>   maintenance: take a lock on the objects directory
>   maintenance: add fetch task
>   maintenance: add loose-objects task
>   maintenance: add pack-files task
>   maintenance: auto-size pack-files batch
>   maintenance: create maintenance.<task>.enabled config
>   maintenance: use pointers to check --auto
>   maintenance: add auto condition for commit-graph task
>   maintenance: create auto condition for loose-objects
>   maintenance: add pack-files auto condition
>   midx: use start_delayed_progress()
> 
>  .gitignore                           |   1 +
>  Documentation/config.txt             |   2 +
>  Documentation/config/maintenance.txt |  32 +
>  Documentation/fetch-options.txt      |   5 +-
>  Documentation/git-clone.txt          |   7 +-
>  Documentation/git-maintenance.txt    | 124 ++++
>  builtin.h                            |   1 +
>  builtin/am.c                         |   2 +-
>  builtin/commit.c                     |   2 +-
>  builtin/fetch.c                      |   6 +-
>  builtin/gc.c                         | 881 +++++++++++++++++++++++++--
>  builtin/merge.c                      |   2 +-
>  builtin/rebase.c                     |   4 +-
>  commit-graph.c                       |   8 +-
>  commit-graph.h                       |   1 +
>  config.c                             |  24 +-
>  config.h                             |   2 +
>  git.c                                |   1 +
>  midx.c                               |  12 +-
>  midx.h                               |   1 +
>  object.h                             |   1 +
>  run-command.c                        |   7 +-
>  run-command.h                        |   2 +-
>  t/t5319-multi-pack-index.sh          |  14 +-
>  t/t5510-fetch.sh                     |   2 +-
>  t/t5514-fetch-multiple.sh            |   2 +-
>  t/t7900-maintenance.sh               | 211 +++++++
>  27 files changed, 1265 insertions(+), 92 deletions(-)
>  create mode 100644 Documentation/config/maintenance.txt
>  create mode 100644 Documentation/git-maintenance.txt
>  create mode 100755 t/t7900-maintenance.sh
> 
> 
> base-commit: 4a0fcf9f760c9774be77f51e1e88a7499b53d2e2
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/671
> -- 
> gitgitgadget


* Re: [PATCH 04/21] gc: drop the_repository in log location
  2020-07-07 14:21 ` [PATCH 04/21] gc: drop the_repository in log location Derrick Stolee via GitGitGadget
@ 2020-07-09  2:22   ` Jonathan Tan
  2020-07-09 11:13     ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Jonathan Tan @ 2020-07-09  2:22 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee,
	Jonathan Tan

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The report_last_gc_error() method uses git_pathdup() which implicitly
> uses the_repository. Replace this with strbuf_repo_path() to get a
> path buffer we control that uses a given repository pointer.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Regarding the first 4 patches, it would be great if there was a test
like the one in test_repository.c (together with a code coverage report,
perhaps) that verifies that all code paths do not use the_repository.

But set aside that test for now - I don't think gc.c fully supports
arbitrary repositories. In particular, when running a subprocess, it
inherits the environment from the current process. I see that future
patches try to resolve this by passing "-C", but that does not work if
environment variables like GIT_DIR are set (because the environment
variables override the "-C"). Perhaps we need a function that runs a
subprocess in a specific repository. I ran into the same problem when
attempting to make fetch-pack (which runs index-pack as a subprocess)
support arbitrary repositories, but I haven't looked deeply into
resolving this yet (and I haven't looked at that problem in a while).

Having said that, I'm fine with these patches being in the set - the
only negative is that perhaps a reader would be misled into thinking
that GC supports arbitrary repositories.
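
Such a helper could clear the repository-specific environment before
spawning the child. The sketch below shows the idea in shell rather
than in C. The name run_in_repo and the GIT override variable are
hypothetical, and the variable list is deliberately partial
('git rev-parse --local-env-vars' prints the full set that Git treats
as repository-local).

```shell
# Hypothetical helper: run a Git command against one specific repository,
# ignoring repository-selecting variables inherited from the caller.
#   $1: path to the repository; remaining arguments: the Git command to run
run_in_repo () {
	repo=$1
	shift

	(
		# Partial list, for illustration. An inherited GIT_DIR would
		# otherwise select the repository regardless of -C.
		unset GIT_DIR GIT_WORK_TREE GIT_INDEX_FILE \
			GIT_OBJECT_DIRECTORY GIT_COMMON_DIR
		"${GIT:-git}" -C "$repo" "$@"
	)
}
```

The subshell keeps the change local: the caller's own GIT_DIR is left
untouched after the command returns.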


* Re: [PATCH 08/21] maintenance: initialize task array and hashmap
  2020-07-07 14:21 ` [PATCH 08/21] maintenance: initialize task array and hashmap Derrick Stolee via GitGitGadget
@ 2020-07-09  2:25   ` Jonathan Tan
  2020-07-09 13:15     ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Jonathan Tan @ 2020-07-09  2:25 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee,
	Jonathan Tan

> This list is also inserted into a hashmap. This allows command-line
> arguments to quickly find the tasks by name, not sensitive to case. To
> ensure this list and hashmap work well together, the list only contains
> pointers to the struct information. This will allow a sort on the list
> while preserving the hashmap data.

I think having the hashmap is unnecessarily complicated in this case -
with the small number of tasks, a list would be fine. But I don't feel
strongly about this.


* Re: [PATCH 09/21] maintenance: add commit-graph task
  2020-07-07 14:21 ` [PATCH 09/21] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-09  2:29   ` Jonathan Tan
  2020-07-09 11:14     ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Jonathan Tan @ 2020-07-09  2:29 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee,
	Jonathan Tan

> +static int run_write_commit_graph(struct repository *r)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "-C", r->worktree,
> +			 "commit-graph", "write",
> +			 "--split", "--reachable",
> +			 NULL);

As mentioned in my reply to an earlier patch (sent a few minutes ago),
this won't work if there are environment variables like GIT_DIR present.

Besides that, the overall design looks reasonable.


* Re: [PATCH 12/21] maintenance: add fetch task
  2020-07-07 14:21 ` [PATCH 12/21] maintenance: add fetch task Derrick Stolee via GitGitGadget
@ 2020-07-09  2:35   ` Jonathan Tan
  0 siblings, 0 replies; 164+ messages in thread
From: Jonathan Tan @ 2020-07-09  2:35 UTC (permalink / raw)
  To: gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee,
	Jonathan Tan

> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>    longer appear on the remote.

Having a ref path where Git can place commit IDs that it needs persisted
is useful, not only in this case but in other cases (e.g. when fetching
a submodule commit by hash, we might not have a ref name for that commit
but want to persist it anyway), so I look forward to having something
like this.

The name of this special ref path and its specific nature could be
discussed further, but maybe it is sufficient for now to just say that
the refs under this special ref path are controlled by Git, and their
layout is experimental and subject to change (e.g. future versions of
Git could just erase the entire path and rewrite the refs its own way).
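
For reference, the fetch described in the quoted patch can be
approximated from the command line. A minimal sketch, assuming the
refspec layout quoted above; hidden_refspec, fetch_hidden and the GIT
override variable are hypothetical names, and the patch's exact set of
fetch options is not reproduced here.

```shell
# Hypothetical helper: build the refspec that maps a remote's branches into
# the hidden namespace instead of refs/remotes/.
#   $1: remote name
hidden_refspec () {
	printf '+refs/heads/*:refs/hidden/%s/*\n' "$1"
}

# Hypothetical helper: fetch from one remote into refs/hidden/<remote>/,
# pruning hidden refs that no longer exist on the remote.
#   $1: remote name
fetch_hidden () {
	"${GIT:-git}" fetch --prune "$1" "$(hidden_refspec "$1")"
}
```

Because the destination lies outside refs/heads/ and refs/remotes/,
the user's branches and remote-tracking refs stay where they were.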


* Re: [PATCH 04/21] gc: drop the_repository in log location
  2020-07-09  2:22   ` Jonathan Tan
@ 2020-07-09 11:13     ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 11:13 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee

On 7/8/2020 10:22 PM, Jonathan Tan wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The report_last_gc_error() method uses git_pathdup() which implicitly
>> uses the_repository. Replace this with strbuf_repo_path() to get a
>> path buffer we control that uses a given repository pointer.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> 
> Regarding the first 4 patches, it would be great if there was a test
> like the one in test_repository.c (together with a code coverage report,
> perhaps) that verifies that all code paths do not use the_repository.
> 
> But set aside that test for now - I don't think gc.c fully supports
> arbitrary repositories. In particular, when running a subprocess, it
> inherits the environment from the current process. I see that future
> patches try to resolve this by passing "-C", but that does not work if
> environment variables like GIT_DIR are set (because the environment
> variables override the "-C"). Perhaps we need a function that runs a
> subprocess in a specific repository. I ran into the same problem when
> attempting to make fetch-pack (which runs index-pack as a subprocess)
> support arbitrary repositories, but I haven't looked deeply into
> resolving this yet (and I haven't looked at that problem in a while).
> 
> Having said that, I'm fine with these patches being in the set - the
> only negative is that perhaps a reader would be misled into thinking
> that GC supports arbitrary repositories.

I agree. I hope that I do not give the impression that GC is now
safe for arbitrary repositories. I only thought that this was prudent
to do before I start taking new dependencies on the code.

It's probably time for someone to do a round of the_repository cleanups
again, and perhaps the RC window is a good time to think about that
(for submission after the release).

Thanks,
-Stolee




* Re: [PATCH 09/21] maintenance: add commit-graph task
  2020-07-09  2:29   ` Jonathan Tan
@ 2020-07-09 11:14     ` Derrick Stolee
  2020-07-09 22:52       ` Jeff King
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 11:14 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee

On 7/8/2020 10:29 PM, Jonathan Tan wrote:
>> +static int run_write_commit_graph(struct repository *r)
>> +{
>> +	int result;
>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>> +
>> +	argv_array_pushl(&cmd, "-C", r->worktree,
>> +			 "commit-graph", "write",
>> +			 "--split", "--reachable",
>> +			 NULL);
> 
> As mentioned in my reply to an earlier patch (sent a few minutes ago),
> this won't work if there are environment variables like GIT_DIR present.

Do we not pass GIT_DIR to the subcommand? Or does using "-C" override
the GIT_DIR?

Thanks,
-Stolee



* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-08 23:57 ` [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
@ 2020-07-09 11:21   ` Derrick Stolee
  2020-07-09 12:43     ` Derrick Stolee
  2020-07-09 14:05     ` Junio C Hamano
  0 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 11:21 UTC (permalink / raw)
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/8/2020 7:57 PM, Emily Shaffer wrote:
> On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>> This is a second attempt at redesigning Git's repository maintenance
>> patterns. The first attempt [1] included a way to run jobs in the background
>> using a long-lived process; that idea was rejected and is not included in
>> this series. A future series will use the OS to handle scheduling tasks.
>>
>> [1] 
>> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/
>>
>> As mentioned before, git gc already plays the role of maintaining Git
>> repositories. It has accumulated several smaller pieces in its long history,
>> including:
>>
>>  1. Repacking all reachable objects into one pack-file (and deleting
>>     unreachable objects).
>>  2. Packing refs.
>>  3. Expiring reflogs.
>>  4. Clearing rerere logs.
>>  5. Updating the commit-graph file.
>>
>> While expiring reflogs, clearing rerere logs, and deleting unreachable
>> objects are suitable under the guise of "garbage collection", packing refs
>> and updating the commit-graph file are not as obviously fitting. Further,
>> these operations are "all or nothing" in that they rewrite almost all
>> repository data, which does not perform well at extremely large scales.
>> These operations can also be disruptive to foreground Git commands when git
>> gc --auto triggers during routine use.
>>
>> This series does not intend to change what git gc does, but instead create
>> new choices for automatic maintenance activities, of which git gc remains
>> the only one enabled by default.
>>
>> The new maintenance tasks are:
>>
>>  * 'commit-graph' : write and verify a single layer of an incremental
>>    commit-graph.
>>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>>    a batch of loose objects.
>>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>>    repack using the multi-pack-index's incremental repack strategy.
>>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.
>>
>> These tasks are all disabled by default, but can be enabled with config
>> options or run explicitly using "git maintenance run --task=". There are
>> additional config options to allow customizing the conditions for which the
>> tasks run during the '--auto' option. ('fetch' will never run with the
>> '--auto' option.)
>>
>> Because 'gc' is implemented as a maintenance task, the most dramatic change
>> of this series is to convert the 'git gc --auto' calls into 'git maintenance
>> run --auto' calls at the end of some Git commands. By default, the only
>> change is that 'git gc --auto' will be run below an additional 'git
>> maintenance' process.
>>
>> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
>> later with subcommands that manage background maintenance, such as 'start',
>> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
>> it is important to focus on the maintenance activities themselves.
>>
>> An expert user could set up scheduled background maintenance themselves with
>> the current series. I have the following crontab data set up to run
>> maintenance on an hourly basis:
>>
>> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log
> 
> One thing I wonder about - now I have to go and make a new crontab
> (which is easy) or Task Scheduler task (which is a pain) for every repo,
> right?
> 
> Is it infeasible to ask for 'git maintenance' to learn something like
> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> config like "maintenance.targetRepo = /<path-to-repo>"?
> 
>>
>> My config includes all tasks except the 'gc' task. The hourly run is
>> over-aggressive, but is sufficient for testing. I'll replace it with daily
>> when I feel satisfied.
>>
>> Hopefully this direction is seen as a positive one. My goal was to add more
>> options for expert users, along with the flexibility to create background
>> maintenance via the OS in a later series.
>>
>> OUTLINE
>> =======
>>
>> Patches 1-4 remove some references to the_repository in builtin/gc.c before
>> we start depending on code in that builtin.
>>
>> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
>> simple shim over 'git gc' and replace calls to 'git gc --auto' from other
>> commands.
> 
> For me, I'd prefer to see 'git maintenance run' get bigger and 'git gc
> --auto' get smaller or disappear. Is there a plan towards that
> direction, or is that out of scope for 'git maintenance run'? Similar
> examples I can think of include 'git annotate' and 'git pickaxe'.

Thanks for these examples of prior work. I'll keep them in mind.

>>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>>    default, but might have different '--auto' conditions and more config
>>    options.
> 
> Like I mentioned above, for me, I'd rather just see the 'gc' builtin go
> away :)

My hope is that we can absolutely do that. I didn't want to start that
exercise yet, as I don't want to disrupt existing workflows more than
I already am.

It is important to recognize that there are already several "tasks" that
run inside 'gc' including:

1. Expiring reflogs.
2. Repacking all reachable objects.
3. Deleting unreachable objects.
4. Packing refs.

Before trying to "remove" the gc builtin, we would want these to be
represented in the 'git maintenance run' as tasks.

In that direction, I realized after submitting that I should rename
the 'pack-files' task in this submission to 'incremental-repack'
instead, allowing a later 'full-repack' task to represent the role
of that step in the 'gc' task. Some users will prefer one over the
other. Perhaps this incremental/full distinction makes it clear that
there are trade-offs in both directions.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 11:21   ` Derrick Stolee
@ 2020-07-09 12:43     ` Derrick Stolee
  2020-07-09 23:16       ` Jeff King
  2020-07-09 14:05     ` Junio C Hamano
  1 sibling, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 12:43 UTC (permalink / raw)
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/9/2020 7:21 AM, Derrick Stolee wrote:
> On 7/8/2020 7:57 PM, Emily Shaffer wrote:
>> On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> An expert user could set up scheduled background maintenance themselves with
>>> the current series. I have the following crontab data set up to run
>>> maintenance on an hourly basis:
>>>
>>> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log
>>
>> One thing I wonder about - now I have to go and make a new crontab
>> (which is easy) or Task Scheduler task (which is a pain) for every repo,
>> right?
>>
>> Is it infeasible to ask for 'git maintenance' to learn something like
>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
>> config like "maintenance.targetRepo = /<path-to-repo>"?

Sorry that I missed this comment on my first reply.

The intention is that this cron entry will be simpler after I follow up
with the "background" part of maintenance. The idea is to use global
or system config to register a list of repositories that want background
maintenance and have cron execute something like "git maintenance run --all-repos"
to spawn "git -C <repo> maintenance run --scheduled" for all repos in
the config.

For now, this manual setup does end up a bit cluttered if you have a
lot of repos to maintain.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 08/21] maintenance: initialize task array and hashmap
  2020-07-09  2:25   ` Jonathan Tan
@ 2020-07-09 13:15     ` Derrick Stolee
  2020-07-09 13:51       ` Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 13:15 UTC (permalink / raw)
  To: Jonathan Tan, gitgitgadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, derrickstolee, dstolee

On 7/8/2020 10:25 PM, Jonathan Tan wrote:
>> This list is also inserted into a hashmap. This allows command-line
>> arguments to quickly find the tasks by name, not sensitive to case. To
>> ensure this list and hashmap work well together, the list only contains
>> pointers to the struct information. This will allow a sort on the list
>> while preserving the hashmap data.
> 
> I think having the hashmap is unnecessarily complicated in this case -
> with the small number of tasks, a list would be fine. But I don't feel
> strongly about this.

You're probably right that iterating through a list with (hopefully)
at most a dozen entries is fast enough that a hashmap is overkill here.

Now is the real test: can I change this patch in v2 without needing
to mess with any of the others? The intention here was to make adding
tasks as simple as possible, so we shall see. :D

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 08/21] maintenance: initialize task array and hashmap
  2020-07-09 13:15     ` Derrick Stolee
@ 2020-07-09 13:51       ` Junio C Hamano
  0 siblings, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-09 13:51 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jonathan Tan, gitgitgadget, git, Johannes.Schindelin, sandals,
	steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	derrickstolee, dstolee

Derrick Stolee <stolee@gmail.com> writes:

> On 7/8/2020 10:25 PM, Jonathan Tan wrote:
>>> This list is also inserted into a hashmap. This allows command-line
>>> arguments to quickly find the tasks by name, not sensitive to case. To
>>> ensure this list and hashmap work well together, the list only contains
>>> pointers to the struct information. This will allow a sort on the list
>>> while preserving the hashmap data.
>> 
>> I think having the hashmap is unnecessarily complicated in this case -
>> with the small number of tasks, a list would be fine. But I don't feel
>> strongly about this.
>
> You're probably right that iterating through a list with (hopefully)
> at most a dozen entries is fast enough that a hashmap is overkill here.
>
> Now is the real test: can I change this patch in v2 without needing
> to mess with any of the others? The intention here was to make adding
> tasks as simple as possible, so we shall see. :D

Adding a new element to a list would be simple no matter how the
list is represented.  But I think the real question is what access
pattern we expect.  Do we need to look up by name a single one or
selected few?  Do we need the iteration/enumeration be stable?  etc.


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 11:21   ` Derrick Stolee
  2020-07-09 12:43     ` Derrick Stolee
@ 2020-07-09 14:05     ` Junio C Hamano
  2020-07-09 15:54       ` Derrick Stolee
  1 sibling, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-09 14:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> It is important to recognize that there are already several "tasks" that
> run inside 'gc' including:
>
> 1. Expiring reflogs.
> 2. Repacking all reachable objects.
> 3. Deleting unreachable objects.
> 4. Packing refs.
>
> Before trying to "remove" the gc builtin, we would want these to be
> represented in the 'git maintenance run' as tasks.

Yup.  I like the overall direction of this approach to (1) have a
single subcommand that helps all the housekeeping tasks, and to (2)
make sure existing housekeeping tasks are supported by the new one.

I can understand why it is tempting to start with a new 'main()'
under a new subcommand name because we expect to add a lot more
tasks, but the name of that subcommand is much less important.

As can be seen in the list you have above, "gc" already does a lot
more than garbage collection (just #3 is the "garbage collection"
proper), as it has grown by following the same approach.

What's more important is (2) above.  While the tool has grown under
the same "gc" name, it was easier to arrange---it fell out naturally
as a consequence of the development being an enhancement on top of
the prior work.  Now that we are reimplementing, we need to actively
care.  As long as we recognize that, I am perfectly happy with the
current effort.

For existing callers, "git gc --auto" may want to be left alive,
merely as a thin wrapper around "git maintenance --auto", and as
long as the latter is done in the same spirit of the former, i.e.
perform a lightweight check to see if the repository is so out of
shape and then do a minimum cleaning, it would be welcomed by users
if it does a lot more than the current "git gc --auto".

Thanks.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 14:05     ` Junio C Hamano
@ 2020-07-09 15:54       ` Derrick Stolee
  2020-07-09 16:26         ` Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 15:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/9/2020 10:05 AM, Junio C Hamano wrote:
> For existing callers, "git gc --auto" may want to be left alive,
> merely as a thin wrapper around "git maintenance --auto", and as
> long as the latter is done in the same spirit of the former, i.e.
> perform a lightweight check to see if the repository is so out of
> shape and then do a minimum cleaning, it would be welcomed by users
> if it does a lot more than the current "git gc --auto".

It's entirely possible (after the 'maintenance' builtin
stabilizes) that we make 'git gc --auto' become an alias of something
like 'git maintenance run --task=gc --auto' (or itemize all of the
sub-tasks) so that 'git gc --auto' doesn't change behavior.

That's a big motivation for adding all code into builtin/gc.c so
we can access these tasks inside GC without needing to move or
copy the code. I'm trying to preserve history as much as possible.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 15:54       ` Derrick Stolee
@ 2020-07-09 16:26         ` Junio C Hamano
  2020-07-09 16:56           ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-09 16:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 7/9/2020 10:05 AM, Junio C Hamano wrote:
>> For existing callers, "git gc --auto" may want to be left alive,
>> merely as a thin wrapper around "git maintenance --auto", and as
>> long as the latter is done in the same spirit of the former, i.e.
>> perform a lightweight check to see if the repository is so out of
>> shape and then do a minimum cleaning, it would be welcomed by users
>> if it does a lot more than the current "git gc --auto".
>
> It's entirely possible (after the 'maintenance' builtin
> stabilizes) that we make 'git gc --auto' become an alias of something
> like 'git maintenance run --task=gc --auto' (or itemize all of the
> sub-tasks) so that 'git gc --auto' doesn't change behavior.

Yes, it is possible, but I doubt it is desirable.

The current users of "gc --auto" do not (and should not) care about
the details of what tasks are performed.  We surely have added more
stuff that needs maintenance since "gc --auto" was originally
written, and after people have started using "gc --auto" in their
workflows.  For example, I think "gc --auto" predates "rerere gc"
and those who had "gc --auto" in their script had a moment when
suddenly it started to clean stale entries in the rerere database.

Did they get upset when it happened?  Will they get upset when it
starts cleaning up stale commit-graph leftover files?

As long as "gc --auto" kept the same spirit of doing a lightweight
check to see if the repository is so out of shape to require
cleaning and performing a minimum maintenance when it started
calling "rerere gc", and as long as "maintenance --auto" does the
same, I would think the users would be delighted without complaints.

So, I wouldn't worry too much about what exactly happens with the
future versions of "gc --auto".  The world has changed, and we have
more items in the repository that need maintenance/cruft removal.
The command in the new world should deal with this new stuff, too.

Thanks.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 16:26         ` Junio C Hamano
@ 2020-07-09 16:56           ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 16:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/9/2020 12:26 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/9/2020 10:05 AM, Junio C Hamano wrote:
>>> For existing callers, "git gc --auto" may want to be left alive,
>>> merely as a thin wrapper around "git maintenance --auto", and as
>>> long as the latter is done in the same spirit of the former, i.e.
>>> perform a lightweight check to see if the repository is so out of
>>> shape and then do a minimum cleaning, it would be welcomed by users
>>> if it does a lot more than the current "git gc --auto".
>>
>> It's entirely possible (after the 'maintenance' builtin
>> stabilizes) that we make 'git gc --auto' become an alias of something
>> like 'git maintenance run --task=gc --auto' (or itemize all of the
>> sub-tasks) so that 'git gc --auto' doesn't change behavior.
> 
> Yes, it is possible, but I doubt it is desirable.
> 
> The current users of "gc --auto" do not (and should not) care about
> the details of what tasks are performed.  We surely have added more
> stuff that needs maintenance since "gc --auto" was originally
> written, and after people have started using "gc --auto" in their
> workflows.  For example, I think "gc --auto" predates "rerere gc"
> and those who had "gc --auto" in their script had a moment when
> suddenly it started to clean stale entries in the rerere database.
> 
> Did they get upset when it happened?  Will they get upset when it
> starts cleaning up stale commit-graph leftover files?
> 
> As long as "gc --auto" kept the same spirit of doing a lightweight
> check to see if the repository is so out of shape to require
> cleaning and performing a minimum maintenance when it started
> calling "rerere gc", and as long as "maintenance --auto" does the
> same, I would think the users would be delighted without complaints.
> 
> So, I wouldn't worry too much about what exactly happens with the
> future versions of "gc --auto".  The world has changed, and we have
> more items in the repository that need maintenance/cruft removal.
> The command in the new world should deal with this new stuff, too.

Sounds good to me. The extra context around this helps a lot!

-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 09/21] maintenance: add commit-graph task
  2020-07-09 11:14     ` Derrick Stolee
@ 2020-07-09 22:52       ` Jeff King
  2020-07-09 23:41         ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Jeff King @ 2020-07-09 22:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jonathan Tan, gitgitgadget, git, Johannes.Schindelin, sandals,
	steadmon, jrnieder, congdanhqx, phillip.wood123, derrickstolee,
	dstolee

On Thu, Jul 09, 2020 at 07:14:41AM -0400, Derrick Stolee wrote:

> On 7/8/2020 10:29 PM, Jonathan Tan wrote:
> >> +static int run_write_commit_graph(struct repository *r)
> >> +{
> >> +	int result;
> >> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> >> +
> >> +	argv_array_pushl(&cmd, "-C", r->worktree,
> >> +			 "commit-graph", "write",
> >> +			 "--split", "--reachable",
> >> +			 NULL);
> > 
> > As mentioned in my reply to an earlier patch (sent a few minutes ago),
> > this won't work if there are environment variables like GIT_DIR present.
> 
> Do we not pass GIT_DIR to the subcommand? Or does using "-C" override
> the GIT_DIR?

We do pass GIT_DIR to the subcommand, and "-C" does not override it. I
think this code would work as long as "r" is the_repository, which it
would be in the current code. But then the "-C" would be doing nothing
useful (it might change to the top of the worktree if we weren't there
for some reason, but I don't think "commit-graph write" would care
either way).

But if "r" is some other repository, "commit-graph" would continue to
operate in the parent process repository because of the inherited
GIT_DIR. Using "--git-dir" would solve that, but as a general practice,
if you're spawning a sub-process that might be in another repository,
you should clear any repo-specific environment variables. The list is in
local_repo_env, which you can feed to the "env" or "env_array" parameter
of a child_process (see the use in connect.c for an example).
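
To illustrate the point about the inherited environment, here is a minimal
shell sketch (not code from this series). The helper name git_in_repo is
hypothetical. It relies on "git rev-parse --local-env-vars", which lists the
names of the repository-local variables, and clears them in a subshell before
running git in the other repository. This is roughly the shell analogue of
feeding local_repo_env to a child_process:

```shell
# Hypothetical helper: run a git command in another repository with the
# repository-local environment (GIT_DIR, GIT_WORK_TREE, ...) cleared first.
# Usage: git_in_repo <path-to-repo> <git-subcommand> [<args>...]
git_in_repo() {
    repo=$1
    shift
    (
        # The subshell keeps the caller's environment untouched.
        # "git rev-parse --local-env-vars" prints one variable name per line.
        unset $(git rev-parse --local-env-vars)
        git -C "$repo" "$@"
    )
}
```

With GIT_DIR pointing at repository A, a plain "git -C /path/to/B config
some.key" still reads A's config, while "git_in_repo /path/to/B config
some.key" reads B's. The key name is only an example.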

Even in the current scheme where "r" is always the_repository, I suspect
this might still be buggy. If we're in a bare repository, presumably
r->worktree would be NULL.

-Peff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 12:43     ` Derrick Stolee
@ 2020-07-09 23:16       ` Jeff King
  2020-07-09 23:45         ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Jeff King @ 2020-07-09 23:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee

On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:

> >> Is it infeasible to ask for 'git maintenance' to learn something like
> >> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> >> config like "maintenance.targetRepo = /<path-to-repo>"?
> 
> Sorry that I missed this comment on my first reply.
> 
> The intention is that this cron entry will be simpler after I follow up
> with the "background" part of maintenance. The idea is to use global
> or system config to register a list of repositories that want background
> maintenance and have cron execute something like "git maintenance run --all-repos"
> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
> the config.
> 
> For now, this manual setup does end up a bit cluttered if you have a
> lot of repos to maintain.

I think it might be useful to have a general command to run a subcommand
in a bunch of repositories. Something like:

  git for-each-repo --recurse /path/to/repos git maintenance ...

which would root around in /path/to/repos for any git-dirs and run "git
--git-dir=$GIT_DIR maintenance ..." on each of them.

And/or:

  git for-each-repo --config maintenance.repos git maintenance ...

which would pull the set of repos from the named config variable instead
of looking around the filesystem.

You could use either as a one-liner in the crontab (depending on which
is easier with your repo layout).
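
The --config variant can be approximated in shell. This is only a sketch: the
function name for_each_repo is made up for illustration, and it assumes that
the multi-valued config variable holds one repository path per value and that
no GIT_DIR is set in the caller's environment (as is normally the case under
cron):

```shell
# Hypothetical sketch: run "git -C <repo> <args>..." once for every value of
# a multi-valued config variable.
# Usage: for_each_repo <config-key> <git-subcommand> [<args>...]
for_each_repo() {
    key=$1
    shift
    git config --get-all "$key" |
    while IFS= read -r repo; do
        # Redirect stdin so the command cannot consume the list of repos.
        git -C "$repo" "$@" </dev/null ||
            echo "command failed in $repo" >&2
    done
}
```

A small wrapper script called from the crontab could then run something like
"for_each_repo maintenance.repos maintenance run".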

-Peff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 09/21] maintenance: add commit-graph task
  2020-07-09 22:52       ` Jeff King
@ 2020-07-09 23:41         ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 23:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Tan, gitgitgadget, git, Johannes.Schindelin, sandals,
	steadmon, jrnieder, congdanhqx, phillip.wood123, derrickstolee,
	dstolee

On 7/9/2020 6:52 PM, Jeff King wrote:
> On Thu, Jul 09, 2020 at 07:14:41AM -0400, Derrick Stolee wrote:
> 
>> On 7/8/2020 10:29 PM, Jonathan Tan wrote:
>>>> +static int run_write_commit_graph(struct repository *r)
>>>> +{
>>>> +	int result;
>>>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>>>> +
>>>> +	argv_array_pushl(&cmd, "-C", r->worktree,
>>>> +			 "commit-graph", "write",
>>>> +			 "--split", "--reachable",
>>>> +			 NULL);
>>>
>>> As mentioned in my reply to an earlier patch (sent a few minutes ago),
>>> this won't work if there are environment variables like GIT_DIR present.
>>
>> Do we not pass GIT_DIR to the subcommand? Or does using "-C" override
>> the GIT_DIR?
> 
> We do pass GIT_DIR to the subcommand, and "-C" does not override it. I
> think this code would work as long as "r" is the_repository, which it
> would be in the current code. But then the "-C" would be doing nothing
> useful (it might change to the top of the worktree if we weren't there
> for some reason, but I don't think "commit-graph write" would care
> either way).
> 
> But if "r" is some other repository, "commit-graph" would continue to
> operate in the parent process repository because of the inherited
> GIT_DIR. Using "--git-dir" would solve that, but as a general practice,
> if you're spawning a sub-process that might be in another repository,
> you should clear any repo-specific environment variables. The list is in
> local_repo_env, which you can feed to the "env" or "env_array" parameter
> of a child_process (see the use in connect.c for an example).
> 
> Even in the current scheme where "r" is always the_repository, I suspect
> this might still be buggy. If we're in a bare repository, presumably
> r->worktree would be NULL.

Ah. I'll investigate this more and work to create a way to
run a subcommand in a given repository. Your pointers will
help a lot.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 23:16       ` Jeff King
@ 2020-07-09 23:45         ` Derrick Stolee
  2020-07-10 18:46           ` Emily Shaffer
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-09 23:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee

On 7/9/2020 7:16 PM, Jeff King wrote:
> On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:
> 
>>>> Is it infeasible to ask for 'git maintenance' to learn something like
>>>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
>>>> config like "maintenance.targetRepo = /<path-to-repo>"?
>>
>> Sorry that I missed this comment on my first reply.
>>
>> The intention is that this cron entry will be simpler after I follow up
>> with the "background" part of maintenance. The idea is to use global
>> or system config to register a list of repositories that want background
>> maintenance and have cron execute something like "git maintenance run --all-repos"
>> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
>> the config.
>>
>> For now, this manual setup does end up a bit cluttered if you have a
>> lot of repos to maintain.
> 
> I think it might be useful to have a general command to run a subcommand
> in a bunch of repositories. Something like:
> 
>   git for-each-repo --recurse /path/to/repos git maintenance ...
> 
> which would root around in /path/to/repos for any git-dirs and run "git
> --git-dir=$GIT_DIR maintenance ..." on each of them.
> 
> And/or:
> 
>   git for-each-repo --config maintenance.repos git maintenance ...
> 
> which would pull the set of repos from the named config variable instead
> of looking around the filesystem.

Yes! This! That's a good way to make something generic that solves
the problem at hand, but might also have other applications! Most
excellent.

> You could use either as a one-liner in the crontab (depending on which
> is easier with your repo layout).

The hope is that we can have such a clean layout. I'm particularly
fond of the config option because users may want to opt-in to
background maintenance only on some repos, even if they put them
in a consistent location.

In the _far_ future, we might even want to add a repo to this
"maintenance.repos" list during 'git init' and 'git clone' so
this is automatic. It then becomes opt-out at that point, which
is why I say the _far, far_ future.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 23:45         ` Derrick Stolee
@ 2020-07-10 18:46           ` Emily Shaffer
  2020-07-10 19:30             ` Son Luong Ngoc
  0 siblings, 1 reply; 164+ messages in thread
From: Emily Shaffer @ 2020-07-10 18:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee

On Thu, Jul 09, 2020 at 07:45:47PM -0400, Derrick Stolee wrote:
> 
> On 7/9/2020 7:16 PM, Jeff King wrote:
> > On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:
> > 
> >>>> Is it infeasible to ask for 'git maintenance' to learn something like
> >>>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> >>>> config like "maintenance.targetRepo = /<path-to-repo>"?
> >>
> >> Sorry that I missed this comment on my first reply.
> >>
> >> The intention is that this cron entry will be simpler after I follow up
> >> with the "background" part of maintenance. The idea is to use global
> >> or system config to register a list of repositories that want background
> >> maintenance and have cron execute something like "git maintenance run --all-repos"
> >> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
> >> the config.
> >>
> >> For now, this manual setup does end up a bit cluttered if you have a
> >> lot of repos to maintain.
> > 
> > I think it might be useful to have a general command to run a subcommand
> > in a bunch of repositories. Something like:
> > 
> >   git for-each-repo --recurse /path/to/repos git maintenance ...
> > 
> > which would root around in /path/to/repos for any git-dirs and run "git
> > --git-dir=$GIT_DIR maintenance ..." on each of them.
> > 
> > And/or:
> > 
> >   git for-each-repo --config maintenance.repos git maintenance ...
> > 
> > which would pull the set of repos from the named config variable instead
> > of looking around the filesystem.
> 
> Yes! This! That's a good way to make something generic that solves
> the problem at hand, but might also have other applications! Most
> excellent.

I'm glad I wasn't the only one super geeked when I read this idea. I'd
use the heck out of this in my .bashrc too. Sounds awesome. I actually
had a short-lived fling last year with a script to summarize my
uncommitted changes in all repos at the beginning of every session
(dropped because it became one more thing to gloss over) and could have
really used this command.

> 
> > You could use either as a one-liner in the crontab (depending on which
> > is easier with your repo layout).
> 
> The hope is that we can have such a clean layout. I'm particularly
> fond of the config option because users may want to opt-in to
> background maintenance only on some repos, even if they put them
> in a consistent location.
> 
> In the _far_ future, we might even want to add a repo to this
> "maintenance.repos" list during 'git init' and 'git clone' so
> this is automatic. It then becomes opt-out at that point, which
> is why I say the _far, far_ future.

Oh, I like this idea a lot. Then I can do something silly like

  alias reproclone="git clone --no-maintenance"

and get the benefits on everything else that I plan to be using
frequently.

 - Emily

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-10 18:46           ` Emily Shaffer
@ 2020-07-10 19:30             ` Son Luong Ngoc
  0 siblings, 0 replies; 164+ messages in thread
From: Son Luong Ngoc @ 2020-07-10 19:30 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Derrick Stolee, Jeff King, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee



> On Jul 10, 2020, at 20:46, Emily Shaffer <emilyshaffer@google.com> wrote:
> 
> On Thu, Jul 09, 2020 at 07:45:47PM -0400, Derrick Stolee wrote:
>> 
>> On 7/9/2020 7:16 PM, Jeff King wrote:
>>> On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:
>>> 
>>>>>> Is it infeasible to ask for 'git maintenance' to learn something like
>>>>>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
>>>>>> config like "maintenance.targetRepo = /<path-to-repo>"?
>>>> 
>>>> Sorry that I missed this comment on my first reply.
>>>> 
>>>> The intention is that this cron entry will be simpler after I follow up
>>>> with the "background" part of maintenance. The idea is to use global
>>>> or system config to register a list of repositories that want background
>>>> maintenance and have cron execute something like "git maintenance run --all-repos"
>>>> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
>>>> the config.
>>>> 
>>>> For now, this manual setup does end up a bit cluttered if you have a
>>>> lot of repos to maintain.
>>> 
>>> I think it might be useful to have a general command to run a subcommand
>>> in a bunch of repositories. Something like:
>>> 
>>>  git for-each-repo --recurse /path/to/repos git maintenance ...
>>> 
>>> which would root around in /path/to/repos for any git-dirs and run "git
>>> --git-dir=$GIT_DIR maintenance ..." on each of them.
>>> 
>>> And/or:
>>> 
>>>  git for-each-repo --config maintenance.repos git maintenance ...
>>> 
>>> which would pull the set of repos from the named config variable instead
>>> of looking around the filesystem.
>> 
>> Yes! This! That's a good way to make something generic that solves
>> the problem at hand, but might also have other applications! Most
>> excellent.
> 
> I'm glad I wasn't the only one super geeked when I read this idea. I'd
> use the heck out of this in my .bashrc too. Sounds awesome. I actually
> had a short-lived fling last year with a script to summarize my
> uncommitted changes in all repos at the beginning of every session
> (dropped because it became one more thing to gloss over) and could have
> really used this command.

I was planning to build a CLI tool that helps manage maintenance across
multiple repos, much like what was just described here.
In my experience with my poor-man-scalar [1] bash script, the process count
can get out of control quite quickly across multiple repositories, and there
are probably other issues that I have not thought of or encountered yet.

There is definitely a need to keep all the repos updated with pre-fetch 
and updated commit-graph, while staying compact / garbage free.
Having this in Git does simplify a lot of daily operations for end users.

> 
>> 
>>> You could use either as a one-liner in the crontab (depending on which
>>> is easier with your repo layout).
>> 
>> The hope is that we can have such a clean layout. I'm particularly
>> fond of the config option because users may want to opt-in to
>> background maintenance only on some repos, even if they put them
>> in a consistent location.
>> 
>> In the _far_ future, we might even want to add a repo to this
>> "maintenance.repos" list during 'git init' and 'git clone' so
>> this is automatic. It then becomes opt-out at that point, which
>> is why I say the _far, far_ future.
> 
> Oh, I like this idea a lot. Then I can do something silly like
> 
>  alias reproclone="git clone --no-maintenance"
> 
> and get the benefits on everything else that I plan to be using
> frequently.

This is starting to remind me of automatic updates in some popular OSes,
where downloading, installing, and cleaning up updates for multiple software
components is managed by a single tool.

I wonder if this is the path git should take in the 'new world' that Junio mentioned. [2]

But I am also super geeked reading this. :)

> 
> - Emily

Regards,
Son Luong.

[1]: https://github.com/sluongng/git-care
[2]: https://lore.kernel.org/git/xmqqmu48y7rw.fsf@gitster.c.googlers.com/

* [PATCH v2 00/18] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-07 14:21 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Derrick Stolee via GitGitGadget
                   ` (21 preceding siblings ...)
  2020-07-08 23:57 ` [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
@ 2020-07-23 17:56 ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
                     ` (19 more replies)
  22 siblings, 20 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee

This is a second attempt at redesigning Git's repository maintenance
patterns. The first attempt [1] included a way to run jobs in the background
using a long-lived process; that idea was rejected and is not included in
this series. A future series will use the OS to handle scheduling tasks.

[1] https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/

As mentioned before, git gc already plays the role of maintaining Git
repositories. It has accumulated several smaller pieces in its long history,
including:

 1. Repacking all reachable objects into one pack-file (and deleting
    unreachable objects).
 2. Packing refs.
 3. Expiring reflogs.
 4. Clearing rerere logs.
 5. Updating the commit-graph file.
 6. Pruning worktrees.

While expiring reflogs, clearing rerere logs, and deleting unreachable
objects are suitable under the guise of "garbage collection", packing refs
and updating the commit-graph file are not as obviously fitting. Further,
these operations are "all or nothing" in that they rewrite almost all
repository data, which does not perform well at extremely large scales.
These operations can also be disruptive to foreground Git commands when git
gc --auto triggers during routine use.

This series does not intend to change what git gc does, but instead create
new choices for automatic maintenance activities, of which git gc remains
the only one enabled by default.

The new maintenance tasks are:

 * 'commit-graph' : write and verify a single layer of an incremental
   commit-graph.
 * 'loose-objects' : prune packed loose objects, then create a new pack from
   a batch of loose objects.
 * 'pack-files' : expire redundant packs from the multi-pack-index, then
   repack using the multi-pack-index's incremental repack strategy.
 * 'fetch' : fetch from each remote, storing the refs in
   'refs/prefetch/<remote>/'.
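
A minimal shell sketch of the fetch behind that last task, based on the
options shown in the "add prefetch task" patch (--prune, --no-tags, an empty
--refmap=, and a refspec under refs/prefetch/). The helper names
prefetch_refspec and prefetch_remote are hypothetical; the real task is
implemented in C in builtin/gc.c and may pass further options:

```shell
# prefetch_refspec REMOTE: print the refspec that stores the remote's
# branches under refs/prefetch/REMOTE/ instead of refs/remotes/REMOTE/.
prefetch_refspec () {
	printf '+refs/heads/*:refs/prefetch/%s/*' "$1"
}

# prefetch_remote REMOTE: download new objects from REMOTE without
# touching refs/heads, refs/remotes, or tags.
prefetch_remote () {
	git fetch "$1" --prune --no-tags --refmap= "$(prefetch_refspec "$1")"
}
```

With this sketch, "prefetch_remote origin" would only create or update refs
under refs/prefetch/origin/, so a later foreground 'git fetch origin' still
reports new and updated branches to the user.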

These tasks are all disabled by default, but can be enabled with config
options or run explicitly using "git maintenance run --task=<task>". There
are additional config options for customizing the conditions under which
the tasks run with the '--auto' option. ('fetch' will never run with the
'--auto' option.)
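
As a rough illustration of those two routes, the sketch below wraps them in
shell helpers. The config key follows the 'maintenance.<task>.enabled' patch
in this series and is assumed to take a boolean; the helper names enable_task
and run_task are hypothetical:

```shell
# enable_task TASK: opt the current repository in to one maintenance task
# so that a plain "git maintenance run" includes it.
enable_task () {
	git config "maintenance.$1.enabled" true
}

# run_task TASK: run a single task explicitly, whether or not it is
# enabled in config.
run_task () {
	git maintenance run --task="$1"
}
```

For example, "enable_task commit-graph" followed by "run_task prefetch"
would enable one task persistently and run another once, using the task
names from v2 of the series.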

Because 'gc' is implemented as a maintenance task, the most dramatic change
of this series is to convert the 'git gc --auto' calls into 'git maintenance
run --auto' calls at the end of some Git commands. By default, the only
change is that 'git gc --auto' will be run below an additional 'git
maintenance' process.

The 'git maintenance' builtin has a 'run' subcommand so it can be extended
later with subcommands that manage background maintenance, such as 'start',
'stop', 'pause', or 'schedule'. These are not the subject of this series, as
it is important to focus on the maintenance activities themselves.

An expert user could set up scheduled background maintenance themselves with
the current series. I have the following crontab data set up to run
maintenance on an hourly basis:

0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

My config includes all tasks except the 'gc' task. The hourly run is
over-aggressive, but is sufficient for testing. I'll replace it with daily
when I feel satisfied.
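
To extend that single crontab entry to several repositories, a small wrapper
could visit them one after another, which also keeps the number of concurrent
processes bounded. This is only a sketch: the function name for_each_repo and
the idea of reading paths from a 'maintenance.repos' config list are
assumptions borrowed from the review discussion of v1, not something this
series implements.

```shell
# for_each_repo CMD [ARG...]: read repository paths from stdin, one per
# line, and run CMD in each of them serially. Returns non-zero if any
# repository could not be entered or CMD failed there.
for_each_repo () {
	rc=0
	while IFS= read -r repo
	do
		# Skip blank lines in the input.
		test -n "$repo" || continue
		# Use a subshell so the caller's working directory is unchanged,
		# and redirect stdin so CMD cannot swallow the remaining paths.
		( cd "$repo" && "$@" </dev/null ) || rc=1
	done
	return "$rc"
}
```

A crontab line could then call a script containing something like:

  git config --global --get-all maintenance.repos |
  for_each_repo git maintenance run --no-quiet

Running the repositories serially trades total wall-clock time for a
predictable load on the machine.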

Hopefully this direction is seen as a positive one. My goal was to add more
options for expert users, along with the flexibility to create background
maintenance via the OS in a later series.

OUTLINE
=======

Patches 1-4 remove some references to the_repository in builtin/gc.c before
we start depending on code in that builtin.

Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
simple shim over 'git gc' and replace calls to 'git gc --auto' from other
commands.

Patches 8-15 create new maintenance tasks. These are the same tasks sent in
the previous RFC.

Patches 16-21 create more customization through config and perform other
polish items.

FUTURE WORK
===========

 * Add 'start', 'stop', and 'schedule' subcommands to initialize the
   commands run in the background. You can see my progress towards this goal
   here: https://github.com/gitgitgadget/git/pull/680
 * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
   default, but might have different '--auto' conditions and more config
   options.
 * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
   with use of the 'commit-graph' task.
 * Update the builtin to use struct repository *r properly, especially when
   calling subcommands.

UPDATES in V2
=============

I'm sending this between v2.28.0-rc2 and v2.28.0, as things have become a bit
quiet around the release.

 * The biggest disruption to the range-diff is that I removed the premature
   use of struct repository *r and instead continue to rely on
   the_repository. This means several patches were dropped that did prep
   work in builtin/gc.c.
 * I dropped the task hashmap and opted for a linear scan. This task list
   will always be too small to justify the extra complication of the
   hashmap.
 * struct maintenance_opts is properly static now.
 * Some tasks are renamed: fetch -> prefetch, pack-files ->
   incremental-repack.
 * With the rename, the prefetch task uses refs/prefetch/ instead of
   refs/hidden/.
 * A trace2 region around the task executions is added.

Thanks, -Stolee

Derrick Stolee (18):
  maintenance: create basic maintenance runner
  maintenance: add --quiet option
  maintenance: replace run_auto_gc()
  maintenance: initialize task array
  maintenance: add commit-graph task
  maintenance: add --task option
  maintenance: take a lock on the objects directory
  maintenance: add prefetch task
  maintenance: add loose-objects task
  maintenance: add incremental-repack task
  maintenance: auto-size incremental-repack batch
  maintenance: create maintenance.<task>.enabled config
  maintenance: use pointers to check --auto
  maintenance: add auto condition for commit-graph task
  maintenance: create auto condition for loose-objects
  maintenance: add incremental-repack auto condition
  midx: use start_delayed_progress()
  maintenance: add trace2 regions for task execution

 .gitignore                           |   1 +
 Documentation/config.txt             |   2 +
 Documentation/config/maintenance.txt |  32 ++
 Documentation/fetch-options.txt      |   5 +-
 Documentation/git-clone.txt          |   7 +-
 Documentation/git-maintenance.txt    | 124 +++++
 builtin.h                            |   1 +
 builtin/am.c                         |   2 +-
 builtin/commit.c                     |   2 +-
 builtin/fetch.c                      |   6 +-
 builtin/gc.c                         | 753 +++++++++++++++++++++++++++
 builtin/merge.c                      |   2 +-
 builtin/rebase.c                     |   4 +-
 commit-graph.c                       |   8 +-
 commit-graph.h                       |   1 +
 git.c                                |   1 +
 midx.c                               |  12 +-
 midx.h                               |   1 +
 object.h                             |   1 +
 run-command.c                        |   7 +-
 run-command.h                        |   2 +-
 t/t5319-multi-pack-index.sh          |  14 +-
 t/t5510-fetch.sh                     |   2 +-
 t/t5514-fetch-multiple.sh            |   2 +-
 t/t7900-maintenance.sh               | 211 ++++++++
 25 files changed, 1169 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/config/maintenance.txt
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh


base-commit: 4a0fcf9f760c9774be77f51e1e88a7499b53d2e2
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/671

Range-diff vs v1:

  1:  85dda7db28 <  -:  ---------- gc: use the_repository less often
  2:  303ad4bdc7 <  -:  ---------- gc: use repository in too_many_loose_objects()
  3:  9dfa9e8f6f <  -:  ---------- gc: use repo config
  4:  e13d351e9f <  -:  ---------- gc: drop the_repository in log location
  5:  5f89e0ec9b !  1:  63ec602a07 maintenance: create basic maintenance runner
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      +	NULL
      +};
      +
     -+struct maintenance_opts {
     ++static struct maintenance_opts {
      +	int auto_flag;
      +} opts;
      +
     -+static int maintenance_task_gc(struct repository *r)
     ++static int maintenance_task_gc(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      +	if (opts.auto_flag)
      +		argv_array_pushl(&cmd, "--auto", NULL);
      +
     -+	close_object_store(r->objects);
     ++	close_object_store(the_repository->objects);
      +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +	argv_array_clear(&cmd);
      +
      +	return result;
      +}
      +
     -+static int maintenance_run(struct repository *r)
     ++static int maintenance_run(void)
      +{
     -+	return maintenance_task_gc(r);
     ++	return maintenance_task_gc();
      +}
      +
      +int cmd_maintenance(int argc, const char **argv, const char *prefix)
      +{
     -+	struct repository *r = the_repository;
     -+
      +	static struct option builtin_maintenance_options[] = {
      +		OPT_BOOL(0, "auto", &opts.auto_flag,
      +			 N_("run tasks based on the state of the repository")),
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      +
      +	if (argc == 1) {
      +		if (!strcmp(argv[0], "run"))
     -+			return maintenance_run(r);
     ++			return maintenance_run();
      +	}
      +
      +	usage_with_options(builtin_maintenance_usage,
  6:  018a9331e2 !  2:  1d37e55cb7 maintenance: add --quiet option
     @@ Documentation/git-maintenance.txt: OPTIONS
       ## builtin/gc.c ##
      @@ builtin/gc.c: static const char * const builtin_maintenance_usage[] = {
       
     - struct maintenance_opts {
     + static struct maintenance_opts {
       	int auto_flag;
      +	int quiet;
       } opts;
       
     - static int maintenance_task_gc(struct repository *r)
     -@@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
     + static int maintenance_task_gc(void)
     +@@ builtin/gc.c: static int maintenance_task_gc(void)
       
       	if (opts.auto_flag)
       		argv_array_pushl(&cmd, "--auto", NULL);
      +	if (opts.quiet)
      +		argv_array_pushl(&cmd, "--quiet", NULL);
       
     - 	close_object_store(r->objects);
     + 	close_object_store(the_repository->objects);
       	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      @@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
       	static struct option builtin_maintenance_options[] = {
  7:  335a8938c6 =  3:  f164d1a0b4 maintenance: replace run_auto_gc()
  8:  5cdd38afa6 !  4:  8e260bccf1 maintenance: initialize task array and hashmap
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    maintenance: initialize task array and hashmap
     +    maintenance: initialize task array
      
          In anticipation of implementing multiple maintenance tasks inside the
     -    'maintenance' builtin, use a list and hashmap of structs to describe the
     -    work to be done.
     +    'maintenance' builtin, use a list of structs to describe the work to be
     +    done.
      
          The struct maintenance_task stores the name of the task (as given by a
          future command-line argument) along with a function pointer to its
     @@ Commit message
          contains the "gc" task. This task is also the only task enabled by
          default.
      
     -    This list is also inserted into a hashmap. This allows command-line
     -    arguments to quickly find the tasks by name, not sensitive to case. To
     -    ensure this list and hashmap work well together, the list only contains
     -    pointers to the struct information. This will allow a sort on the list
     -    while preserving the hashmap data.
     -
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/gc.c ##
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
       static const char * const builtin_maintenance_usage[] = {
       	N_("git maintenance run [<options>]"),
       	NULL
     -@@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_task_gc(void)
       	return result;
       }
       
     -+typedef int maintenance_task_fn(struct repository *r);
     ++typedef int maintenance_task_fn(void);
      +
      +struct maintenance_task {
     -+	struct hashmap_entry ent;
      +	const char *name;
      +	maintenance_task_fn *fn;
      +	unsigned enabled:1;
      +};
      +
     -+static int task_entry_cmp(const void *unused_cmp_data,
     -+			  const struct hashmap_entry *eptr,
     -+			  const struct hashmap_entry *entry_or_key,
     -+			  const void *keydata)
     -+{
     -+	const struct maintenance_task *e1, *e2;
     -+	const char *name = keydata;
     -+
     -+	e1 = container_of(eptr, const struct maintenance_task, ent);
     -+	e2 = container_of(entry_or_key, const struct maintenance_task, ent);
     -+
     -+	return strcasecmp(e1->name, name ? name : e2->name);
     -+}
     ++static struct maintenance_task *tasks[MAX_NUM_TASKS];
     ++static int num_tasks;
      +
     -+struct maintenance_task *tasks[MAX_NUM_TASKS];
     -+int num_tasks;
     -+struct hashmap task_map;
     -+
     - static int maintenance_run(struct repository *r)
     + static int maintenance_run(void)
       {
     --	return maintenance_task_gc(r);
     +-	return maintenance_task_gc();
      +	int i;
      +	int result = 0;
      +
      +	for (i = 0; !result && i < num_tasks; i++) {
      +		if (!tasks[i]->enabled)
      +			continue;
     -+		result = tasks[i]->fn(r);
     ++		result = tasks[i]->fn();
      +	}
      +
      +	return result;
     @@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
      +	tasks[num_tasks]->fn = maintenance_task_gc;
      +	tasks[num_tasks]->enabled = 1;
      +	num_tasks++;
     -+
     -+	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
     -+
     -+	for (i = 0; i < num_tasks; i++) {
     -+		hashmap_entry_init(&tasks[i]->ent,
     -+				   strihash(tasks[i]->name));
     -+		hashmap_add(&task_map, &tasks[i]->ent);
     -+	}
       }
       
       int cmd_maintenance(int argc, const char **argv, const char *prefix)
  9:  c8fbd14d41 !  5:  04552b1d2e maintenance: add commit-graph task
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
       
       static const char * const builtin_maintenance_usage[] = {
       	N_("git maintenance run [<options>]"),
     -@@ builtin/gc.c: struct maintenance_opts {
     +@@ builtin/gc.c: static struct maintenance_opts {
       	int quiet;
       } opts;
       
     -+static int run_write_commit_graph(struct repository *r)
     ++static int run_write_commit_graph(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
      +
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "commit-graph", "write",
     -+			 "--split", "--reachable",
     -+			 NULL);
     ++	argv_array_pushl(&cmd, "commit-graph", "write",
     ++			 "--split", "--reachable", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_pushl(&cmd, "--no-progress", NULL);
     ++		argv_array_push(&cmd, "--no-progress");
      +
      +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +	argv_array_clear(&cmd);
     @@ builtin/gc.c: struct maintenance_opts {
      +	return result;
      +}
      +
     -+static int run_verify_commit_graph(struct repository *r)
     ++static int run_verify_commit_graph(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
      +
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "commit-graph", "verify",
     ++	argv_array_pushl(&cmd, "commit-graph", "verify",
      +			 "--shallow", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_pushl(&cmd, "--no-progress", NULL);
     ++		argv_array_push(&cmd, "--no-progress");
      +
      +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +	argv_array_clear(&cmd);
     @@ builtin/gc.c: struct maintenance_opts {
      +	return result;
      +}
      +
     -+static int maintenance_task_commit_graph(struct repository *r)
     ++static int maintenance_task_commit_graph(void)
      +{
     ++	struct repository *r = the_repository;
      +	char *chain_path;
      +
      +	/* Skip commit-graph when --auto is specified. */
     @@ builtin/gc.c: struct maintenance_opts {
      +		return 0;
      +
      +	close_object_store(r->objects);
     -+	if (run_write_commit_graph(r)) {
     ++	if (run_write_commit_graph()) {
      +		error(_("failed to write commit-graph"));
      +		return 1;
      +	}
      +
     -+	if (!run_verify_commit_graph(r))
     ++	if (!run_verify_commit_graph())
      +		return 0;
      +
      +	warning(_("commit-graph verify caught error, rewriting"));
     @@ builtin/gc.c: struct maintenance_opts {
      +	}
      +	free(chain_path);
      +
     -+	if (!run_write_commit_graph(r))
     ++	if (!run_write_commit_graph())
      +		return 0;
      +
      +	error(_("failed to rewrite commit-graph"));
      +	return 1;
      +}
      +
     - static int maintenance_task_gc(struct repository *r)
     + static int maintenance_task_gc(void)
       {
       	int result;
      @@ builtin/gc.c: static void initialize_tasks(void)
     + 	tasks[num_tasks]->fn = maintenance_task_gc;
       	tasks[num_tasks]->enabled = 1;
       	num_tasks++;
     - 
     ++
      +	tasks[num_tasks]->name = "commit-graph";
      +	tasks[num_tasks]->fn = maintenance_task_commit_graph;
      +	num_tasks++;
     -+
     - 	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
     + }
       
     - 	for (i = 0; i < num_tasks; i++) {
     + int cmd_maintenance(int argc, const char **argv, const char *prefix)
      
       ## commit-graph.c ##
      @@ commit-graph.c: static char *get_split_graph_filename(struct object_directory *odb,
 10:  c081a3bd29 !  6:  a09b1c1687 maintenance: add --task option
     @@ Documentation/git-maintenance.txt: OPTIONS
      
       ## builtin/gc.c ##
      @@ builtin/gc.c: static const char * const builtin_maintenance_usage[] = {
     - struct maintenance_opts {
     + static struct maintenance_opts {
       	int auto_flag;
       	int quiet;
      +	int tasks_selected;
       } opts;
       
     - static int run_write_commit_graph(struct repository *r)
     -@@ builtin/gc.c: struct maintenance_task {
     - 	struct hashmap_entry ent;
     + static int run_write_commit_graph(void)
     +@@ builtin/gc.c: typedef int maintenance_task_fn(void);
     + struct maintenance_task {
       	const char *name;
       	maintenance_task_fn *fn;
      -	unsigned enabled:1;
     @@ builtin/gc.c: struct maintenance_task {
      +		 selected:1;
       };
       
     - static int task_entry_cmp(const void *unused_cmp_data,
     -@@ builtin/gc.c: struct maintenance_task *tasks[MAX_NUM_TASKS];
     - int num_tasks;
     - struct hashmap task_map;
     + static struct maintenance_task *tasks[MAX_NUM_TASKS];
     + static int num_tasks;
       
      +static int compare_tasks_by_selection(const void *a_, const void *b_)
      +{
     @@ builtin/gc.c: struct maintenance_task *tasks[MAX_NUM_TASKS];
      +	return b->task_order - a->task_order;
      +}
      +
     - static int maintenance_run(struct repository *r)
     + static int maintenance_run(void)
       {
       	int i;
       	int result = 0;
     @@ builtin/gc.c: struct maintenance_task *tasks[MAX_NUM_TASKS];
      +		if (!opts.tasks_selected && !tasks[i]->enabled)
       			continue;
      +
     - 		result = tasks[i]->fn(r);
     + 		result = tasks[i]->fn();
       	}
       
      @@ builtin/gc.c: static void initialize_tasks(void)
     - 	}
     + 	num_tasks++;
       }
       
      +static int task_option_parse(const struct option *opt,
      +			     const char *arg, int unset)
      +{
     -+	struct maintenance_task *task;
     -+	struct maintenance_task key;
     ++	int i;
     ++	struct maintenance_task *task = NULL;
      +
      +	BUG_ON_OPT_NEG(unset);
      +
     @@ builtin/gc.c: static void initialize_tasks(void)
      +
      +	opts.tasks_selected++;
      +
     -+	key.name = arg;
     -+	hashmap_entry_init(&key.ent, strihash(key.name));
     -+
     -+	task = hashmap_get_entry(&task_map, &key, ent, NULL);
     ++	for (i = 0; i < MAX_NUM_TASKS; i++) {
     ++		if (tasks[i] && !strcasecmp(tasks[i]->name, arg)) {
     ++			task = tasks[i];
     ++			break;
     ++		}
     ++	}
      +
      +	if (!task) {
      +		error(_("'%s' is not a valid task"), arg);
     @@ builtin/gc.c: static void initialize_tasks(void)
      +
       int cmd_maintenance(int argc, const char **argv, const char *prefix)
       {
     - 	struct repository *r = the_repository;
     + 	static struct option builtin_maintenance_options[] = {
      @@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
       			 N_("run tasks based on the state of the repository")),
       		OPT_BOOL(0, "quiet", &opts.quiet,
 11:  fc1fb5f3cc !  7:  e9260a9c3f maintenance: take a lock on the objects directory
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: static int maintenance_run(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_run(void)
       {
       	int i;
       	int result = 0;
      +	struct lock_file lk;
     ++	struct repository *r = the_repository;
      +	char *lock_path = xstrfmt("%s/maintenance", r->objects->odb->path);
      +
      +	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
     @@ builtin/gc.c: static int maintenance_run(struct repository *r)
       
       	if (opts.tasks_selected)
       		QSORT(tasks, num_tasks, compare_tasks_by_selection);
     -@@ builtin/gc.c: static int maintenance_run(struct repository *r)
     - 		result = tasks[i]->fn(r);
     +@@ builtin/gc.c: static int maintenance_run(void)
     + 		result = tasks[i]->fn();
       	}
       
      +	rollback_lock_file(&lk);
 12:  cbaa5ecc4f !  8:  3165b8916d maintenance: add fetch task
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    maintenance: add fetch task
     +    maintenance: add prefetch task
      
          When working with very large repositories, an incremental 'git fetch'
          command can download a large amount of data. If there are many other
     @@ Commit message
          the background. This can break up a large daily fetch into several
          smaller hourly fetches.
      
     +    The task is called "prefetch" because it is work done in advance
     +    of a foreground fetch to make that 'git fetch' command much faster.
     +
          However, if we simply ran 'git fetch <remote>' in the background,
          then the user running a foregroudn 'git fetch <remote>' would lose
          some important feedback when a new branch appears or an existing
     @@ Commit message
          2. --refmap= removes the configured refspec which usually updates
             refs/remotes/<remote>/* with the refs advertised by the remote.
      
     -    3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
     +    3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
             we can ensure that we actually load the new values somewhere in
             our refspace while not updating refs/heads or refs/remotes. By
             storing these refs here, the commit-graph job will update the
             commit-graph with the commits from these hidden refs.
      
     -    4. --prune will delete the refs/hidden/<remote> refs that no
     +    4. --prune will delete the refs/prefetch/<remote> refs that no
             longer appear on the remote.
      
          We've been using this step as a critical background job in Scalar
     @@ Documentation/git-maintenance.txt: since it will not expire `.graph` files that
       `commit-graph-chain` file. They will be deleted by a later run based on
       the expiration delay.
       
     -+fetch::
     -+	The `fetch` job updates the object directory with the latest objects
     ++prefetch::
     ++	The `fetch` task updates the object directory with the latest objects
      +	from all registered remotes. For each remote, a `git fetch` command
      +	is run. The refmap is custom to avoid updating local or remote
      +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
     -+	remote refs are stored in `refs/hidden/<remote>/`. Also, no tags are
     -+	updated.
     ++	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
     ++	not updated.
      ++
      +This means that foreground fetches are still required to update the
      +remote refs, but the users is notified when the branches and tags are
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
       
       static const char * const builtin_maintenance_usage[] = {
       	N_("git maintenance run [<options>]"),
     -@@ builtin/gc.c: static int maintenance_task_commit_graph(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_task_commit_graph(void)
       	return 1;
       }
       
     -+static int fetch_remote(struct repository *r, const char *remote)
     ++static int fetch_remote(const char *remote)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
      +	struct strbuf refmap = STRBUF_INIT;
      +
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "fetch", remote, "--prune",
     ++	argv_array_pushl(&cmd, "fetch", remote, "--prune",
      +			 "--no-tags", "--refmap=", NULL);
      +
     -+	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
     ++	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
      +	argv_array_push(&cmd, refmap.buf);
      +
      +	if (opts.quiet)
     @@ builtin/gc.c: static int maintenance_task_commit_graph(struct repository *r)
      +	return 0;
      +}
      +
     -+static int maintenance_task_fetch(struct repository *r)
     ++static int maintenance_task_prefetch(void)
      +{
      +	int result = 0;
      +	struct string_list_item *item;
     @@ builtin/gc.c: static int maintenance_task_commit_graph(struct repository *r)
      +	for (item = remotes.items;
      +	     item && item < remotes.items + remotes.nr;
      +	     item++)
     -+		fetch_remote(r, item->string);
     ++		fetch_remote(item->string);
      +
      +cleanup:
      +	string_list_clear(&remotes, 0);
      +	return result;
      +}
      +
     - static int maintenance_task_gc(struct repository *r)
     + static int maintenance_task_gc(void)
       {
       	int result;
      @@ builtin/gc.c: static void initialize_tasks(void)
       	for (i = 0; i < MAX_NUM_TASKS; i++)
       		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
       
     -+	tasks[num_tasks]->name = "fetch";
     -+	tasks[num_tasks]->fn = maintenance_task_fetch;
     ++	tasks[num_tasks]->name = "prefetch";
     ++	tasks[num_tasks]->fn = maintenance_task_prefetch;
      +	num_tasks++;
      +
       	tasks[num_tasks]->name = "gc";
     @@ t/t7900-maintenance.sh: test_expect_success 'run --task duplicate' '
       	test_i18ngrep "cannot be selected multiple times" err
       '
       
     -+test_expect_success 'run --task=fetch with no remotes' '
     -+	git maintenance run --task=fetch 2>err &&
     ++test_expect_success 'run --task=prefetch with no remotes' '
     ++	git maintenance run --task=prefetch 2>err &&
      +	test_must_be_empty err
      +'
      +
     -+test_expect_success 'fetch multiple remotes' '
     ++test_expect_success 'prefetch multiple remotes' '
      +	git clone . clone1 &&
      +	git clone . clone2 &&
      +	git remote add remote1 "file://$(pwd)/clone1" &&
     @@ t/t7900-maintenance.sh: test_expect_success 'run --task duplicate' '
      +	git -C clone2 switch -c two &&
      +	test_commit -C clone1 one &&
      +	test_commit -C clone2 two &&
     -+	GIT_TRACE2_EVENT="$(pwd)/run-fetch.txt" git maintenance run --task=fetch &&
     -+	grep ",\"fetch\",\"remote1\"" run-fetch.txt &&
     -+	grep ",\"fetch\",\"remote2\"" run-fetch.txt &&
     ++	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch &&
     ++	grep ",\"fetch\",\"remote1\"" run-prefetch.txt &&
     ++	grep ",\"fetch\",\"remote2\"" run-prefetch.txt &&
      +	test_path_is_missing .git/refs/remotes &&
     -+	test_cmp clone1/.git/refs/heads/one .git/refs/hidden/remote1/one &&
     -+	test_cmp clone2/.git/refs/heads/two .git/refs/hidden/remote2/two &&
     -+	git log hidden/remote1/one &&
     -+	git log hidden/remote2/two
     ++	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
     ++	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
     ++	git log prefetch/remote1/one &&
     ++	git log prefetch/remote2/two
      +'
      +
       test_done
 13:  66a1f662ce !  9:  83648f4865 maintenance: add loose-objects task
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
       
       static const char * const builtin_maintenance_usage[] = {
       	N_("git maintenance run [<options>]"),
     -@@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_task_gc(void)
       	return result;
       }
       
     -+
     -+static int prune_packed(struct repository *r)
     ++static int prune_packed(void)
      +{
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "-C", r->worktree, "prune-packed", NULL);
     ++	argv_array_pushl(&cmd, "prune-packed", NULL);
      +
      +	if (opts.quiet)
      +		argv_array_push(&cmd, "--quiet");
     @@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
      +	return ++(d->count) > d->batch_size;
      +}
      +
     -+static int pack_loose(struct repository *r)
     ++static int pack_loose(void)
      +{
     ++	struct repository *r = the_repository;
      +	int result = 0;
      +	struct write_loose_object_data data;
      +	struct strbuf prefix = STRBUF_INIT;
     @@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
      +	strbuf_addstr(&prefix, r->objects->odb->path);
      +	strbuf_addstr(&prefix, "/pack/loose");
      +
     -+	argv_array_pushl(&pack_proc->args, "git", "-C", r->worktree,
     -+			 "pack-objects", NULL);
     ++	argv_array_pushl(&pack_proc->args, "git", "pack-objects", NULL);
      +	if (opts.quiet)
      +		argv_array_push(&pack_proc->args, "--quiet");
      +	argv_array_push(&pack_proc->args, prefix.buf);
     @@ builtin/gc.c: static int maintenance_task_gc(struct repository *r)
      +	return result;
      +}
      +
     -+static int maintenance_task_loose_objects(struct repository *r)
     ++static int maintenance_task_loose_objects(void)
      +{
     -+	return prune_packed(r) || pack_loose(r);
     ++	return prune_packed() || pack_loose();
      +}
      +
     - typedef int maintenance_task_fn(struct repository *r);
     + typedef int maintenance_task_fn(void);
       
       struct maintenance_task {
      @@ builtin/gc.c: static void initialize_tasks(void)
     - 	tasks[num_tasks]->fn = maintenance_task_fetch;
     + 	tasks[num_tasks]->fn = maintenance_task_prefetch;
       	num_tasks++;
       
      +	tasks[num_tasks]->name = "loose-objects";
     @@ builtin/gc.c: static void initialize_tasks(void)
       	tasks[num_tasks]->enabled = 1;
      
       ## t/t7900-maintenance.sh ##
     -@@ t/t7900-maintenance.sh: test_expect_success 'fetch multiple remotes' '
     - 	git log hidden/remote2/two
     +@@ t/t7900-maintenance.sh: test_expect_success 'prefetch multiple remotes' '
     + 	git log prefetch/remote2/two
       '
       
      +test_expect_success 'loose-objects task' '
 14:  f98790024f ! 10:  b6328c2106 maintenance: add pack-files task
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    maintenance: add pack-files task
     +    maintenance: add incremental-repack task
      
          The previous change cleaned up loose objects using the
          'loose-objects' that can be run safely in the background. Add a
     @@ Commit message
          2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
          (midx: implement midx_repack(), 2019-06-10).
      
     -    The 'pack-files' job runs the following steps:
     +    The 'incremental-repack' task runs the following steps:
      
          1. 'git multi-pack-index write' creates a multi-pack-index file if
             one did not exist, and otherwise will update the multi-pack-index
     @@ Commit message
             intention is that the resulting pack-file will be close in size
             to the provided batch size.
      
     -       The next run of the pack-files job will delete these repacked
     -       pack-files during the 'expire' step.
     +       The next run of the incremental-repack task will delete these
     +       repacked pack-files during the 'expire' step.
      
             In this version, the batch size is set to "0" which ignores the
             size restrictions when selecting the pack-files. It instead
     @@ Documentation/git-maintenance.txt: loose-objects::
       	thousand objects to prevent the job from taking too long on a
       	repository with many loose objects.
       
     -+pack-files::
     -+	The `pack-files` job incrementally repacks the object directory
     ++incremental-repack::
     ++	The `incremental-repack` job repacks the object directory
      +	using the `multi-pack-index` feature. In order to prevent race
      +	conditions with concurrent Git commands, it follows a two-step
      +	process. First, it deletes any pack-files included in the
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
       
       static const char * const builtin_maintenance_usage[] = {
       	N_("git maintenance run [<options>]"),
     -@@ builtin/gc.c: static int maintenance_task_loose_objects(struct repository *r)
     - 	return prune_packed(r) || pack_loose(r);
     +@@ builtin/gc.c: static int maintenance_task_loose_objects(void)
     + 	return prune_packed() || pack_loose();
       }
       
     -+static int multi_pack_index_write(struct repository *r)
     ++static int multi_pack_index_write(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "multi-pack-index", "write", NULL);
     ++	argv_array_pushl(&cmd, "multi-pack-index", "write", NULL);
      +
      +	if (opts.quiet)
      +		argv_array_push(&cmd, "--no-progress");
     @@ builtin/gc.c: static int maintenance_task_loose_objects(struct repository *r)
      +	return result;
      +}
      +
     -+static int rewrite_multi_pack_index(struct repository *r)
     ++static int rewrite_multi_pack_index(void)
      +{
     ++	struct repository *r = the_repository;
      +	char *midx_name = get_midx_filename(r->objects->odb->path);
      +
      +	unlink(midx_name);
      +	free(midx_name);
      +
     -+	if (multi_pack_index_write(r)) {
     ++	if (multi_pack_index_write()) {
      +		error(_("failed to rewrite multi-pack-index"));
      +		return 1;
      +	}
     @@ builtin/gc.c: static int maintenance_task_loose_objects(struct repository *r)
      +	return 0;
      +}
      +
     -+static int multi_pack_index_verify(struct repository *r)
     ++static int multi_pack_index_verify(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "multi-pack-index", "verify", NULL);
     ++	argv_array_pushl(&cmd, "multi-pack-index", "verify", NULL);
      +
      +	if (opts.quiet)
      +		argv_array_push(&cmd, "--no-progress");
     @@ builtin/gc.c: static int maintenance_task_loose_objects(struct repository *r)
      +	return result;
      +}
      +
     -+static int multi_pack_index_expire(struct repository *r)
     ++static int multi_pack_index_expire(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "multi-pack-index", "expire", NULL);
     ++	argv_array_pushl(&cmd, "multi-pack-index", "expire", NULL);
      +
      +	if (opts.quiet)
      +		argv_array_push(&cmd, "--no-progress");
      +
     -+	close_object_store(r->objects);
     ++	close_object_store(the_repository->objects);
      +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +	argv_array_clear(&cmd);
      +
      +	return result;
      +}
      +
     -+static int multi_pack_index_repack(struct repository *r)
     ++static int multi_pack_index_repack(void)
      +{
      +	int result;
      +	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "-C", r->worktree,
     -+			 "multi-pack-index", "repack", NULL);
     ++	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
      +
      +	if (opts.quiet)
      +		argv_array_push(&cmd, "--no-progress");
      +
      +	argv_array_push(&cmd, "--batch-size=0");
      +
     -+	close_object_store(r->objects);
     ++	close_object_store(the_repository->objects);
      +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +
     -+	if (result && multi_pack_index_verify(r)) {
     ++	if (result && multi_pack_index_verify()) {
      +		warning(_("multi-pack-index verify failed after repack"));
     -+		result = rewrite_multi_pack_index(r);
     ++		result = rewrite_multi_pack_index();
      +	}
      +
      +	return result;
      +}
      +
     -+static int maintenance_task_pack_files(struct repository *r)
     ++static int maintenance_task_incremental_repack(void)
      +{
     -+	if (multi_pack_index_write(r)) {
     ++	if (multi_pack_index_write()) {
      +		error(_("failed to write multi-pack-index"));
      +		return 1;
      +	}
      +
     -+	if (multi_pack_index_verify(r)) {
     ++	if (multi_pack_index_verify()) {
      +		warning(_("multi-pack-index verify failed after initial write"));
     -+		return rewrite_multi_pack_index(r);
     ++		return rewrite_multi_pack_index();
      +	}
      +
     -+	if (multi_pack_index_expire(r)) {
     ++	if (multi_pack_index_expire()) {
      +		error(_("multi-pack-index expire failed"));
      +		return 1;
      +	}
      +
     -+	if (multi_pack_index_verify(r)) {
     ++	if (multi_pack_index_verify()) {
      +		warning(_("multi-pack-index verify failed after expire"));
     -+		return rewrite_multi_pack_index(r);
     ++		return rewrite_multi_pack_index();
      +	}
      +
     -+	if (multi_pack_index_repack(r)) {
     ++	if (multi_pack_index_repack()) {
      +		error(_("multi-pack-index repack failed"));
      +		return 1;
      +	}
     @@ builtin/gc.c: static int maintenance_task_loose_objects(struct repository *r)
      +	return 0;
      +}
      +
     - typedef int maintenance_task_fn(struct repository *r);
     + typedef int maintenance_task_fn(void);
       
       struct maintenance_task {
      @@ builtin/gc.c: static void initialize_tasks(void)
       	tasks[num_tasks]->fn = maintenance_task_loose_objects;
       	num_tasks++;
       
     -+	tasks[num_tasks]->name = "pack-files";
     -+	tasks[num_tasks]->fn = maintenance_task_pack_files;
     ++	tasks[num_tasks]->name = "incremental-repack";
     ++	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
      +	num_tasks++;
      +
       	tasks[num_tasks]->name = "gc";
     @@ t/t7900-maintenance.sh: test_expect_success 'loose-objects task' '
       	test_cmp packs-between packs-after
       '
       
     -+test_expect_success 'pack-files task' '
     ++test_expect_success 'incremental-repack task' '
      +	packDir=.git/objects/pack &&
      +	for i in $(test_seq 1 5)
      +	do
     @@ t/t7900-maintenance.sh: test_expect_success 'loose-objects task' '
      +
      +	# the job repacks the two into a new pack, but does not
      +	# delete the old ones.
     -+	git maintenance run --task=pack-files &&
     ++	git maintenance run --task=incremental-repack &&
      +	ls $packDir/*.pack >packs-between &&
      +	test_line_count = 4 packs-between &&
      +
      +	# the job deletes the two old packs, and does not write
      +	# a new one because only one pack remains.
     -+	git maintenance run --task=pack-files &&
     ++	git maintenance run --task=incremental-repack &&
      +	ls .git/objects/pack/*.pack >packs-after &&
      +	test_line_count = 1 packs-after
      +'
 15:  8be89707d2 ! 11:  478c7f1d0b maintenance: auto-size pack-files batch
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    maintenance: auto-size pack-files batch
     +    maintenance: auto-size incremental-repack batch
      
     -    When repacking during the 'pack-files' job, we use the --batch-size
     -    option in 'git multi-pack-index repack'. The initial setting used
     -    --batch-size=0 to repack everything into a single pack-file. This is not
     -    sustainable for a large repository. The amount of work required is also
     -    likely to use too many system resources for a background job.
     +    When repacking during the 'incremental-repack' task, we use the
     +    --batch-size option in 'git multi-pack-index repack'. The initial setting
     +    used --batch-size=0 to repack everything into a single pack-file. This is
     +    not sustainable for a large repository. The amount of work required is
     +    also likely to use too many system resources for a background job.
      
     -    Update the 'pack-files' maintenance task by dynamically computing a
     +    Update the 'incremental-repack' task by dynamically computing a
          --batch-size option based on the current pack-file structure.
      
          The dynamic default size is computed with this idea in mind for a client
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: static int multi_pack_index_expire(struct repository *r)
     +@@ builtin/gc.c: static int multi_pack_index_expire(void)
       	return result;
       }
       
      +#define TWO_GIGABYTES (2147483647)
      +#define UNSET_BATCH_SIZE ((unsigned long)-1)
      +
     -+static off_t get_auto_pack_size(struct repository *r)
     ++static off_t get_auto_pack_size(void)
      +{
      +	/*
      +	 * The "auto" value is special: we optimize for
     @@ builtin/gc.c: static int multi_pack_index_expire(struct repository *r)
      +	off_t second_largest_size = 0;
      +	off_t result_size;
      +	struct packed_git *p;
     ++	struct repository *r = the_repository;
      +
      +	reprepare_packed_git(r);
      +	for (p = get_all_packs(r); p; p = p->next) {
     @@ builtin/gc.c: static int multi_pack_index_expire(struct repository *r)
      +	return result_size;
      +}
      +
     - static int multi_pack_index_repack(struct repository *r)
     + static int multi_pack_index_repack(void)
       {
       	int result;
       	struct argv_array cmd = ARGV_ARRAY_INIT;
      +	struct strbuf batch_arg = STRBUF_INIT;
      +
     - 	argv_array_pushl(&cmd, "-C", r->worktree,
     - 			 "multi-pack-index", "repack", NULL);
     + 	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
       
       	if (opts.quiet)
       		argv_array_push(&cmd, "--no-progress");
       
      -	argv_array_push(&cmd, "--batch-size=0");
      +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
     -+			    (uintmax_t)get_auto_pack_size(r));
     ++		    (uintmax_t)get_auto_pack_size());
      +	argv_array_push(&cmd, batch_arg.buf);
       
     - 	close_object_store(r->objects);
     + 	close_object_store(the_repository->objects);
       	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +	strbuf_release(&batch_arg);
       
     - 	if (result && multi_pack_index_verify(r)) {
     + 	if (result && multi_pack_index_verify()) {
       		warning(_("multi-pack-index verify failed after repack"));
      
       ## t/t7900-maintenance.sh ##
     -@@ t/t7900-maintenance.sh: test_expect_success 'pack-files task' '
     +@@ t/t7900-maintenance.sh: test_expect_success 'incremental-repack task' '
       	test_line_count = 4 packs-between &&
       
       	# the job deletes the two old packs, and does not write
      -	# a new one because only one pack remains.
      +	# a new one because the batch size is not high enough to
      +	# pack the largest pack-file.
     - 	git maintenance run --task=pack-files &&
     + 	git maintenance run --task=incremental-repack &&
       	ls .git/objects/pack/*.pack >packs-after &&
      -	test_line_count = 1 packs-after
      +	test_line_count = 2 packs-after
 16:  1551aec4fd ! 12:  a3c64930a0 maintenance: create maintenance.<task>.enabled config
     @@ Documentation/git-maintenance.txt: SUBCOMMANDS
       -----
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: static int maintenance_run(struct repository *r)
     - 	return result;
     - }
     - 
     --static void initialize_tasks(void)
     -+static void initialize_tasks(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_run(void)
     + static void initialize_tasks(void)
       {
       	int i;
      +	struct strbuf config_name = STRBUF_INIT;
     @@ builtin/gc.c: static int maintenance_run(struct repository *r)
       
       	for (i = 0; i < MAX_NUM_TASKS; i++)
      @@ builtin/gc.c: static void initialize_tasks(void)
     - 	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
     - 
     - 	for (i = 0; i < num_tasks; i++) {
     -+		int config_value;
     + 	tasks[num_tasks]->name = "commit-graph";
     + 	tasks[num_tasks]->fn = maintenance_task_commit_graph;
     + 	num_tasks++;
      +
     - 		hashmap_entry_init(&tasks[i]->ent,
     - 				   strihash(tasks[i]->name));
     - 		hashmap_add(&task_map, &tasks[i]->ent);
     ++	for (i = 0; i < num_tasks; i++) {
     ++		int config_value;
      +
      +		strbuf_setlen(&config_name, 0);
      +		strbuf_addf(&config_name, "maintenance.%s.enabled", tasks[i]->name);
      +
     -+		if (!repo_config_get_bool(r, config_name.buf, &config_value))
     ++		if (!git_config_get_bool(config_name.buf, &config_value))
      +			tasks[i]->enabled = config_value;
     - 	}
     ++	}
      +
      +	strbuf_release(&config_name);
       }
       
       static int task_option_parse(const struct option *opt,
     -@@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
     - 				   builtin_maintenance_options);
     - 
     - 	opts.quiet = !isatty(2);
     --	initialize_tasks();
     -+	initialize_tasks(r);
     - 
     - 	argc = parse_options(argc, argv, prefix,
     - 			     builtin_maintenance_options,
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'run [--auto|--quiet]' '
 17:  130130b662 ! 13:  dbacc2b76c maintenance: use pointers to check --auto
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: static int maintenance_task_pack_files(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_task_incremental_repack(void)
       
     - typedef int maintenance_task_fn(struct repository *r);
     + typedef int maintenance_task_fn(void);
       
      +/*
      + * An auto condition function returns 1 if the task should run
      + * and 0 if the task should NOT run. See needs_to_gc() for an
      + * example.
      + */
     -+typedef int maintenance_auto_fn(struct repository *r);
     ++typedef int maintenance_auto_fn(void);
      +
       struct maintenance_task {
     - 	struct hashmap_entry ent;
       	const char *name;
       	maintenance_task_fn *fn;
      +	maintenance_auto_fn *auto_condition;
       	int task_order;
       	unsigned enabled:1,
       		 selected:1;
     -@@ builtin/gc.c: static int maintenance_run(struct repository *r)
     +@@ builtin/gc.c: static int maintenance_run(void)
       		if (!opts.tasks_selected && !tasks[i]->enabled)
       			continue;
       
      +		if (opts.auto_flag &&
      +		    (!tasks[i]->auto_condition ||
     -+		     !tasks[i]->auto_condition(r)))
     ++		     !tasks[i]->auto_condition()))
      +			continue;
      +
     - 		result = tasks[i]->fn(r);
     + 		result = tasks[i]->fn();
       	}
       
     -@@ builtin/gc.c: static void initialize_tasks(struct repository *r)
     +@@ builtin/gc.c: static void initialize_tasks(void)
       
       	tasks[num_tasks]->name = "gc";
       	tasks[num_tasks]->fn = maintenance_task_gc;
     @@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefi
       				   builtin_maintenance_options);
       
       	opts.quiet = !isatty(2);
     -+	gc_config(r);
     - 	initialize_tasks(r);
     ++	gc_config();
     + 	initialize_tasks();
       
       	argc = parse_options(argc, argv, prefix,
      
 18:  37fa3f6157 ! 14:  9af2309f08 maintenance: add auto condition for commit-graph task
     @@ builtin/gc.c
       
       #define FAILED_RUN "failed to run %s"
       
     -@@ builtin/gc.c: struct maintenance_opts {
     +@@ builtin/gc.c: static struct maintenance_opts {
       	int tasks_selected;
       } opts;
       
     @@ builtin/gc.c: struct maintenance_opts {
      +	return result;
      +}
      +
     -+static int should_write_commit_graph(struct repository *r)
     ++static int should_write_commit_graph(void)
      +{
      +	int result;
      +
     -+	repo_config_get_int(r, "maintenance.commit-graph.auto",
     -+			    &limit_commits_not_in_graph);
     ++	git_config_get_int("maintenance.commit-graph.auto",
     ++			   &limit_commits_not_in_graph);
      +
      +	if (!limit_commits_not_in_graph)
      +		return 0;
     @@ builtin/gc.c: struct maintenance_opts {
      +	return result;
      +}
      +
     - static int run_write_commit_graph(struct repository *r)
     + static int run_write_commit_graph(void)
       {
       	int result;
     -@@ builtin/gc.c: static void initialize_tasks(struct repository *r)
     +@@ builtin/gc.c: static void initialize_tasks(void)
       
       	tasks[num_tasks]->name = "commit-graph";
       	tasks[num_tasks]->fn = maintenance_task_commit_graph;
      +	tasks[num_tasks]->auto_condition = should_write_commit_graph;
       	num_tasks++;
       
     - 	hashmap_init(&task_map, task_entry_cmp, NULL, MAX_NUM_TASKS);
     + 	for (i = 0; i < num_tasks; i++) {
      
       ## object.h ##
      @@ object.h: struct object_array {
 19:  4744fdaae9 ! 15:  42e316ca58 maintenance: create auto condition for loose-objects
     @@ builtin/gc.c: struct write_loose_object_data {
      +	return 0;
      +}
      +
     -+static int loose_object_auto_condition(struct repository *r)
     ++static int loose_object_auto_condition(void)
      +{
      +	int count = 0;
      +
     -+	repo_config_get_int(r, "maintenance.loose-objects.auto",
     -+			    &loose_object_auto_limit);
     ++	git_config_get_int("maintenance.loose-objects.auto",
     ++			   &loose_object_auto_limit);
      +
      +	if (!loose_object_auto_limit)
      +		return 0;
      +	if (loose_object_auto_limit < 0)
      +		return 1;
      +
     -+	return for_each_loose_file_in_objdir(r->objects->odb->path,
     ++	return for_each_loose_file_in_objdir(the_repository->objects->odb->path,
      +					     loose_object_count,
      +					     NULL, NULL, &count);
      +}
     @@ builtin/gc.c: struct write_loose_object_data {
       static int loose_object_exists(const struct object_id *oid,
       			       const char *path,
       			       void *data)
     -@@ builtin/gc.c: static void initialize_tasks(struct repository *r)
     +@@ builtin/gc.c: static void initialize_tasks(void)
       
       	tasks[num_tasks]->name = "loose-objects";
       	tasks[num_tasks]->fn = maintenance_task_loose_objects;
      +	tasks[num_tasks]->auto_condition = loose_object_auto_condition;
       	num_tasks++;
       
     - 	tasks[num_tasks]->name = "pack-files";
     + 	tasks[num_tasks]->name = "incremental-repack";
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'loose-objects task' '
     @@ t/t7900-maintenance.sh: test_expect_success 'loose-objects task' '
      +	done
      +'
      +
     - test_expect_success 'pack-files task' '
     + test_expect_success 'incremental-repack task' '
       	packDir=.git/objects/pack &&
       	for i in $(test_seq 1 5)
 20:  fbe03b9af9 ! 16:  3d527cb0dd maintenance: add pack-files auto condition
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    maintenance: add pack-files auto condition
     +    maintenance: add incremental-repack auto condition
      
     -    The pack-files task updates the multi-pack-index by deleting pack-files
     -    that have been replaced with new packs, then repacking a batch of small
     -    pack-files into a larger pack-file. This incremental repack is faster
     +    The incremental-repack task updates the multi-pack-index by deleting pack-
     +    files that have been replaced with new packs, then repacking a batch of
     +    small pack-files into a larger pack-file. This incremental repack is faster
          than rewriting all object data, but is slower than some other
          maintenance activities.
      
     -    The 'maintenance.pack-files.auto' config option specifies how many
     +    The 'maintenance.incremental-repack.auto' config option specifies how many
          pack-files should exist outside of the multi-pack-index before running
          the step. These pack-files could be created by 'git fetch' commands or
          by the loose-objects task. The default value is 10.
     @@ Documentation/config/maintenance.txt: maintenance.loose-objects.auto::
       	loose objects is at least the value of `maintenance.loose-objects.auto`.
       	The default value is 100.
      +
     -+maintenance.pack-files.auto::
     -+	This integer config option controls how often the `pack-files` task
     -+	should be run as part of `git maintenance run --auto`. If zero, then
     -+	the `pack-files` task will not run with the `--auto` option. A
     -+	negative value will force the task to run every time. Otherwise, a
     -+	positive value implies the command should run when the number of
     -+	pack-files not in the multi-pack-index is at least the value of
     -+	`maintenance.pack-files.auto`. The default value is 10.
     ++maintenance.incremental-repack.auto::
     ++	This integer config option controls how often the `incremental-repack`
     ++	task should be run as part of `git maintenance run --auto`. If zero,
     ++	then the `incremental-repack` task will not run with the `--auto`
     ++	option. A negative value will force the task to run every time.
     ++	Otherwise, a positive value implies the command should run when the
     ++	number of pack-files not in the multi-pack-index is at least the value
     ++	of `maintenance.incremental-repack.auto`. The default value is 10.
      
       ## builtin/gc.c ##
      @@
     @@ builtin/gc.c
       
       #define FAILED_RUN "failed to run %s"
       
     -@@ builtin/gc.c: static int maintenance_task_loose_objects(struct repository *r)
     - 	return prune_packed(r) || pack_loose(r);
     +@@ builtin/gc.c: static int maintenance_task_loose_objects(void)
     + 	return prune_packed() || pack_loose();
       }
       
     -+static int pack_files_auto_condition(struct repository *r)
     ++static int incremental_repack_auto_condition(void)
      +{
      +	struct packed_git *p;
      +	int enabled;
     -+	int pack_files_auto_limit = 10;
     ++	int incremental_repack_auto_limit = 10;
      +	int count = 0;
      +
     -+	if (repo_config_get_bool(r, "core.multiPackIndex", &enabled) ||
     ++	if (git_config_get_bool("core.multiPackIndex", &enabled) ||
      +	    !enabled)
      +		return 0;
      +
     -+	repo_config_get_int(r, "maintenance.pack-files.auto",
     -+			    &pack_files_auto_limit);
     ++	git_config_get_int("maintenance.incremental-repack.auto",
     ++			   &incremental_repack_auto_limit);
      +
     -+	if (!pack_files_auto_limit)
     ++	if (!incremental_repack_auto_limit)
      +		return 0;
     -+	if (pack_files_auto_limit < 0)
     ++	if (incremental_repack_auto_limit < 0)
      +		return 1;
      +
     -+	for (p = get_packed_git(r);
     -+	     count < pack_files_auto_limit && p;
     ++	for (p = get_packed_git(the_repository);
     ++	     count < incremental_repack_auto_limit && p;
      +	     p = p->next) {
      +		if (!p->multi_pack_index)
      +			count++;
      +	}
      +
     -+	return count >= pack_files_auto_limit;
     ++	return count >= incremental_repack_auto_limit;
      +}
      +
     - static int multi_pack_index_write(struct repository *r)
     + static int multi_pack_index_write(void)
       {
       	int result;
     -@@ builtin/gc.c: static void initialize_tasks(struct repository *r)
     +@@ builtin/gc.c: static void initialize_tasks(void)
       
     - 	tasks[num_tasks]->name = "pack-files";
     - 	tasks[num_tasks]->fn = maintenance_task_pack_files;
     -+	tasks[num_tasks]->auto_condition = pack_files_auto_condition;
     + 	tasks[num_tasks]->name = "incremental-repack";
     + 	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
     ++	tasks[num_tasks]->auto_condition = incremental_repack_auto_condition;
       	num_tasks++;
       
       	tasks[num_tasks]->name = "gc";
      
       ## t/t7900-maintenance.sh ##
     -@@ t/t7900-maintenance.sh: test_expect_success 'pack-files task' '
     +@@ t/t7900-maintenance.sh: test_expect_success 'incremental-repack task' '
       	test_line_count = 2 packs-after
       '
       
     -+test_expect_success 'maintenance.pack-files.auto' '
     ++test_expect_success 'maintenance.incremental-repack.auto' '
      +	git repack -adk &&
      +	git config core.multiPackIndex true &&
      +	git multi-pack-index write &&
     -+	GIT_TRACE2_EVENT=1 git -c maintenance.pack-files.auto=1 maintenance \
     -+		run --auto --task=pack-files >out &&
     ++	GIT_TRACE2_EVENT=1 git -c maintenance.incremental-repack.auto=1 \
     ++		maintenance run --auto --task=incremental-repack >out &&
      +	! grep "\"multi-pack-index\"" out &&
      +	for i in 1 2
      +	do
     @@ t/t7900-maintenance.sh: test_expect_success 'pack-files task' '
      +		^HEAD~1
      +		EOF
      +		GIT_TRACE2_EVENT=$(pwd)/trace-A-$i git \
     -+			-c maintenance.pack-files.auto=2 \
     -+			maintenance run --auto --task=pack-files &&
     ++			-c maintenance.incremental-repack.auto=2 \
     ++			maintenance run --auto --task=incremental-repack &&
      +		! grep "\"multi-pack-index\"" trace-A-$i &&
      +		test_commit B-$i &&
      +		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
     @@ t/t7900-maintenance.sh: test_expect_success 'pack-files task' '
      +		^HEAD~1
      +		EOF
      +		GIT_TRACE2_EVENT=$(pwd)/trace-B-$i git \
     -+			-c maintenance.pack-files.auto=2 \
     -+			maintenance run --auto --task=pack-files >out &&
     ++			-c maintenance.incremental-repack.auto=2 \
     ++			maintenance run --auto --task=incremental-repack >out &&
      +		grep "\"multi-pack-index\"" trace-B-$i >/dev/null || return 1
      +	done
      +'
 21:  f7fdf72e9d = 17:  a0f00f8ab8 midx: use start_delayed_progress()
  -:  ---------- > 18:  f24db7739f maintenance: add trace2 regions for task execution

-- 
gitgitgadget

* [PATCH v2 01/18] maintenance: create basic maintenance runner
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-25  1:26     ` Taylor Blau
                       ` (2 more replies)
  2020-07-23 17:56   ` [PATCH v2 02/18] maintenance: add --quiet option Derrick Stolee via GitGitGadget
                     ` (18 subsequent siblings)
  19 siblings, 3 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'gc' builtin is our current entrypoint for automatically maintaining
a repository. This one tool does many operations, such as repacking the
repository, packing refs, and rewriting the commit-graph file. The name
implies it performs "garbage collection" which means several different
things, and some users may not want to use this operation that rewrites
the entire object database.

Create a new 'maintenance' builtin that will become a more general-
purpose command. To start, it will only support the 'run' subcommand,
but will later expand to add subcommands for scheduling maintenance in
the background.

For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin.
In fact, the only option is the '--auto' toggle, which is handed
directly to the 'gc' builtin. The current change is isolated to this
simple operation to prevent more interesting logic from being lost in
all of the boilerplate of adding a new builtin.

Use the existing builtin/gc.c file because we want to share code between the
two builtins. It is possible that we will have 'maintenance' replace the
'gc' builtin entirely at some point, leaving 'git gc' as an alias for
some specific arguments to 'git maintenance run'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                        |  1 +
 Documentation/git-maintenance.txt | 57 +++++++++++++++++++++++++++++
 builtin.h                         |  1 +
 builtin/gc.c                      | 59 +++++++++++++++++++++++++++++++
 git.c                             |  1 +
 t/t7900-maintenance.sh            | 22 ++++++++++++
 6 files changed, 141 insertions(+)
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh

diff --git a/.gitignore b/.gitignore
index ee509a2ad2..a5808fa30d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -90,6 +90,7 @@
 /git-ls-tree
 /git-mailinfo
 /git-mailsplit
+/git-maintenance
 /git-merge
 /git-merge-base
 /git-merge-index
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
new file mode 100644
index 0000000000..34cd2b4417
--- /dev/null
+++ b/Documentation/git-maintenance.txt
@@ -0,0 +1,57 @@
+git-maintenance(1)
+==================
+
+NAME
+----
+git-maintenance - Run tasks to optimize Git repository data
+
+
+SYNOPSIS
+--------
+[verse]
+'git maintenance' run [<options>]
+
+
+DESCRIPTION
+-----------
+Run tasks to optimize Git repository data, speeding up other Git commands
+and reducing storage requirements for the repository.
++
+Git commands that add repository data, such as `git add` or `git fetch`,
+are optimized for a responsive user experience. These commands do not take
+time to optimize the Git data, since such optimizations scale with the full
+size of the repository while these user commands each perform a relatively
+small action.
++
+The `git maintenance` command provides flexibility for how to optimize the
+Git repository.
+
+SUBCOMMANDS
+-----------
+
+run::
+	Run one or more maintenance tasks.
+
+TASKS
+-----
+
+gc::
+	Cleanup unnecessary files and optimize the local repository. "GC"
+	stands for "garbage collection," but this task performs many
+	smaller tasks. This task can be rather expensive for large
+	repositories, as it repacks all Git objects into a single pack-file.
+	It can also be disruptive in some situations, as it deletes stale
+	data.
+
+OPTIONS
+-------
+--auto::
+	When combined with the `run` subcommand, run maintenance tasks
+	only if certain thresholds are met. For example, the `gc` task
+	runs when the number of loose objects exceeds the number stored
+	in the `gc.auto` config setting, or when the number of pack-files
+	exceeds the `gc.autoPackLimit` config setting.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/builtin.h b/builtin.h
index a5ae15bfe5..17c1c0ce49 100644
--- a/builtin.h
+++ b/builtin.h
@@ -167,6 +167,7 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix);
 int cmd_ls_remote(int argc, const char **argv, const char *prefix);
 int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 int cmd_mailsplit(int argc, const char **argv, const char *prefix);
+int cmd_maintenance(int argc, const char **argv, const char *prefix);
 int cmd_merge(int argc, const char **argv, const char *prefix);
 int cmd_merge_base(int argc, const char **argv, const char *prefix);
 int cmd_merge_index(int argc, const char **argv, const char *prefix);
diff --git a/builtin/gc.c b/builtin/gc.c
index 8e0b9cf41b..8d73c77f3a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	return 0;
 }
+
+static const char * const builtin_maintenance_usage[] = {
+	N_("git maintenance run [<options>]"),
+	NULL
+};
+
+static struct maintenance_opts {
+	int auto_flag;
+} opts;
+
+static int maintenance_task_gc(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "gc", NULL);
+
+	if (opts.auto_flag)
+		argv_array_pushl(&cmd, "--auto", NULL);
+
+	close_object_store(the_repository->objects);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int maintenance_run(void)
+{
+	return maintenance_task_gc();
+}
+
+int cmd_maintenance(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_maintenance_options[] = {
+		OPT_BOOL(0, "auto", &opts.auto_flag,
+			 N_("run tasks based on the state of the repository")),
+		OPT_END()
+	};
+
+	memset(&opts, 0, sizeof(opts));
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_maintenance_usage,
+				   builtin_maintenance_options);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_maintenance_options,
+			     builtin_maintenance_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+
+	if (argc == 1) {
+		if (!strcmp(argv[0], "run"))
+			return maintenance_run();
+	}
+
+	usage_with_options(builtin_maintenance_usage,
+			   builtin_maintenance_options);
+}
diff --git a/git.c b/git.c
index 2f021b97f3..ff56d1df24 100644
--- a/git.c
+++ b/git.c
@@ -527,6 +527,7 @@ static struct cmd_struct commands[] = {
 	{ "ls-tree", cmd_ls_tree, RUN_SETUP },
 	{ "mailinfo", cmd_mailinfo, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "mailsplit", cmd_mailsplit, NO_PARSEOPT },
+	{ "maintenance", cmd_maintenance, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "merge", cmd_merge, RUN_SETUP | NEED_WORK_TREE },
 	{ "merge-base", cmd_merge_base, RUN_SETUP },
 	{ "merge-file", cmd_merge_file, RUN_SETUP_GENTLY },
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
new file mode 100755
index 0000000000..d00641c4dd
--- /dev/null
+++ b/t/t7900-maintenance.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+test_description='git maintenance builtin'
+
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
+
+. ./test-lib.sh
+
+test_expect_success 'help text' '
+	test_must_fail git maintenance -h 2>err &&
+	test_i18ngrep "usage: git maintenance run" err
+'
+
+test_expect_success 'gc [--auto]' '
+	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
+	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
+	grep ",\"gc\"]" run-no-auto.txt  &&
+	grep ",\"gc\",\"--auto\"]" run-auto.txt
+'
+
+test_done
-- 
gitgitgadget


* [PATCH v2 02/18] maintenance: add --quiet option
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 03/18] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
                     ` (17 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Maintenance activities are commonly used as steps in larger scripts.
Providing a '--quiet' option allows those scripts to be less noisy when
run on a terminal window. Turn this mode on by default when stderr is
not a terminal.

Pass the option through to the 'git gc' child process.
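
The default described above can be sketched in shell as follows. This is
only an illustration, not the implementation: resolve_quiet is a
hypothetical name, and `[ -t 2 ]` stands in for the isatty(2) check in
the C code. An explicit --quiet or --no-quiet overrides the default.

```shell
#!/bin/sh
# Hypothetical helper, not part of Git: print 1 if maintenance should be
# quiet and 0 otherwise.
resolve_quiet () {
	# Default: quiet unless stderr (file descriptor 2) is a terminal.
	if [ -t 2 ]
	then
		quiet=0
	else
		quiet=1
	fi

	# Explicit options override the default; the last one wins.
	for opt in "$@"
	do
		case "$opt" in
		--quiet) quiet=1 ;;
		--no-quiet) quiet=0 ;;
		esac
	done
	echo "$quiet"
}

resolve_quiet 2>/dev/null              # stderr is not a terminal: prints 1
resolve_quiet --no-quiet 2>/dev/null   # explicit override: prints 0
```

Computing the default before looking at the options mirrors the order in
the patch, where opts.quiet is set before parse_options() runs.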

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 3 +++
 builtin/gc.c                      | 7 +++++++
 t/t7900-maintenance.sh            | 8 +++++---
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 34cd2b4417..089fa4cedc 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -52,6 +52,9 @@ OPTIONS
 	in the `gc.auto` config setting, or when the number of pack-files
 	exceeds the `gc.autoPackLimit` config setting.
 
+--quiet::
+	Do not report progress or other information over `stderr`.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/gc.c b/builtin/gc.c
index 8d73c77f3a..c8cde28436 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -707,6 +707,7 @@ static const char * const builtin_maintenance_usage[] = {
 
 static struct maintenance_opts {
 	int auto_flag;
+	int quiet;
 } opts;
 
 static int maintenance_task_gc(void)
@@ -718,6 +719,8 @@ static int maintenance_task_gc(void)
 
 	if (opts.auto_flag)
 		argv_array_pushl(&cmd, "--auto", NULL);
+	if (opts.quiet)
+		argv_array_pushl(&cmd, "--quiet", NULL);
 
 	close_object_store(the_repository->objects);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
@@ -736,6 +739,8 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 	static struct option builtin_maintenance_options[] = {
 		OPT_BOOL(0, "auto", &opts.auto_flag,
 			 N_("run tasks based on the state of the repository")),
+		OPT_BOOL(0, "quiet", &opts.quiet,
+			 N_("do not report progress or other information over stderr")),
 		OPT_END()
 	};
 
@@ -745,6 +750,8 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_maintenance_usage,
 				   builtin_maintenance_options);
 
+	opts.quiet = !isatty(2);
+
 	argc = parse_options(argc, argv, prefix,
 			     builtin_maintenance_options,
 			     builtin_maintenance_usage,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index d00641c4dd..e4e4036e50 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -12,11 +12,13 @@ test_expect_success 'help text' '
 	test_i18ngrep "usage: git maintenance run" err
 '
 
-test_expect_success 'gc [--auto]' '
-	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
+test_expect_success 'gc [--auto|--quiet]' '
+	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
+	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
 	grep ",\"gc\"]" run-no-auto.txt  &&
-	grep ",\"gc\",\"--auto\"]" run-auto.txt
+	grep ",\"gc\",\"--auto\"" run-auto.txt &&
+	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
 test_done
-- 
gitgitgadget


* [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 02/18] maintenance: add --quiet option Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 20:21     ` Junio C Hamano
  2020-07-23 17:56   ` [PATCH v2 04/18] maintenance: initialize task array Derrick Stolee via GitGitGadget
                     ` (16 subsequent siblings)
  19 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The run_auto_gc() method is used in several places to trigger a check
for repo maintenance after some Git commands, such as 'git commit' or
'git fetch'.

To allow for extra customization of this maintenance activity, replace
the 'git gc --auto [--quiet]' call with one to 'git maintenance run
--auto [--quiet]'. As we extend the maintenance builtin with other
steps, users will be able to select different maintenance activities.

Rename run_auto_gc() to run_auto_maintenance() to make it clearer what
happens in this call, and to expose all callers in the current diff.

Since 'git fetch' already allows disabling the 'git gc --auto'
subprocess, add an equivalent option with a different name to be more
descriptive of the new behavior: '--[no-]maintenance'. Update the
documentation to include these options at the same time.
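
As a rough sketch of the behavior described above (the helper names are
hypothetical and nothing here invokes Git), the first function prints
the command line that run_auto_maintenance() spawns, and the second
shows how both spellings of the 'git fetch' option toggle one setting.

```shell
#!/bin/sh
# Hypothetical helper, not part of Git: print the command that
# run_auto_maintenance() would spawn; $1 is 1 for quiet, 0 otherwise.
auto_maintenance_cmd () {
	if [ "$1" = 1 ]
	then
		echo "git maintenance run --auto --quiet"
	else
		echo "git maintenance run --auto --no-quiet"
	fi
}

# Hypothetical helper: decide whether 'git fetch' runs maintenance at the
# end. Both option spellings toggle the same setting; the last one wins.
fetch_runs_maintenance () {
	enabled=1	# enabled by default
	for opt in "$@"
	do
		case "$opt" in
		--maintenance|--auto-gc) enabled=1 ;;
		--no-maintenance|--no-auto-gc) enabled=0 ;;
		esac
	done
	echo "$enabled"
}

auto_maintenance_cmd 0                    # prints: git maintenance run --auto --no-quiet
fetch_runs_maintenance --no-auto-gc       # prints: 0
fetch_runs_maintenance --no-maintenance   # prints: 0
```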

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/fetch-options.txt | 5 +++--
 Documentation/git-clone.txt     | 7 ++++---
 builtin/am.c                    | 2 +-
 builtin/commit.c                | 2 +-
 builtin/fetch.c                 | 6 ++++--
 builtin/merge.c                 | 2 +-
 builtin/rebase.c                | 4 ++--
 run-command.c                   | 7 +++++--
 run-command.h                   | 2 +-
 t/t5510-fetch.sh                | 2 +-
 10 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 6e2a160a47..d73224844e 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -86,9 +86,10 @@ ifndef::git-pull[]
 	Allow several <repository> and <group> arguments to be
 	specified. No <refspec>s may be specified.
 
+--[no-]maintenance::
 --[no-]auto-gc::
-	Run `git gc --auto` at the end to perform garbage collection
-	if needed. This is enabled by default.
+	Run `git maintenance run --auto` at the end to perform garbage
+	collection if needed. This is enabled by default.
 
 --[no-]write-commit-graph::
 	Write a commit-graph after fetching. This overrides the config
diff --git a/Documentation/git-clone.txt b/Documentation/git-clone.txt
index c898310099..aa25aba7d9 100644
--- a/Documentation/git-clone.txt
+++ b/Documentation/git-clone.txt
@@ -78,9 +78,10 @@ repository using this option and then delete branches (or use any
 other Git command that makes any existing commit unreferenced) in the
 source repository, some objects may become unreferenced (or dangling).
 These objects may be removed by normal Git operations (such as `git commit`)
-which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
-If these objects are removed and were referenced by the cloned repository,
-then the cloned repository will become corrupt.
+which automatically call `git maintenance run --auto` and `git gc --auto`.
+(See linkgit:git-maintenance[1] and linkgit:git-gc[1].) If these objects
+are removed and were referenced by the cloned repository, then the cloned
+repository will become corrupt.
 +
 Note that running `git repack` without the `--local` option in a repository
 cloned with `--shared` will copy objects from the source repository into a pack
diff --git a/builtin/am.c b/builtin/am.c
index 69e50de018..ff895125f6 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -1795,7 +1795,7 @@ static void am_run(struct am_state *state, int resume)
 	if (!state->rebasing) {
 		am_destroy(state);
 		close_object_store(the_repository->objects);
-		run_auto_gc(state->quiet);
+		run_auto_maintenance(state->quiet);
 	}
 }
 
diff --git a/builtin/commit.c b/builtin/commit.c
index d1b7396052..658b158659 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1702,7 +1702,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 	git_test_write_commit_graph_or_die();
 
 	repo_rerere(the_repository, 0);
-	run_auto_gc(quiet);
+	run_auto_maintenance(quiet);
 	run_commit_hook(use_editor, get_index_file(), "post-commit", NULL);
 	if (amend && !no_post_rewrite) {
 		commit_post_rewrite(the_repository, current_head, &oid);
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 82ac4be8a5..49a4d727d4 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
 	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
 			N_("report that we have only objects reachable from this object")),
 	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+	OPT_BOOL(0, "maintenance", &enable_auto_gc,
+		 N_("run 'maintenance --auto' after fetching")),
 	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
-		 N_("run 'gc --auto' after fetching")),
+		 N_("run 'maintenance --auto' after fetching")),
 	OPT_BOOL(0, "show-forced-updates", &fetch_show_forced_updates,
 		 N_("check for forced-updates on all updated branches")),
 	OPT_BOOL(0, "write-commit-graph", &fetch_write_commit_graph,
@@ -1882,7 +1884,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	close_object_store(the_repository->objects);
 
 	if (enable_auto_gc)
-		run_auto_gc(verbosity < 0);
+		run_auto_maintenance(verbosity < 0);
 
 	return result;
 }
diff --git a/builtin/merge.c b/builtin/merge.c
index 7da707bf55..c068e73037 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -457,7 +457,7 @@ static void finish(struct commit *head_commit,
 			 * user should see them.
 			 */
 			close_object_store(the_repository->objects);
-			run_auto_gc(verbosity < 0);
+			run_auto_maintenance(verbosity < 0);
 		}
 	}
 	if (new_head && show_diffstat) {
diff --git a/builtin/rebase.c b/builtin/rebase.c
index 37ba76ac3d..0c4ee98f08 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -728,10 +728,10 @@ static int finish_rebase(struct rebase_options *opts)
 	apply_autostash(state_dir_path("autostash", opts));
 	close_object_store(the_repository->objects);
 	/*
-	 * We ignore errors in 'gc --auto', since the
+	 * We ignore errors in 'git maintenance run --auto', since the
 	 * user should see them.
 	 */
-	run_auto_gc(!(opts->flags & (REBASE_NO_QUIET|REBASE_VERBOSE)));
+	run_auto_maintenance(!(opts->flags & (REBASE_NO_QUIET|REBASE_VERBOSE)));
 	if (opts->type == REBASE_MERGE) {
 		struct replay_opts replay = REPLAY_OPTS_INIT;
 
diff --git a/run-command.c b/run-command.c
index 9b3a57d1e3..82ad241638 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
 	return result;
 }
 
-int run_auto_gc(int quiet)
+int run_auto_maintenance(int quiet)
 {
 	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
 	int status;
 
-	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
+	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
 	if (quiet)
 		argv_array_push(&argv_gc_auto, "--quiet");
+	else
+		argv_array_push(&argv_gc_auto, "--no-quiet");
+
 	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
 	argv_array_clear(&argv_gc_auto);
 	return status;
diff --git a/run-command.h b/run-command.h
index 191dfcdafe..d9a800e700 100644
--- a/run-command.h
+++ b/run-command.h
@@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
 /*
  * Trigger an auto-gc
  */
-int run_auto_gc(int quiet);
+int run_auto_maintenance(int quiet);
 
 #define RUN_COMMAND_NO_STDIN 1
 #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index a66dbe0bde..9850ecde5d 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
 		git config fetch.unpackLimit 1 &&
 		git config gc.autoPackLimit 1 &&
 		git config gc.autoDetach false &&
-		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
+		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
 		test_i18ngrep "Auto packing the repository" fetch.out &&
 		! grep "Should I try again" fetch.out
 	)
-- 
gitgitgadget


* [PATCH v2 04/18] maintenance: initialize task array
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 03/18] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 19:57     ` Junio C Hamano
  2020-07-29 22:19     ` Emily Shaffer
  2020-07-23 17:56   ` [PATCH v2 05/18] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
                     ` (15 subsequent siblings)
  19 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of implementing multiple maintenance tasks inside the
'maintenance' builtin, use a list of structs to describe the work to be
done.

The struct maintenance_task stores the name of the task (as given by a
future command-line argument) along with a function pointer to its
implementation and a boolean for whether the step is enabled.

A list of pointers to these structs is initialized with the full list
of implemented tasks along with a default order. For now, this list only
contains the "gc" task. This task is also the only task enabled by
default.
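
The shape of this design can be illustrated with a small shell sketch.
It is an analogy under stated assumptions rather than the C code: TASKS,
run_tasks and the task_<name> naming convention are all hypothetical.
Each entry carries a name and an enabled flag, and the runner stops at
the first task that fails.

```shell
#!/bin/sh
# Illustrative sketch only; the names below are hypothetical. Each task is
# a "name:enabled" entry, listed in the default order, and the
# implementation of task <name> is the shell function task_<name>.
TASKS="gc:1"

task_gc () {
	echo "running gc"
}

run_tasks () {
	for entry in $TASKS
	do
		name=${entry%%:*}
		enabled=${entry##*:}
		# Skip disabled tasks.
		[ "$enabled" = 1 ] || continue
		# Stop at the first task that fails and report its status.
		"task_$name" || return $?
	done
	return 0
}

run_tasks   # prints: running gc
```

Keeping the enabled flag next to the name lets later options turn
individual tasks on or off without changing the runner itself.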

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c8cde28436..c28fb0b16d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -700,6 +700,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
+#define MAX_NUM_TASKS 1
+
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
 	NULL
@@ -729,9 +731,43 @@ static int maintenance_task_gc(void)
 	return result;
 }
 
+typedef int maintenance_task_fn(void);
+
+struct maintenance_task {
+	const char *name;
+	maintenance_task_fn *fn;
+	unsigned enabled:1;
+};
+
+static struct maintenance_task *tasks[MAX_NUM_TASKS];
+static int num_tasks;
+
 static int maintenance_run(void)
 {
-	return maintenance_task_gc();
+	int i;
+	int result = 0;
+
+	for (i = 0; !result && i < num_tasks; i++) {
+		if (!tasks[i]->enabled)
+			continue;
+		result = tasks[i]->fn();
+	}
+
+	return result;
+}
+
+static void initialize_tasks(void)
+{
+	int i;
+	num_tasks = 0;
+
+	for (i = 0; i < MAX_NUM_TASKS; i++)
+		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
+
+	tasks[num_tasks]->name = "gc";
+	tasks[num_tasks]->fn = maintenance_task_gc;
+	tasks[num_tasks]->enabled = 1;
+	num_tasks++;
 }
 
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
@@ -751,6 +787,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 				   builtin_maintenance_options);
 
 	opts.quiet = !isatty(2);
+	initialize_tasks();
 
 	argc = parse_options(argc, argv, prefix,
 			     builtin_maintenance_options,
-- 
gitgitgadget


* [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 04/18] maintenance: initialize task array Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 20:22     ` Junio C Hamano
  2020-07-29  0:22     ` Jeff King
  2020-07-23 17:56   ` [PATCH v2 06/18] maintenance: add --task option Derrick Stolee via GitGitGadget
                     ` (14 subsequent siblings)
  19 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The first new task in the 'git maintenance' builtin is the
'commit-graph' job. It is based on the sequence of events in the
'commit-graph' job in Scalar [1]. This sequence is as follows:

1. git commit-graph write --reachable --split
2. git commit-graph verify --shallow
3. If the verify succeeds, stop.
4. Delete the commit-graph-chain file.
5. git commit-graph write --reachable --split
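
The sequence above can be sketched as a shell function. This is a
minimal illustration, not the code added by this patch:
update_commit_graph and GIT_CMD are hypothetical names, and the object
directory is assumed to be .git/objects unless passed as the first
argument.

```shell
#!/bin/sh
# Illustrative sketch only; update_commit_graph and GIT_CMD are
# hypothetical names. GIT_CMD allows substituting another command for
# 'git', and $1 is the object directory (assumed to be .git/objects).
update_commit_graph () {
	git_cmd=${GIT_CMD:-git}
	objdir=${1:-.git/objects}
	chain="$objdir/info/commit-graphs/commit-graph-chain"

	# Step 1: write a new incremental layer.
	$git_cmd commit-graph write --reachable --split || return 1

	# Steps 2 and 3: verify only the newest layer; stop on success.
	$git_cmd commit-graph verify --shallow && return 0

	# Step 4: delete the chain file so the next write starts over.
	rm -f "$chain" || return 1

	# Step 5: rewrite the commit-graph from scratch.
	$git_cmd commit-graph write --reachable --split
}
```

One difference from the C code in this patch: the sketch uses 'rm -f',
which succeeds when the chain file is already missing, whereas the
builtin calls unlink() and dies if that fails.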

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes most
of the time, with the longer writes following a power law
distribution.

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will very likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

By using 'git commit-graph verify --shallow' we can ensure that
the file we just wrote is valid. This is an extra safety precaution
that is faster than our 'write' subcommand. In the rare situation
that the newest layer of the commit-graph is corrupt, we can "fix"
the corruption by deleting the commit-graph-chain file and rewriting
the full commit-graph as a new one-layer commit graph. This does
not completely prevent _that_ file from being corrupt, but it does
recompute the commit-graph by parsing commits from the object
database. In our use of this step in Scalar and VFS for Git, we
have only seen this issue arise because our microsoft/git fork
reverted 43d3561 ("commit-graph write: don't die if the existing
graph is corrupt" 2019-03-25) for a while to keep commit-graph
writes very fast. We dropped the revert when updating to v2.23.0.
The verify still has potential for catching corrupt data across
the layer boundary: if the new file has commit X with parent Y
in an old file but the commit ID for Y in the old file had a
bitswap, then we will notice that in the 'verify' command.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 18 ++++++++
 builtin/gc.c                      | 74 ++++++++++++++++++++++++++++++-
 commit-graph.c                    |  8 ++--
 commit-graph.h                    |  1 +
 t/t7900-maintenance.sh            |  2 +-
 5 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 089fa4cedc..35b0be7d40 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -35,6 +35,24 @@ run::
 TASKS
 -----
 
+commit-graph::
+	The `commit-graph` job updates the `commit-graph` files incrementally,
+	then verifies that the written data is correct. If the new layer has an
+	issue, then the chain file is removed and the `commit-graph` is
+	rewritten from scratch.
++
+The verification only checks the top layer of the `commit-graph` chain.
+If the incremental write merged the new commits with at least one
+existing layer, then there is potential for on-disk corruption being
+carried forward into the new file. This will be noticed and the new
+commit-graph file will be clean as Git reparses the commit data from
+the object database.
++
+The incremental write is safe to run alongside concurrent Git processes
+since it will not expire `.graph` files that were in the previous
+`commit-graph-chain` file. They will be deleted by a later run based on
+the expiration delay.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index c28fb0b16d..2cd17398ec 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -700,7 +700,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 1
+#define MAX_NUM_TASKS 2
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -712,6 +712,74 @@ static struct maintenance_opts {
 	int quiet;
 } opts;
 
+static int run_write_commit_graph(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "commit-graph", "write",
+			 "--split", "--reachable", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int run_verify_commit_graph(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+
+	argv_array_pushl(&cmd, "commit-graph", "verify",
+			 "--shallow", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int maintenance_task_commit_graph(void)
+{
+	struct repository *r = the_repository;
+	char *chain_path;
+
+	/* Skip commit-graph when --auto is specified. */
+	if (opts.auto_flag)
+		return 0;
+
+	close_object_store(r->objects);
+	if (run_write_commit_graph()) {
+		error(_("failed to write commit-graph"));
+		return 1;
+	}
+
+	if (!run_verify_commit_graph())
+		return 0;
+
+	warning(_("commit-graph verify caught error, rewriting"));
+
+	chain_path = get_commit_graph_chain_filename(r->objects->odb);
+	if (unlink(chain_path)) {
+		UNLEAK(chain_path);
+		die(_("failed to remove commit-graph at %s"), chain_path);
+	}
+	free(chain_path);
+
+	if (!run_write_commit_graph())
+		return 0;
+
+	error(_("failed to rewrite commit-graph"));
+	return 1;
+}
+
 static int maintenance_task_gc(void)
 {
 	int result;
@@ -768,6 +836,10 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
 	num_tasks++;
+
+	tasks[num_tasks]->name = "commit-graph";
+	tasks[num_tasks]->fn = maintenance_task_commit_graph;
+	num_tasks++;
 }
 
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
diff --git a/commit-graph.c b/commit-graph.c
index fdd1c4fa7c..57278a9ab5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -172,7 +172,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
 		       oid_hex);
 }
 
-static char *get_chain_filename(struct object_directory *odb)
+char *get_commit_graph_chain_filename(struct object_directory *odb)
 {
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
@@ -520,7 +520,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	struct stat st;
 	struct object_id *oids;
 	int i = 0, valid = 1, count;
-	char *chain_name = get_chain_filename(odb);
+	char *chain_name = get_commit_graph_chain_filename(odb);
 	FILE *fp;
 	int stat_res;
 
@@ -1635,7 +1635,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	if (ctx->split) {
-		char *lock_name = get_chain_filename(ctx->odb);
+		char *lock_name = get_commit_graph_chain_filename(ctx->odb);
 
 		hold_lock_file_for_update_mode(&lk, lock_name,
 					       LOCK_DIE_ON_ERROR, 0444);
@@ -2012,7 +2012,7 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	if (ctx->split_opts && ctx->split_opts->expire_time)
 		expire_time = ctx->split_opts->expire_time;
 	if (!ctx->split) {
-		char *chain_file_name = get_chain_filename(ctx->odb);
+		char *chain_file_name = get_commit_graph_chain_filename(ctx->odb);
 		unlink(chain_file_name);
 		free(chain_file_name);
 		ctx->num_commit_graphs_after = 0;
diff --git a/commit-graph.h b/commit-graph.h
index 28f89cdf3e..3c202748c3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -25,6 +25,7 @@ struct commit;
 struct bloom_filter_settings;
 
 char *get_commit_graph_filename(struct object_directory *odb);
+char *get_commit_graph_chain_filename(struct object_directory *odb);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
 
 /*
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index e4e4036e50..216ac0b19e 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -12,7 +12,7 @@ test_expect_success 'help text' '
 	test_i18ngrep "usage: git maintenance run" err
 '
 
-test_expect_success 'gc [--auto|--quiet]' '
+test_expect_success 'run [--auto|--quiet]' '
 	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
 	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
-- 
gitgitgadget



* [PATCH v2 06/18] maintenance: add --task option
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 05/18] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 20:21     ` Junio C Hamano
  2020-07-23 17:56   ` [PATCH v2 07/18] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
                     ` (13 subsequent siblings)
  19 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A user may want to only run certain maintenance tasks in a certain
order. Add the --task=<task> option, which allows a user to specify an
ordered list of tasks to run. These cannot be run multiple times,
however.

Here is where our array of maintenance_task pointers becomes critical.
We can sort the array of pointers based on the task order, but we do not
want to move the struct data itself in order to preserve the hashmap
references. We use the hashmap to match the --task=<task> arguments into
the task struct data.
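
The pointer-array sort described above can be sketched in standalone C.
This is only an illustration under stated assumptions, not the code from
this patch: `struct task`, `by_order`, and `sort_tasks` are hypothetical
names, and plain `qsort()` stands in for Git's `QSORT` macro. It shows that
the comparator receives pointers to the array elements, which are
themselves pointers, so each argument is dereferenced once before the
struct is read. Only the pointer array is reordered, so any other reference
to a struct stays valid.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the task struct; not Git's definition. */
struct task {
	const char *name;
	int order;	/* 0 = not selected, else position on the command line */
};

/*
 * qsort() passes pointers to array elements. The elements are
 * 'struct task *', so the arguments are really 'struct task **'.
 */
static int by_order(const void *a_, const void *b_)
{
	const struct task *a = *(const struct task *const *)a_;
	const struct task *b = *(const struct task *const *)b_;

	/* Ascending: unselected (0) first, then 1, 2, ... */
	return (a->order > b->order) - (a->order < b->order);
}

/* Reorders only the pointer array; the structs stay where they are. */
static void sort_tasks(struct task **tasks, size_t nr)
{
	qsort(tasks, nr, sizeof(*tasks), by_order);
}
```

A caller would pass the array of task pointers and its length, then walk
the array and skip entries whose `order` is 0.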

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  4 ++
 builtin/gc.c                      | 64 ++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            | 23 +++++++++++
 3 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 35b0be7d40..9204762e21 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -73,6 +73,10 @@ OPTIONS
 --quiet::
 	Do not report progress or other information over `stderr`.
 
+--task=<task>::
+	If this option is specified one or more times, then only run the
+	specified tasks in the specified order.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/gc.c b/builtin/gc.c
index 2cd17398ec..c58dea6fa5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -710,6 +710,7 @@ static const char * const builtin_maintenance_usage[] = {
 static struct maintenance_opts {
 	int auto_flag;
 	int quiet;
+	int tasks_selected;
 } opts;
 
 static int run_write_commit_graph(void)
@@ -804,20 +805,38 @@ typedef int maintenance_task_fn(void);
 struct maintenance_task {
 	const char *name;
 	maintenance_task_fn *fn;
-	unsigned enabled:1;
+	int task_order;
+	unsigned enabled:1,
+		 selected:1;
 };
 
 static struct maintenance_task *tasks[MAX_NUM_TASKS];
 static int num_tasks;
 
+static int compare_tasks_by_selection(const void *a_, const void *b_)
+{
+	const struct maintenance_task *a, *b;
+	a = *(const struct maintenance_task * const *)a_;
+	b = *(const struct maintenance_task * const *)b_;
+
+	return a->task_order - b->task_order;
+}
+
 static int maintenance_run(void)
 {
 	int i;
 	int result = 0;
 
+	if (opts.tasks_selected)
+		QSORT(tasks, num_tasks, compare_tasks_by_selection);
+
 	for (i = 0; !result && i < num_tasks; i++) {
-		if (!tasks[i]->enabled)
+		if (opts.tasks_selected && !tasks[i]->selected)
+			continue;
+
+		if (!opts.tasks_selected && !tasks[i]->enabled)
 			continue;
+
 		result = tasks[i]->fn();
 	}
 
@@ -842,6 +861,44 @@ static void initialize_tasks(void)
 	num_tasks++;
 }
 
+static int task_option_parse(const struct option *opt,
+			     const char *arg, int unset)
+{
+	int i;
+	struct maintenance_task *task = NULL;
+
+	BUG_ON_OPT_NEG(unset);
+
+	if (!arg || !strlen(arg)) {
+		error(_("--task requires a value"));
+		return 1;
+	}
+
+	opts.tasks_selected++;
+
+	for (i = 0; i < MAX_NUM_TASKS; i++) {
+		if (tasks[i] && !strcasecmp(tasks[i]->name, arg)) {
+			task = tasks[i];
+			break;
+		}
+	}
+
+	if (!task) {
+		error(_("'%s' is not a valid task"), arg);
+		return 1;
+	}
+
+	if (task->selected) {
+		error(_("task '%s' cannot be selected multiple times"), arg);
+		return 1;
+	}
+
+	task->selected = 1;
+	task->task_order = opts.tasks_selected;
+
+	return 0;
+}
+
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_maintenance_options[] = {
@@ -849,6 +906,9 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 			 N_("run tasks based on the state of the repository")),
 		OPT_BOOL(0, "quiet", &opts.quiet,
 			 N_("do not report progress or other information over stderr")),
+		OPT_CALLBACK_F(0, "task", NULL, N_("task"),
+			N_("run a specific task"),
+			PARSE_OPT_NONEG, task_option_parse),
 		OPT_END()
 	};
 
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 216ac0b19e..c09a9eb90b 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -21,4 +21,27 @@ test_expect_success 'run [--auto|--quiet]' '
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
+test_expect_success 'run --task=<task>' '
+	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
+	GIT_TRACE2_EVENT="$(pwd)/run-gc.txt" git maintenance run --task=gc &&
+	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
+	GIT_TRACE2_EVENT="$(pwd)/run-both.txt" git maintenance run --task=commit-graph --task=gc &&
+	! grep ",\"gc\"" run-commit-graph.txt  &&
+	grep ",\"gc\"" run-gc.txt  &&
+	grep ",\"gc\"" run-both.txt  &&
+	grep ",\"commit-graph\",\"write\"" run-commit-graph.txt  &&
+	! grep ",\"commit-graph\",\"write\"" run-gc.txt  &&
+	grep ",\"commit-graph\",\"write\"" run-both.txt
+'
+
+test_expect_success 'run --task=bogus' '
+	test_must_fail git maintenance run --task=bogus 2>err &&
+	test_i18ngrep "is not a valid task" err
+'
+
+test_expect_success 'run --task duplicate' '
+	test_must_fail git maintenance run --task=gc --task=gc 2>err &&
+	test_i18ngrep "cannot be selected multiple times" err
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 07/18] maintenance: take a lock on the objects directory
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 06/18] maintenance: add --task option Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 08/18] maintenance: add prefetch task Derrick Stolee via GitGitGadget
                     ` (12 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Performing maintenance on a Git repository involves writing data to the
.git directory, which is not safe to do with multiple writers attempting
the same operation. Ensure that only one 'git maintenance' process is
running at a time by holding a file-based lock. The mere presence of the
.git/objects/maintenance.lock file will prevent future maintenance. This
lock is never committed, since it does not represent meaningful data.
Instead, it is only a placeholder.

If the lock file already exists, then fail silently. This will become
very important later when we implement the 'prefetch' task, as this is our
stop-gap from creating a recursive process loop between 'git fetch' and
'git maintenance run'.
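
The idea of "the file's presence is the lock" can be sketched with plain
POSIX calls. This is a minimal illustration, not what the patch does: the
patch uses Git's lockfile API (hold_lock_file_for_update() and
rollback_lock_file()), which also handles cleanup that this sketch omits.
The names `take_lock` and `release_lock` are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to create the lock file at 'path'. O_CREAT | O_EXCL makes the
 * creation fail with EEXIST when the file already exists, so only one
 * process can succeed.
 *
 * Returns 0 if we now hold the lock, 1 if someone else holds it,
 * and -1 on any other error.
 */
static int take_lock(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);

	if (fd < 0)
		return errno == EEXIST ? 1 : -1;
	close(fd);	/* the file's existence is the lock; contents are unused */
	return 0;
}

/* Drop the lock by removing the placeholder file. */
static int release_lock(const char *path)
{
	return unlink(path);
}
```

A caller that gets 1 back from take_lock() would skip its work and return
success, which mirrors the "skip maintenance if locked" behavior described
above.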

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index c58dea6fa5..5d99b4b805 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -826,6 +826,25 @@ static int maintenance_run(void)
 {
 	int i;
 	int result = 0;
+	struct lock_file lk;
+	struct repository *r = the_repository;
+	char *lock_path = xstrfmt("%s/maintenance", r->objects->odb->path);
+
+	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
+		/*
+		 * Another maintenance command is running.
+		 *
+		 * If --auto was provided, then it is likely due to a
+		 * recursive process stack. Do not report an error in
+		 * that case.
+		 */
+		if (!opts.auto_flag && !opts.quiet)
+			error(_("lock file '%s' exists, skipping maintenance"),
+			      lock_path);
+		free(lock_path);
+		return 0;
+	}
+	free(lock_path);
 
 	if (opts.tasks_selected)
 		QSORT(tasks, num_tasks, compare_tasks_by_selection);
@@ -840,6 +859,7 @@ static int maintenance_run(void)
 		result = tasks[i]->fn();
 	}
 
+	rollback_lock_file(&lk);
 	return result;
 }
 
-- 
gitgitgadget



* [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 07/18] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 20:53     ` Junio C Hamano
  2020-07-25  1:37     ` Đoàn Trần Công Danh
  2020-07-23 17:56   ` [PATCH v2 09/18] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  19 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

The task is called "prefetch" because it is work done in advance
of a foreground fetch to make that 'git fetch' command much faster.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.

3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/prefetch/<remote> refs that no
   longer appear on the remote.
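
The options above combine into a single command line per remote. The
following standalone sketch assembles it; `prefetch_refspec` and
`prefetch_argv` are hypothetical helpers written for this illustration
(the patch itself uses argv_array and strbuf), and the caller is assumed
to supply an array with room for at least eight entries.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write "+refs/heads/*:refs/prefetch/<remote>/*" into buf.
 * Returns 0 on success, -1 if the result did not fit.
 */
static int prefetch_refspec(char *buf, size_t len, const char *remote)
{
	int n = snprintf(buf, len, "+refs/heads/*:refs/prefetch/%s/*", remote);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Fill argv (at least 8 slots) with the fetch command described above. */
static void prefetch_argv(const char **argv, const char *remote,
			  const char *refspec)
{
	int i = 0;

	argv[i++] = "git";
	argv[i++] = "fetch";
	argv[i++] = remote;
	argv[i++] = "--prune";		/* drop prefetch refs gone from the remote */
	argv[i++] = "--no-tags";	/* leave tags to foreground fetches */
	argv[i++] = "--refmap=";	/* ignore the configured refspec */
	argv[i++] = refspec;		/* store results under refs/prefetch/ */
	argv[i] = NULL;
}
```

For a remote named "origin" the refspec becomes
"+refs/heads/*:refs/prefetch/origin/*", so nothing under refs/heads or
refs/remotes is touched.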

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paired with --no-show-forced-updates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 12 ++++++
 builtin/gc.c                      | 64 ++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            | 24 ++++++++++++
 3 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 9204762e21..0927643247 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
 `commit-graph-chain` file. They will be deleted by a later run based on
 the expiration delay.
 
+prefetch::
+	The `prefetch` task updates the object directory with the latest objects
+	from all registered remotes. For each remote, a `git fetch` command
+	is run. The refmap is custom to avoid updating local or remote
+	branches (those in `refs/heads` or `refs/remotes`). Instead, the
+	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
+	not updated.
++
+This means that foreground fetches are still required to update the
+remote refs, but the user is notified when the branches and tags are
+updated on the remote.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index 5d99b4b805..969c127877 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -28,6 +28,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "promisor-remote.h"
+#include "remote.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -700,7 +701,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 2
+#define MAX_NUM_TASKS 3
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -781,6 +782,63 @@ static int maintenance_task_commit_graph(void)
 	return 1;
 }
 
+static int fetch_remote(const char *remote)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf refmap = STRBUF_INIT;
+
+	argv_array_pushl(&cmd, "fetch", remote, "--prune",
+			 "--no-tags", "--refmap=", NULL);
+
+	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
+	argv_array_push(&cmd, refmap.buf);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--quiet");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	strbuf_release(&refmap);
+	return result;
+}
+
+static int fill_each_remote(struct remote *remote, void *cbdata)
+{
+	struct string_list *remotes = (struct string_list *)cbdata;
+
+	string_list_append(remotes, remote->name);
+	return 0;
+}
+
+static int maintenance_task_prefetch(void)
+{
+	int result = 0;
+	struct string_list_item *item;
+	struct string_list remotes = STRING_LIST_INIT_DUP;
+
+	if (for_each_remote(fill_each_remote, &remotes)) {
+		error(_("failed to fill remotes"));
+		result = 1;
+		goto cleanup;
+	}
+
+	/*
+	 * Do not modify the result based on the success of the 'fetch'
+	 * operation, as a loss of network could cause 'fetch' to fail
+	 * quickly. We do not want that to stop the rest of our
+	 * background operations.
+	 */
+	for (item = remotes.items;
+	     item && item < remotes.items + remotes.nr;
+	     item++)
+		fetch_remote(item->string);
+
+cleanup:
+	string_list_clear(&remotes, 0);
+	return result;
+}
+
 static int maintenance_task_gc(void)
 {
 	int result;
@@ -871,6 +929,10 @@ static void initialize_tasks(void)
 	for (i = 0; i < MAX_NUM_TASKS; i++)
 		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
 
+	tasks[num_tasks]->name = "prefetch";
+	tasks[num_tasks]->fn = maintenance_task_prefetch;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index c09a9eb90b..8b04a04c79 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -44,4 +44,28 @@ test_expect_success 'run --task duplicate' '
 	test_i18ngrep "cannot be selected multiple times" err
 '
 
+test_expect_success 'run --task=prefetch with no remotes' '
+	git maintenance run --task=prefetch 2>err &&
+	test_must_be_empty err
+'
+
+test_expect_success 'prefetch multiple remotes' '
+	git clone . clone1 &&
+	git clone . clone2 &&
+	git remote add remote1 "file://$(pwd)/clone1" &&
+	git remote add remote2 "file://$(pwd)/clone2" &&
+	git -C clone1 switch -c one &&
+	git -C clone2 switch -c two &&
+	test_commit -C clone1 one &&
+	test_commit -C clone2 two &&
+	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch &&
+	grep ",\"fetch\",\"remote1\"" run-prefetch.txt &&
+	grep ",\"fetch\",\"remote2\"" run-prefetch.txt &&
+	test_path_is_missing .git/refs/remotes &&
+	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
+	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
+	git log prefetch/remote1/one &&
+	git log prefetch/remote2/two
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 09/18] maintenance: add loose-objects task
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 08/18] maintenance: add prefetch task Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 20:59     ` Junio C Hamano
  2020-07-29 22:21     ` Emily Shaffer
  2020-07-23 17:56   ` [PATCH v2 10/18] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  19 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file. This has potential of
   happening in partial clones if someone runs 'git grep' or
   otherwise evades the batch-download feature for requesting
   promisor objects.
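
The batching in step 2 relies on a callback that can stop the directory
walk early by returning nonzero. A minimal standalone sketch of that
pattern follows. Everything in it is hypothetical: `for_each_id` stands in
for the loose-object iterator, `add_to_batch` for the callback that feeds
'git pack-objects', and this version stops at exactly `limit` entries.

```c
#include <stddef.h>

struct batch {
	size_t count;	/* ids accepted so far */
	size_t limit;	/* stop once this many have been accepted */
};

typedef int (*each_id_fn)(const char *id, void *data);

/*
 * Call fn on each id in order. A nonzero return from fn stops the
 * walk and is passed back to the caller.
 */
static int for_each_id(const char *const *ids, size_t nr,
		       each_id_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(ids[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Accept one id and ask the walk to stop once the batch is full. */
static int add_to_batch(const char *id, void *data)
{
	struct batch *b = data;

	(void)id;	/* a real callback would write the id to the packer */
	return ++b->count >= b->limit;
}
```

Because the walk returns as soon as the batch is full, the cost of one run
is bounded by the batch size rather than by the total number of loose
objects.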

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  11 ++++
 builtin/gc.c                      | 106 +++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh            |  35 ++++++++++
 3 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 0927643247..557915a653 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -73,6 +73,17 @@ gc::
 	It can also be disruptive in some situations, as it deletes stale
 	data.
 
+loose-objects::
+	The `loose-objects` job cleans up loose objects and places them into
+	pack-files. In order to prevent race conditions with concurrent Git
+	commands, it follows a two-step process. First, it deletes any loose
+	objects that already exist in a pack-file; concurrent Git processes
+	will examine the pack-file for the object data instead of the loose
+	object. Second, it creates a new pack-file (starting with "loose-")
+	containing a batch of loose objects. The batch size is limited to 50
+	thousand objects to prevent the job from taking too long on a
+	repository with many loose objects.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 969c127877..fa65304580 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -701,7 +701,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 3
+#define MAX_NUM_TASKS 4
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -858,6 +858,106 @@ static int maintenance_task_gc(void)
 	return result;
 }
 
+static int prune_packed(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "prune-packed", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--quiet");
+
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+struct write_loose_object_data {
+	FILE *in;
+	int count;
+	int batch_size;
+};
+
+static int loose_object_exists(const struct object_id *oid,
+			       const char *path,
+			       void *data)
+{
+	return 1;
+}
+
+static int write_loose_object_to_stdin(const struct object_id *oid,
+				       const char *path,
+				       void *data)
+{
+	struct write_loose_object_data *d = (struct write_loose_object_data *)data;
+
+	fprintf(d->in, "%s\n", oid_to_hex(oid));
+
+	return ++(d->count) > d->batch_size;
+}
+
+static int pack_loose(void)
+{
+	struct repository *r = the_repository;
+	int result = 0;
+	struct write_loose_object_data data;
+	struct strbuf prefix = STRBUF_INIT;
+	struct child_process *pack_proc;
+
+	/*
+	 * Do not start pack-objects process
+	 * if there are no loose objects.
+	 */
+	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
+					   loose_object_exists,
+					   NULL, NULL, NULL))
+		return 0;
+
+	pack_proc = xmalloc(sizeof(*pack_proc));
+
+	child_process_init(pack_proc);
+
+	strbuf_addstr(&prefix, r->objects->odb->path);
+	strbuf_addstr(&prefix, "/pack/loose");
+
+	argv_array_pushl(&pack_proc->args, "git", "pack-objects", NULL);
+	if (opts.quiet)
+		argv_array_push(&pack_proc->args, "--quiet");
+	argv_array_push(&pack_proc->args, prefix.buf);
+
+	pack_proc->in = -1;
+
+	if (start_command(pack_proc)) {
+		error(_("failed to start 'git pack-objects' process"));
+		result = 1;
+		goto cleanup;
+	}
+
+	data.in = xfdopen(pack_proc->in, "w");
+	data.count = 0;
+	data.batch_size = 50000;
+
+	for_each_loose_file_in_objdir(r->objects->odb->path,
+				      write_loose_object_to_stdin,
+				      NULL,
+				      NULL,
+				      &data);
+
+	fclose(data.in);
+
+	if (finish_command(pack_proc)) {
+		error(_("failed to finish 'git pack-objects' process"));
+		result = 1;
+	}
+
+cleanup:
+	strbuf_release(&prefix);
+	free(pack_proc);
+	return result;
+}
+
+static int maintenance_task_loose_objects(void)
+{
+	return prune_packed() || pack_loose();
+}
+
 typedef int maintenance_task_fn(void);
 
 struct maintenance_task {
@@ -933,6 +1033,10 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->fn = maintenance_task_prefetch;
 	num_tasks++;
 
+	tasks[num_tasks]->name = "loose-objects";
+	tasks[num_tasks]->fn = maintenance_task_loose_objects;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 8b04a04c79..94bb493733 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -68,4 +68,39 @@ test_expect_success 'prefetch multiple remotes' '
 	git log prefetch/remote2/two
 '
 
+test_expect_success 'loose-objects task' '
+	# Repack everything so we know the state of the object dir
+	git repack -adk &&
+
+	# Hack to stop maintenance from running during "git commit"
+	echo in use >.git/objects/maintenance.lock &&
+	test_commit create-loose-object &&
+	rm .git/objects/maintenance.lock &&
+
+	ls .git/objects >obj-dir-before &&
+	test_file_not_empty obj-dir-before &&
+	ls .git/objects/pack/*.pack >packs-before &&
+	test_line_count = 1 packs-before &&
+
+	# The first run creates a pack-file
+	# but does not delete loose objects.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-between &&
+	test_cmp obj-dir-before obj-dir-between &&
+	ls .git/objects/pack/*.pack >packs-between &&
+	test_line_count = 2 packs-between &&
+
+	# The second run deletes loose objects
+	# but does not create a pack-file.
+	git maintenance run --task=loose-objects &&
+	ls .git/objects >obj-dir-after &&
+	cat >expect <<-\EOF &&
+	info
+	pack
+	EOF
+	test_cmp expect obj-dir-after &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_cmp packs-between packs-after
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 10/18] maintenance: add incremental-repack task
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 09/18] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 22:00     ` Junio C Hamano
  2020-07-29 22:22     ` Emily Shaffer
  2020-07-23 17:56   ` [PATCH v2 11/18] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  19 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' task, which can be run safely in the background. Add
a similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'incremental-repack' task runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updates the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those pack-files. The set of
   pack-files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximates how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the incremental-repack task will delete these
   repacked pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.
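
The greedy selection described in step 3 can be sketched as follows.
This is a minimal illustration based only on the description above,
not the code in midx.c: the names 'pack_info', 'expected_size' and
'select_batch' are hypothetical, the input is assumed to be sorted by
modified time already, and the special case of a batch size of zero
is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one pack-file listed in the multi-pack-index. */
struct pack_info {
	uint64_t pack_size;          /* size of the pack-file on disk, in bytes */
	uint32_t num_objects;        /* number of objects stored in the pack-file */
	uint32_t referenced_objects; /* objects the multi-pack-index locates in this pack */
};

/* pack size / objects in pack * objects referenced by the multi-pack-index */
static uint64_t expected_size(const struct pack_info *p)
{
	if (!p->num_objects)
		return 0;
	return p->pack_size / p->num_objects * p->referenced_objects;
}

/*
 * 'packs' is assumed to be sorted by modified time, oldest first.
 * Sets include[i] to 1 for every selected pack and returns how many
 * packs were selected.
 */
static size_t select_batch(const struct pack_info *packs, size_t nr,
			   uint64_t batch_size, unsigned char *include)
{
	uint64_t total = 0;
	size_t selected = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		include[i] = 0;

	/* Stop once the selected packs add up to at least the batch size. */
	for (i = 0; i < nr && total < batch_size; i++) {
		uint64_t size = expected_size(&packs[i]);

		/* A pack whose expected size reaches the batch size is skipped. */
		if (size >= batch_size)
			continue;

		include[i] = 1;
		total += size;
		selected++;
	}

	return selected;
}
```

With expected sizes of 1000, 200, 300 and 100 (oldest first) and a
batch size of 500, this sketch skips the first pack, selects the
second and third, and stops before looking at the fourth.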

Each of the above steps updates the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was all done without removing objects
from the system, using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  15 ++++
 builtin/gc.c                      | 121 +++++++++++++++++++++++++++++-
 midx.c                            |   2 +-
 midx.h                            |   1 +
 t/t7900-maintenance.sh            |  37 +++++++++
 5 files changed, 174 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 557915a653..bda8df4aaa 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -84,6 +84,21 @@ loose-objects::
 	thousand objects to prevent the job from taking too long on a
 	repository with many loose objects.
 
+incremental-repack::
+	The `incremental-repack` job repacks the object directory
+	using the `multi-pack-index` feature. In order to prevent race
+	conditions with concurrent Git commands, it follows a two-step
+	process. First, it deletes any pack-files included in the
+	`multi-pack-index` where none of the objects in the
+	`multi-pack-index` reference those pack-files; this only happens
+	if all objects in the pack-file are also stored in a newer
+	pack-file. Second, it selects a group of pack-files whose "expected
+	size" is below the batch size until the group has total expected
+	size at least the batch size; see the `--batch-size` option for
+	the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
+	default batch-size is zero, which is a special case that attempts
+	to repack all pack-files into a single pack-file.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index fa65304580..eb4b01c104 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -29,6 +29,7 @@
 #include "tree.h"
 #include "promisor-remote.h"
 #include "remote.h"
+#include "midx.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -701,7 +702,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-#define MAX_NUM_TASKS 4
+#define MAX_NUM_TASKS 5
 
 static const char * const builtin_maintenance_usage[] = {
 	N_("git maintenance run [<options>]"),
@@ -958,6 +959,120 @@ static int maintenance_task_loose_objects(void)
 	return prune_packed() || pack_loose();
 }
 
+static int multi_pack_index_write(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "write", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int rewrite_multi_pack_index(void)
+{
+	struct repository *r = the_repository;
+	char *midx_name = get_midx_filename(r->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	if (multi_pack_index_write()) {
+		error(_("failed to rewrite multi-pack-index"));
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_verify(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "verify", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int multi_pack_index_expire(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "expire", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	close_object_store(the_repository->objects);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	argv_array_clear(&cmd);
+
+	return result;
+}
+
+static int multi_pack_index_repack(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
+
+	if (opts.quiet)
+		argv_array_push(&cmd, "--no-progress");
+
+	argv_array_push(&cmd, "--batch-size=0");
+
+	close_object_store(the_repository->objects);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	if (result && multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after repack"));
+		result = rewrite_multi_pack_index();
+	}
+
+	return result;
+}
+
+static int maintenance_task_incremental_repack(void)
+{
+	if (multi_pack_index_write()) {
+		error(_("failed to write multi-pack-index"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after initial write"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_expire()) {
+		error(_("multi-pack-index expire failed"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after expire"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_repack()) {
+		error(_("multi-pack-index repack failed"));
+		return 1;
+	}
+
+	return 0;
+}
+
 typedef int maintenance_task_fn(void);
 
 struct maintenance_task {
@@ -1037,6 +1152,10 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->fn = maintenance_task_loose_objects;
 	num_tasks++;
 
+	tasks[num_tasks]->name = "incremental-repack";
+	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
+	num_tasks++;
+
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
 	tasks[num_tasks]->enabled = 1;
diff --git a/midx.c b/midx.c
index 6d1584ca51..57a8a00082 100644
--- a/midx.c
+++ b/midx.c
@@ -36,7 +36,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index b18cf53bc4..baeecc70c9 100644
--- a/midx.h
+++ b/midx.h
@@ -37,6 +37,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 94bb493733..3ec813979a 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -103,4 +103,41 @@ test_expect_success 'loose-objects task' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'incremental-repack task' '
+	packDir=.git/objects/pack &&
+	for i in $(test_seq 1 5)
+	do
+		test_commit $i || return 1
+	done &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
+	test_tick &&
+	git pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+	test_tick &&
+	git pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+	rm -f $packDir/pack-* &&
+	rm -f $packDir/loose-* &&
+	ls $packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git maintenance run --task=incremental-repack &&
+	ls $packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git maintenance run --task=incremental-repack &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 10/18] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 22:15     ` Junio C Hamano
  2020-07-29 22:23     ` Emily Shaffer
  2020-07-23 17:56   ` [PATCH v2 12/18] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  19 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When repacking during the 'incremental-repack' task, we use the
--batch-size option in 'git multi-pack-index repack'. The initial setting
used --batch-size=0 to repack everything into a single pack-file. This is
not sustainable for a large repository. The amount of work required is
also likely to use too many system resources for a background job.

Update the 'incremental-repack' task by dynamically computing a
--batch-size option based on the current pack-file structure.

The dynamic default size is computed with this idea in mind for a client
repository that was cloned from a very large remote: there is likely one
"big" pack-file that was created at clone time. Thus, do not try
repacking it as it is likely packed efficiently by the server.

Instead, we select the second-largest pack-file, and create a batch size
that is one larger than that pack-file. If there are three or more
pack-files, then this guarantees that at least two will be combined into
a new pack-file.
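
As a worked illustration of this rule, the following standalone
sketch computes the batch size from a plain array of pack-file sizes.
The function name 'auto_batch_size' and the array input are
assumptions made for the example; the patch itself walks the list
returned by get_all_packs().

```c
#include <stddef.h>
#include <stdint.h>

/* Same cap as the TWO_GIGABYTES constant in the patch. */
#define MAX_BATCH_SIZE 2147483647ULL

/*
 * Hypothetical helper: return one more than the second-largest size
 * in 'pack_sizes', capped at MAX_BATCH_SIZE.
 */
static uint64_t auto_batch_size(const uint64_t *pack_sizes, size_t nr)
{
	uint64_t largest = 0;
	uint64_t second_largest = 0;
	uint64_t result;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pack_sizes[i] > largest) {
			second_largest = largest;
			largest = pack_sizes[i];
		} else if (pack_sizes[i] > second_largest) {
			second_largest = pack_sizes[i];
		}
	}

	result = second_largest + 1;
	if (result > MAX_BATCH_SIZE)
		result = MAX_BATCH_SIZE;

	return result;
}
```

For pack-files of 5000, 300 and 200 bytes the result is 301, so the
two small pack-files fall below the batch size while the large one
does not.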

Of course, this means that the second-largest pack-file size is likely
to grow over time and may eventually surpass the initially-cloned
pack-file. Recall that the pack-file batch is selected in a greedy
manner: the packs are considered from oldest to newest and are selected
if they have size smaller than the batch size until the total selected
size is larger than the batch size. Thus, that oldest "clone" pack will
be first to repack after the new data creates a pack larger than that.

We also want to place some limits on how large these pack-files become,
in order to bound the amount of time spent repacking. A maximum
batch-size of two gigabytes means that large repositories will never be
packed into a single pack-file using this job, but such a full repack is
rather expensive anyway. This is a trade-off that is valuable to have if the
maintenance is being run automatically or in the background. Users who
truly want to optimize for space and performance (and are willing to pay
the upfront cost of a full repack) can use the 'gc' task to do so.

Reported-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c           | 48 +++++++++++++++++++++++++++++++++++++++++-
 t/t7900-maintenance.sh |  5 +++--
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index eb4b01c104..889d97afe7 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
 	return result;
 }
 
+#define TWO_GIGABYTES (2147483647)
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+
+static off_t get_auto_pack_size(void)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly.
+	 *
+	 * The strategy we select here is to select a
+	 * size that is one more than the second largest
+	 * pack-file. This ensures that we will repack
+	 * at least two packs if there are three or more
+	 * packs.
+	 */
+	off_t max_size = 0;
+	off_t second_largest_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+	struct repository *r = the_repository;
+
+	reprepare_packed_git(r);
+	for (p = get_all_packs(r); p; p = p->next) {
+		if (p->pack_size > max_size) {
+			second_largest_size = max_size;
+			max_size = p->pack_size;
+		} else if (p->pack_size > second_largest_size)
+			second_largest_size = p->pack_size;
+	}
+
+	result_size = second_largest_size + 1;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
 static int multi_pack_index_repack(void)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+
 	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
 
 	if (opts.quiet)
 		argv_array_push(&cmd, "--no-progress");
 
-	argv_array_push(&cmd, "--batch-size=0");
+	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
+		    (uintmax_t)get_auto_pack_size());
+	argv_array_push(&cmd, batch_arg.buf);
 
 	close_object_store(the_repository->objects);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+	strbuf_release(&batch_arg);
 
 	if (result && multi_pack_index_verify()) {
 		warning(_("multi-pack-index verify failed after repack"));
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 3ec813979a..ab5c961eb9 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -134,10 +134,11 @@ test_expect_success 'incremental-repack task' '
 	test_line_count = 4 packs-between &&
 
 	# the job deletes the two old packs, and does not write
-	# a new one because only one pack remains.
+	# a new one because the batch size is not high enough to
+	# pack the largest pack-file.
 	git maintenance run --task=incremental-repack &&
 	ls .git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	test_line_count = 2 packs-after
 '
 
 test_done
-- 
gitgitgadget



* [PATCH v2 12/18] maintenance: create maintenance.<task>.enabled config
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 11/18] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 13/18] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Currently, a normal run of "git maintenance run" will only run the 'gc'
task, as it is the only one enabled. This is mostly for backwards-
compatibility reasons since "git maintenance run --auto" commands replaced
previous "git gc --auto" commands after some Git processes. Users could
manually run specific maintenance tasks by calling "git maintenance run
--task=<task>" directly.

Allow users to customize which steps are run automatically using config.
The 'maintenance.<task>.enabled' option then can turn on these other
tasks (or turn off the 'gc' task).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt             |  2 ++
 Documentation/config/maintenance.txt |  4 ++++
 Documentation/git-maintenance.txt    |  6 +++++-
 builtin/gc.c                         | 13 +++++++++++++
 t/t7900-maintenance.sh               | 12 ++++++++++++
 5 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/config/maintenance.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..2783b825f9 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -396,6 +396,8 @@ include::config/mailinfo.txt[]
 
 include::config/mailmap.txt[]
 
+include::config/maintenance.txt[]
+
 include::config/man.txt[]
 
 include::config/merge.txt[]
diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
new file mode 100644
index 0000000000..370cbfb42f
--- /dev/null
+++ b/Documentation/config/maintenance.txt
@@ -0,0 +1,4 @@
+maintenance.<task>.enabled::
+	This boolean config option controls whether the maintenance task
+	with name `<task>` is run when no `--task` option is specified.
+	By default, only `maintenance.gc.enabled` is true.
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index bda8df4aaa..4a61441bbc 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -30,7 +30,11 @@ SUBCOMMANDS
 -----------
 
 run::
-	Run one or more maintenance tasks.
+	Run one or more maintenance tasks. If one or more `--task` options
+	are specified, then those tasks are run in that order. Otherwise,
+	the tasks are determined by which `maintenance.<task>.enabled`
+	config options are true. By default, only `maintenance.gc.enabled`
+	is true.
 
 TASKS
 -----
diff --git a/builtin/gc.c b/builtin/gc.c
index 889d97afe7..b6dc4b1832 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1185,6 +1185,7 @@ static int maintenance_run(void)
 static void initialize_tasks(void)
 {
 	int i;
+	struct strbuf config_name = STRBUF_INIT;
 	num_tasks = 0;
 
 	for (i = 0; i < MAX_NUM_TASKS; i++)
@@ -1210,6 +1211,18 @@ static void initialize_tasks(void)
 	tasks[num_tasks]->name = "commit-graph";
 	tasks[num_tasks]->fn = maintenance_task_commit_graph;
 	num_tasks++;
+
+	for (i = 0; i < num_tasks; i++) {
+		int config_value;
+
+		strbuf_setlen(&config_name, 0);
+		strbuf_addf(&config_name, "maintenance.%s.enabled", tasks[i]->name);
+
+		if (!git_config_get_bool(config_name.buf, &config_value))
+			tasks[i]->enabled = config_value;
+	}
+
+	strbuf_release(&config_name);
 }
 
 static int task_option_parse(const struct option *opt,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index ab5c961eb9..3ee51723e0 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -21,6 +21,18 @@ test_expect_success 'run [--auto|--quiet]' '
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
+test_expect_success 'maintenance.<task>.enabled' '
+	git config maintenance.gc.enabled false &&
+	git config maintenance.commit-graph.enabled true &&
+	git config maintenance.loose-objects.enabled true &&
+	GIT_TRACE2_EVENT="$(pwd)/run-config.txt" git maintenance run &&
+	! grep ",\"fetch\"" run-config.txt &&
+	! grep ",\"gc\"" run-config.txt &&
+	! grep ",\"multi-pack-index\"" run-config.txt &&
+	grep ",\"commit-graph\"" run-config.txt &&
+	grep ",\"prune-packed\"" run-config.txt
+'
+
 test_expect_success 'run --task=<task>' '
 	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
 	GIT_TRACE2_EVENT="$(pwd)/run-gc.txt" git maintenance run --task=gc &&
-- 
gitgitgadget



* [PATCH v2 13/18] maintenance: use pointers to check --auto
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (11 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 12/18] maintenance: create maintenance.<task>.enabled config Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 14/18] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'git maintenance run' command has an '--auto' option. This is used
by other Git commands such as 'git commit' or 'git fetch' to check if
maintenance should be run after adding data to the repository.

Previously, this --auto option was only used to add the argument to the
'git gc' command as part of the 'gc' task. We will be expanding the
other tasks to perform a check to see if they should do work as part of
the --auto flag, when they are enabled by config.

First, update the 'gc' task to perform the auto check inside the
maintenance process. This prevents running an extra 'git gc --auto'
command when not needed. It also shows a model for other tasks.

Second, use the 'auto_condition' function pointer as a signal for
whether we enable the maintenance task under '--auto'. For instance, we
do not want to enable the 'fetch' task in '--auto' mode, so that
function pointer will remain NULL.
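
The gating rule can be illustrated in isolation with the sketch
below. It is a simplified model, not the patch: 'toy_task',
'should_run' and the two example condition functions are
hypothetical names, and the handling of explicitly selected
'--task=<task>' options is left out.

```c
#include <stddef.h>

typedef int auto_fn(void);

/* Hypothetical, trimmed-down stand-in for struct maintenance_task. */
struct toy_task {
	const char *name;
	auto_fn *auto_condition; /* NULL: the task never runs under --auto */
	unsigned enabled:1;
};

/* Example conditions: one that always asks to run, one that never does. */
static int condition_met(void)
{
	return 1;
}

static int condition_not_met(void)
{
	return 0;
}

/* Returns 1 if the task should run for this invocation, 0 otherwise. */
static int should_run(const struct toy_task *t, int auto_flag)
{
	if (!t->enabled)
		return 0;

	/* Under --auto, a missing or failing condition skips the task. */
	if (auto_flag && (!t->auto_condition || !t->auto_condition()))
		return 0;

	return 1;
}
```

In this model an enabled task with a NULL condition still runs
without '--auto', but is always skipped with it.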

Now that we are not automatically calling 'git gc', a test in
t5514-fetch-multiple.sh must be changed to watch for 'git maintenance'
instead.

We continue to pass the '--auto' option to the 'git gc' command when
necessary, because the gc.autoDetach config option changes behavior.
Likely, we will want to absorb the daemonizing behavior implied by
gc.autoDetach as a maintenance.autoDetach config option.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c              | 15 +++++++++++++++
 t/t5514-fetch-multiple.sh |  2 +-
 t/t7900-maintenance.sh    |  2 +-
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index b6dc4b1832..31696a2595 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1121,9 +1121,17 @@ static int maintenance_task_incremental_repack(void)
 
 typedef int maintenance_task_fn(void);
 
+/*
+ * An auto condition function returns 1 if the task should run
+ * and 0 if the task should NOT run. See need_to_gc() for an
+ * example.
+ */
+typedef int maintenance_auto_fn(void);
+
 struct maintenance_task {
 	const char *name;
 	maintenance_task_fn *fn;
+	maintenance_auto_fn *auto_condition;
 	int task_order;
 	unsigned enabled:1,
 		 selected:1;
@@ -1175,6 +1183,11 @@ static int maintenance_run(void)
 		if (!opts.tasks_selected && !tasks[i]->enabled)
 			continue;
 
+		if (opts.auto_flag &&
+		    (!tasks[i]->auto_condition ||
+		     !tasks[i]->auto_condition()))
+			continue;
+
 		result = tasks[i]->fn();
 	}
 
@@ -1205,6 +1218,7 @@ static void initialize_tasks(void)
 
 	tasks[num_tasks]->name = "gc";
 	tasks[num_tasks]->fn = maintenance_task_gc;
+	tasks[num_tasks]->auto_condition = need_to_gc;
 	tasks[num_tasks]->enabled = 1;
 	num_tasks++;
 
@@ -1283,6 +1297,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 				   builtin_maintenance_options);
 
 	opts.quiet = !isatty(2);
+	gc_config();
 	initialize_tasks();
 
 	argc = parse_options(argc, argv, prefix,
diff --git a/t/t5514-fetch-multiple.sh b/t/t5514-fetch-multiple.sh
index de8e2f1531..bd202ec6f3 100755
--- a/t/t5514-fetch-multiple.sh
+++ b/t/t5514-fetch-multiple.sh
@@ -108,7 +108,7 @@ test_expect_success 'git fetch --multiple (two remotes)' '
 	 GIT_TRACE=1 git fetch --multiple one two 2>trace &&
 	 git branch -r > output &&
 	 test_cmp ../expect output &&
-	 grep "built-in: git gc" trace >gc &&
+	 grep "built-in: git maintenance" trace >gc &&
 	 test_line_count = 1 gc
 	)
 '
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 3ee51723e0..373b8dbe04 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -17,7 +17,7 @@ test_expect_success 'run [--auto|--quiet]' '
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
 	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
 	grep ",\"gc\"]" run-no-auto.txt  &&
-	grep ",\"gc\",\"--auto\"" run-auto.txt &&
+	! grep ",\"gc\",\"--auto\"" run-auto.txt &&
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
-- 
gitgitgadget



* [PATCH v2 14/18] maintenance: add auto condition for commit-graph task
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (12 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 13/18] maintenance: use pointers to check --auto Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 15/18] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of writing a new commit-graph in every 'git maintenance run
--auto' process (when maintenance.commit-graph.enabled is configured to
be true), only write when there are "enough" commits not in a
commit-graph file.

This count is controlled by the maintenance.commit-graph.auto config
option.

To compute the count, use a depth-first search starting at each ref, and
leaving markers using the PARENT1 flag. If this count reaches the limit,
then terminate early and start the task. Otherwise, this operation will
peel every ref and parse the commit it points to. If these are all in
the commit-graph, then this is typically a very fast operation. Users
with many refs might feel a slow-down, and hence could consider updating
their limit to be very small. A negative value will force the step to
run every time.
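
The counting walk and the meaning of the limit can be sketched on a
toy commit array as follows. Everything here is an assumption made
for illustration: 'toy_commit', 'count_from_tip' and 'should_write'
are hypothetical, commits are array indices with at most two parents,
and the 'marked' field stands in for the PARENT1 flag. As in the
patch's loop, only parents reached from a tip are counted, and marks
are not cleared afterwards.

```c
#include <stdlib.h>

#define NO_PARENT (-1)

/* Hypothetical stand-in for a parsed commit. */
struct toy_commit {
	int parents[2]; /* indices into the array, NO_PARENT if absent */
	int in_graph;   /* nonzero if already in the commit-graph file */
	int marked;     /* plays the role of the PARENT1 flag */
};

/*
 * Walk the history below 'tip', counting unmarked commits that are
 * not in the commit-graph. Returns 1 as soon as *count reaches
 * 'limit', and 0 otherwise (including on allocation failure).
 */
static int count_from_tip(struct toy_commit *c, size_t nr, int tip,
			  int limit, int *count)
{
	int *stack = malloc(nr * sizeof(*stack));
	size_t top = 0;
	int result = 0;

	if (!stack)
		return 0;

	stack[top++] = tip;

	while (!result && top) {
		int cur = stack[--top];
		int i;

		for (i = 0; i < 2; i++) {
			int p = c[cur].parents[i];

			if (p == NO_PARENT || c[p].in_graph || c[p].marked)
				continue;

			c[p].marked = 1;
			if (++*count >= limit) {
				result = 1; /* enough commits: stop early */
				break;
			}
			stack[top++] = p;
		}
	}

	free(stack);
	return result;
}

/* Zero disables the task, a negative limit forces it, otherwise count. */
static int should_write(struct toy_commit *c, size_t nr,
			const int *tips, size_t nr_tips, int limit)
{
	int count = 0;
	size_t i;

	if (!limit)
		return 0;
	if (limit < 0)
		return 1;

	for (i = 0; i < nr_tips; i++)
		if (count_from_tip(c, nr, tips[i], limit, &count))
			return 1;

	return 0;
}
```

Because a commit that is already in the commit-graph is never pushed
onto the stack, the walk stops at the boundary of the existing file,
which is why the check is cheap when most history is already covered.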

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/maintenance.txt | 10 ++++
 builtin/gc.c                         | 76 ++++++++++++++++++++++++++++
 object.h                             |  1 +
 3 files changed, 87 insertions(+)

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index 370cbfb42f..9bd69b9df3 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -2,3 +2,13 @@ maintenance.<task>.enabled::
 	This boolean config option controls whether the maintenance task
 	with name `<task>` is run when no `--task` option is specified.
 	By default, only `maintenance.gc.enabled` is true.
+
+maintenance.commit-graph.auto::
+	This integer config option controls how often the `commit-graph` task
+	should be run as part of `git maintenance run --auto`. If zero, then
+	the `commit-graph` task will not run with the `--auto` option. A
+	negative value will force the task to run every time. Otherwise, a
+	positive value implies the command should run when the number of
+	reachable commits that are not in the commit-graph file is at least
+	the value of `maintenance.commit-graph.auto`. The default value is
+	100.
diff --git a/builtin/gc.c b/builtin/gc.c
index 31696a2595..84ad360d17 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -30,6 +30,7 @@
 #include "promisor-remote.h"
 #include "remote.h"
 #include "midx.h"
+#include "refs.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -715,6 +716,80 @@ static struct maintenance_opts {
 	int tasks_selected;
 } opts;
 
+/* Remember to update object flag allocation in object.h */
+#define PARENT1		(1u<<16)
+
+static int num_commits_not_in_graph = 0;
+static int limit_commits_not_in_graph = 100;
+
+static int dfs_on_ref(const char *refname,
+		      const struct object_id *oid, int flags,
+		      void *cb_data)
+{
+	int result = 0;
+	struct object_id peeled;
+	struct commit_list *stack = NULL;
+	struct commit *commit;
+
+	if (!peel_ref(refname, &peeled))
+		oid = &peeled;
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	commit = lookup_commit(the_repository, oid);
+	if (!commit)
+		return 0;
+	if (parse_commit(commit))
+		return 0;
+
+	commit_list_append(commit, &stack);
+
+	while (!result && stack) {
+		struct commit_list *parent;
+
+		commit = pop_commit(&stack);
+
+		for (parent = commit->parents; parent; parent = parent->next) {
+			if (parse_commit(parent->item) ||
+			    commit_graph_position(parent->item) != COMMIT_NOT_FROM_GRAPH ||
+			    parent->item->object.flags & PARENT1)
+				continue;
+
+			parent->item->object.flags |= PARENT1;
+			num_commits_not_in_graph++;
+
+			if (num_commits_not_in_graph >= limit_commits_not_in_graph) {
+				result = 1;
+				break;
+			}
+
+			commit_list_append(parent->item, &stack);
+		}
+	}
+
+	free_commit_list(stack);
+	return result;
+}
+
+static int should_write_commit_graph(void)
+{
+	int result;
+
+	git_config_get_int("maintenance.commit-graph.auto",
+			   &limit_commits_not_in_graph);
+
+	if (!limit_commits_not_in_graph)
+		return 0;
+	if (limit_commits_not_in_graph < 0)
+		return 1;
+
+	result = for_each_ref(dfs_on_ref, NULL);
+
+	clear_commit_marks_all(PARENT1);
+
+	return result;
+}
+
 static int run_write_commit_graph(void)
 {
 	int result;
@@ -1224,6 +1299,7 @@ static void initialize_tasks(void)
 
 	tasks[num_tasks]->name = "commit-graph";
 	tasks[num_tasks]->fn = maintenance_task_commit_graph;
+	tasks[num_tasks]->auto_condition = should_write_commit_graph;
 	num_tasks++;
 
 	for (i = 0; i < num_tasks; i++) {
diff --git a/object.h b/object.h
index 38dc2d5a6c..4f886495d7 100644
--- a/object.h
+++ b/object.h
@@ -73,6 +73,7 @@ struct object_array {
  * list-objects-filter.c:                                      21
  * builtin/fsck.c:           0--3
  * builtin/index-pack.c:                                     2021
+ * builtin/maintenance.c:                           16
  * builtin/pack-objects.c:                                   20
  * builtin/reflog.c:                   10--12
  * builtin/show-branch.c:    0-------------------------------------------26
-- 
gitgitgadget



* [PATCH v2 15/18] maintenance: create auto condition for loose-objects
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (13 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 14/18] maintenance: add auto condition for commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 16/18] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The loose-objects task deletes loose objects that already exist in a
pack-file, then places the remaining loose objects into a new pack-file.
If this step runs all the time, then we risk creating pack-files with
very few objects with every 'git commit' process. To prevent
overwhelming the packs directory with small pack-files, require a
minimum number of loose objects before running the task.

The 'maintenance.loose-objects.auto' config option specifies the minimum
number of loose objects required for the task to run under the '--auto'
option. This defaults to 100 loose objects. Setting the value to zero
will prevent the step from running under '--auto' while a negative value
will force it to run every time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
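
Not part of the commit: a minimal sketch of the zero/negative/positive
semantics described above, together with the "stop counting once the
limit is reached" idea. Both helper names (auto_condition_met and
count_up_to) are hypothetical and do not exist in builtin/gc.c; 'total'
merely stands in for the number of loose objects found on disk.

```c
/*
 * Hypothetical helper, not part of the patch: map the value of a
 * 'maintenance.<task>.auto' option and an observed count to a
 * run (1) / skip (0) decision.
 */
static int auto_condition_met(int limit, int count)
{
	if (!limit)
		return 0;	/* zero: never run under --auto */
	if (limit < 0)
		return 1;	/* negative: always run under --auto */
	return count >= limit;	/* positive: run once the limit is hit */
}

/*
 * Count items, but stop as soon as the limit is reached so that a
 * large object directory is never scanned in full.
 */
static int count_up_to(int total, int limit)
{
	int count = 0;

	while (count < total && count < limit)
		count++;
	return count;
}
```

With the default limit of 100, auto_condition_met(100, count_up_to(n, 100))
is expected to be true only when n is at least 100.
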
 Documentation/config/maintenance.txt |  9 +++++++++
 builtin/gc.c                         | 30 ++++++++++++++++++++++++++++
 t/t7900-maintenance.sh               | 25 +++++++++++++++++++++++
 3 files changed, 64 insertions(+)

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index 9bd69b9df3..a9442dd260 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -12,3 +12,12 @@ maintenance.commit-graph.auto::
 	reachable commits that are not in the commit-graph file is at least
 	the value of `maintenance.commit-graph.auto`. The default value is
 	100.
+
+maintenance.loose-objects.auto::
+	This integer config option controls how often the `loose-objects` task
+	should be run as part of `git maintenance run --auto`. If zero, then
+	the `loose-objects` task will not run with the `--auto` option. A
+	negative value will force the task to run every time. Otherwise, a
+	positive value implies the command should run when the number of
+	loose objects is at least the value of `maintenance.loose-objects.auto`.
+	The default value is 100.
diff --git a/builtin/gc.c b/builtin/gc.c
index 84ad360d17..ae59a28203 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -951,6 +951,35 @@ struct write_loose_object_data {
 	int batch_size;
 };
 
+static int loose_object_auto_limit = 100;
+
+static int loose_object_count(const struct object_id *oid,
+			       const char *path,
+			       void *data)
+{
+	int *count = (int*)data;
+	if (++(*count) >= loose_object_auto_limit)
+		return 1;
+	return 0;
+}
+
+static int loose_object_auto_condition(void)
+{
+	int count = 0;
+
+	git_config_get_int("maintenance.loose-objects.auto",
+			   &loose_object_auto_limit);
+
+	if (!loose_object_auto_limit)
+		return 0;
+	if (loose_object_auto_limit < 0)
+		return 1;
+
+	return for_each_loose_file_in_objdir(the_repository->objects->odb->path,
+					     loose_object_count,
+					     NULL, NULL, &count);
+}
+
 static int loose_object_exists(const struct object_id *oid,
 			       const char *path,
 			       void *data)
@@ -1285,6 +1314,7 @@ static void initialize_tasks(void)
 
 	tasks[num_tasks]->name = "loose-objects";
 	tasks[num_tasks]->fn = maintenance_task_loose_objects;
+	tasks[num_tasks]->auto_condition = loose_object_auto_condition;
 	num_tasks++;
 
 	tasks[num_tasks]->name = "incremental-repack";
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 373b8dbe04..e4244d7c3c 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -115,6 +115,31 @@ test_expect_success 'loose-objects task' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'maintenance.loose-objects.auto' '
+	git repack -adk &&
+	GIT_TRACE2_EVENT="$(pwd)/trace-lo1.txt" \
+		git -c maintenance.loose-objects.auto=1 maintenance \
+		run --auto --task=loose-objects &&
+	! grep "\"prune-packed\"" trace-lo1.txt &&
+	for i in 1 2
+	do
+		printf data-A-$i | git hash-object -t blob --stdin -w &&
+		GIT_TRACE2_EVENT="$(pwd)/trace-loA-$i" \
+			git -c maintenance.loose-objects.auto=2 \
+			maintenance run --auto --task=loose-objects &&
+		! grep "\"prune-packed\"" trace-loA-$i &&
+		printf data-B-$i | git hash-object -t blob --stdin -w &&
+		GIT_TRACE2_EVENT="$(pwd)/trace-loB-$i" \
+			git -c maintenance.loose-objects.auto=2 \
+			maintenance run --auto --task=loose-objects &&
+		grep "\"prune-packed\"" trace-loB-$i &&
+		GIT_TRACE2_EVENT="$(pwd)/trace-loC-$i" \
+			git -c maintenance.loose-objects.auto=2 \
+			maintenance run --auto --task=loose-objects &&
+		grep "\"prune-packed\"" trace-loC-$i || return 1
+	done
+'
+
 test_expect_success 'incremental-repack task' '
 	packDir=.git/objects/pack &&
 	for i in $(test_seq 1 5)
-- 
gitgitgadget



* [PATCH v2 16/18] maintenance: add incremental-repack auto condition
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (14 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 15/18] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 17/18] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The incremental-repack task updates the multi-pack-index by deleting
pack-files that have been replaced with new packs, then repacking a batch
of small pack-files into a larger pack-file. This incremental repack is
faster than rewriting all object data, but is slower than some other
maintenance activities.

The 'maintenance.incremental-repack.auto' config option specifies how many
pack-files should exist outside of the multi-pack-index before running
the step. These pack-files could be created by 'git fetch' commands or
by the loose-objects task. The default value is 10.

Setting the option to zero disables the task with the '--auto' option,
and a negative value makes the task run every time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
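
Not part of the commit: a rough sketch of the bounded list walk that
the auto condition relies on. 'struct pack_entry' is a hypothetical
two-field stand-in for 'struct packed_git', and the function name is
made up for illustration; the core.multiPackIndex check from the real
patch is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for 'struct packed_git'; only two fields. */
struct pack_entry {
	struct pack_entry *next;
	int in_midx;	/* non-zero if covered by the multi-pack-index */
};

/*
 * Return 1 if at least 'limit' packs are outside the multi-pack-index.
 * The walk stops as soon as the answer is known.
 */
static int enough_unindexed_packs(const struct pack_entry *p, int limit)
{
	int count = 0;

	if (limit <= 0)
		return limit < 0;	/* 0 disables, negative always runs */

	for (; p && count < limit; p = p->next)
		if (!p->in_midx)
			count++;

	return count >= limit;
}
```

Stopping at the limit keeps the check cheap even when a repository has
accumulated a large number of packs.
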
 Documentation/config/maintenance.txt |  9 ++++++++
 builtin/gc.c                         | 31 ++++++++++++++++++++++++++++
 t/t7900-maintenance.sh               | 30 +++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index a9442dd260..22229e7174 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -21,3 +21,12 @@ maintenance.loose-objects.auto::
 	positive value implies the command should run when the number of
 	loose objects is at least the value of `maintenance.loose-objects.auto`.
 	The default value is 100.
+
+maintenance.incremental-repack.auto::
+	This integer config option controls how often the `incremental-repack`
+	task should be run as part of `git maintenance run --auto`. If zero,
+	then the `incremental-repack` task will not run with the `--auto`
+	option. A negative value will force the task to run every time.
+	Otherwise, a positive value implies the command should run when the
+	number of pack-files not in the multi-pack-index is at least the value
+	of `maintenance.incremental-repack.auto`. The default value is 10.
diff --git a/builtin/gc.c b/builtin/gc.c
index ae59a28203..b040c7d31d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -31,6 +31,7 @@
 #include "remote.h"
 #include "midx.h"
 #include "refs.h"
+#include "object-store.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -1063,6 +1064,35 @@ static int maintenance_task_loose_objects(void)
 	return prune_packed() || pack_loose();
 }
 
+static int incremental_repack_auto_condition(void)
+{
+	struct packed_git *p;
+	int enabled;
+	int incremental_repack_auto_limit = 10;
+	int count = 0;
+
+	if (git_config_get_bool("core.multiPackIndex", &enabled) ||
+	    !enabled)
+		return 0;
+
+	git_config_get_int("maintenance.incremental-repack.auto",
+			   &incremental_repack_auto_limit);
+
+	if (!incremental_repack_auto_limit)
+		return 0;
+	if (incremental_repack_auto_limit < 0)
+		return 1;
+
+	for (p = get_packed_git(the_repository);
+	     count < incremental_repack_auto_limit && p;
+	     p = p->next) {
+		if (!p->multi_pack_index)
+			count++;
+	}
+
+	return count >= incremental_repack_auto_limit;
+}
+
 static int multi_pack_index_write(void)
 {
 	int result;
@@ -1319,6 +1349,7 @@ static void initialize_tasks(void)
 
 	tasks[num_tasks]->name = "incremental-repack";
 	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
+	tasks[num_tasks]->auto_condition = incremental_repack_auto_condition;
 	num_tasks++;
 
 	tasks[num_tasks]->name = "gc";
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index e4244d7c3c..0b29674805 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -178,4 +178,34 @@ test_expect_success 'incremental-repack task' '
 	test_line_count = 2 packs-after
 '
 
+test_expect_success 'maintenance.incremental-repack.auto' '
+	git repack -adk &&
+	git config core.multiPackIndex true &&
+	git multi-pack-index write &&
+	GIT_TRACE2_EVENT=1 git -c maintenance.incremental-repack.auto=1 \
+		maintenance run --auto --task=incremental-repack 2>out &&
+	! grep "\"multi-pack-index\"" out &&
+	for i in 1 2
+	do
+		test_commit A-$i &&
+		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
+		HEAD
+		^HEAD~1
+		EOF
+		GIT_TRACE2_EVENT=$(pwd)/trace-A-$i git \
+			-c maintenance.incremental-repack.auto=2 \
+			maintenance run --auto --task=incremental-repack &&
+		! grep "\"multi-pack-index\"" trace-A-$i &&
+		test_commit B-$i &&
+		git pack-objects --revs .git/objects/pack/pack <<-\EOF &&
+		HEAD
+		^HEAD~1
+		EOF
+		GIT_TRACE2_EVENT=$(pwd)/trace-B-$i git \
+			-c maintenance.incremental-repack.auto=2 \
+			maintenance run --auto --task=incremental-repack >out &&
+		grep "\"multi-pack-index\"" trace-B-$i >/dev/null || return 1
+	done
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 17/18] midx: use start_delayed_progress()
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (15 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 16/18] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-23 17:56   ` [PATCH v2 18/18] maintenance: add trace2 regions for task execution Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Now that the multi-pack-index may be written as part of auto maintenance
at the end of a command, reduce the progress output when the operations
are quick. Use start_delayed_progress() instead of start_progress().

Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that
the progress indicators are conditional.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 10 +++++-----
 t/t5319-multi-pack-index.sh | 14 +++++++-------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/midx.c b/midx.c
index 57a8a00082..d4022e4aef 100644
--- a/midx.c
+++ b/midx.c
@@ -837,7 +837,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	packs.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_progress(_("Adding packfiles to multi-pack-index"), 0);
+		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
 		packs.progress = NULL;
 
@@ -974,7 +974,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Writing chunks to multi-pack-index"),
+		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (written != chunk_offsets[i])
@@ -1109,7 +1109,7 @@ int verify_midx_file(struct repository *r, const char *object_dir, unsigned flag
 		return 0;
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Looking for referenced packfiles"),
+		progress = start_delayed_progress(_("Looking for referenced packfiles"),
 					  m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		if (prepare_midx_pack(r, m, i))
@@ -1230,7 +1230,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	count = xcalloc(m->num_packs, sizeof(uint32_t));
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Counting referenced objects"),
+		progress = start_delayed_progress(_("Counting referenced objects"),
 					  m->num_objects);
 	for (i = 0; i < m->num_objects; i++) {
 		int pack_int_id = nth_midxed_pack_int_id(m, i);
@@ -1240,7 +1240,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	stop_progress(&progress);
 
 	if (flags & MIDX_PROGRESS)
-		progress = start_progress(_("Finding and deleting unreferenced packfiles"),
+		progress = start_delayed_progress(_("Finding and deleting unreferenced packfiles"),
 					  m->num_packs);
 	for (i = 0; i < m->num_packs; i++) {
 		char *pack_name;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 7214cab36c..12f41dfc18 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -172,12 +172,12 @@ test_expect_success 'write progress off for redirected stderr' '
 '
 
 test_expect_success 'write force progress on for stderr' '
-	git multi-pack-index --object-dir=$objdir --progress write 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --progress write 2>err &&
 	test_file_not_empty err
 '
 
 test_expect_success 'write with the --no-progress option' '
-	git multi-pack-index --object-dir=$objdir --no-progress write 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --no-progress write 2>err &&
 	test_line_count = 0 err
 '
 
@@ -334,17 +334,17 @@ test_expect_success 'git-fsck incorrect offset' '
 '
 
 test_expect_success 'repack progress off for redirected stderr' '
-	git multi-pack-index --object-dir=$objdir repack 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir repack 2>err &&
 	test_line_count = 0 err
 '
 
 test_expect_success 'repack force progress on for stderr' '
-	git multi-pack-index --object-dir=$objdir --progress repack 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --progress repack 2>err &&
 	test_file_not_empty err
 '
 
 test_expect_success 'repack with the --no-progress option' '
-	git multi-pack-index --object-dir=$objdir --no-progress repack 2>err &&
+	GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir --no-progress repack 2>err &&
 	test_line_count = 0 err
 '
 
@@ -488,7 +488,7 @@ test_expect_success 'expire progress off for redirected stderr' '
 test_expect_success 'expire force progress on for stderr' '
 	(
 		cd dup &&
-		git multi-pack-index --progress expire 2>err &&
+		GIT_PROGRESS_DELAY=0 git multi-pack-index --progress expire 2>err &&
 		test_file_not_empty err
 	)
 '
@@ -496,7 +496,7 @@ test_expect_success 'expire force progress on for stderr' '
 test_expect_success 'expire with the --no-progress option' '
 	(
 		cd dup &&
-		git multi-pack-index --no-progress expire 2>err &&
+		GIT_PROGRESS_DELAY=0 git multi-pack-index --no-progress expire 2>err &&
 		test_line_count = 0 err
 	)
 '
-- 
gitgitgadget



* [PATCH v2 18/18] maintenance: add trace2 regions for task execution
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (16 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 17/18] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
@ 2020-07-23 17:56   ` Derrick Stolee via GitGitGadget
  2020-07-29 22:03   ` [PATCH v2 00/18] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
  19 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-23 17:56 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index b040c7d31d..7d9e6c34b7 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -1322,7 +1322,9 @@ static int maintenance_run(void)
 		     !tasks[i]->auto_condition()))
 			continue;
 
+		trace2_region_enter("maintenance", tasks[i]->name, r);
 		result = tasks[i]->fn();
+		trace2_region_leave("maintenance", tasks[i]->name, r);
 	}
 
 	rollback_lock_file(&lk);
-- 
gitgitgadget


* Re: [PATCH v2 04/18] maintenance: initialize task array
  2020-07-23 17:56   ` [PATCH v2 04/18] maintenance: initialize task array Derrick Stolee via GitGitGadget
@ 2020-07-23 19:57     ` Junio C Hamano
  2020-07-24 12:23       ` Derrick Stolee
  2020-07-29 22:19     ` Emily Shaffer
  1 sibling, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 19:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static void initialize_tasks(void)
> +{
> +	int i;
> +	num_tasks = 0;
> +
> +	for (i = 0; i < MAX_NUM_TASKS; i++)
> +		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
> +
> +	tasks[num_tasks]->name = "gc";
> +	tasks[num_tasks]->fn = maintenance_task_gc;
> +	tasks[num_tasks]->enabled = 1;
> +	num_tasks++;

Are we going to have 47 different tasks initialized by code like
this in the future?  I would have expected that you'd have a table
of tasks that serves as the blueprint copy and copy it to the table
to be used if there is some need to mutate the table-to-be-used.
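
A minimal sketch of what such a blueprint table could look like. This
is only an illustration under assumptions: the struct is trimmed to
three fields, the task bodies are placeholders, and the names
task_blueprint and NUM_TASKS are hypothetical rather than taken from
the series.

```c
#include <stddef.h>
#include <string.h>

typedef int maintenance_task_fn(void);

struct maintenance_task {
	const char *name;
	maintenance_task_fn *fn;
	unsigned enabled:1;
};

/* Placeholder task bodies; the real ones live in builtin/gc.c. */
static int task_gc(void) { return 0; }
static int task_commit_graph(void) { return 0; }

/* Blueprint: one row per task, never modified at run time. */
static const struct maintenance_task task_blueprint[] = {
	{ "gc", task_gc, 1 },
	{ "commit-graph", task_commit_graph, 0 },
};

#define NUM_TASKS (sizeof(task_blueprint) / sizeof(task_blueprint[0]))

/* Working copy that option parsing and config are free to mutate. */
static struct maintenance_task tasks[NUM_TASKS];

static void initialize_tasks(void)
{
	memcpy(tasks, task_blueprint, sizeof(task_blueprint));
}
```

Adding a task then becomes a one-line change to the table, and the
number of tasks follows from the table size instead of a separate
counter.
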

>  }
>  
>  int cmd_maintenance(int argc, const char **argv, const char *prefix)
> @@ -751,6 +787,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
>  				   builtin_maintenance_options);
>  
>  	opts.quiet = !isatty(2);
> +	initialize_tasks();
>  
>  	argc = parse_options(argc, argv, prefix,
>  			     builtin_maintenance_options,


* Re: [PATCH v2 06/18] maintenance: add --task option
  2020-07-23 17:56   ` [PATCH v2 06/18] maintenance: add --task option Derrick Stolee via GitGitGadget
@ 2020-07-23 20:21     ` Junio C Hamano
  2020-07-23 22:18       ` Junio C Hamano
  2020-07-24 13:36       ` Derrick Stolee
  0 siblings, 2 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 20:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 35b0be7d40..9204762e21 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -73,6 +73,10 @@ OPTIONS
>  --quiet::
>  	Do not report progress or other information over `stderr`.
>  
> +--task=<task>::
> +	If this option is specified one or more times, then only run the
> +	specified tasks in the specified order.
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 2cd17398ec..c58dea6fa5 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -710,6 +710,7 @@ static const char * const builtin_maintenance_usage[] = {
>  static struct maintenance_opts {
>  	int auto_flag;
>  	int quiet;
> +	int tasks_selected;
>  } opts;
>  
>  static int run_write_commit_graph(void)
> @@ -804,20 +805,38 @@ typedef int maintenance_task_fn(void);
>  struct maintenance_task {
>  	const char *name;
>  	maintenance_task_fn *fn;
> -	unsigned enabled:1;
> +	int task_order;
> +	unsigned enabled:1,
> +		 selected:1;
>  };
>  
>  static struct maintenance_task *tasks[MAX_NUM_TASKS];
>  static int num_tasks;
>  
> +static int compare_tasks_by_selection(const void *a_, const void *b_)
> +{
> +	const struct maintenance_task *a, *b;
> +	a = (const struct maintenance_task *)a_;
> +	b = (const struct maintenance_task *)b_;
> +
> +	return b->task_order - a->task_order;
> +}

It forces the reader to know intimately that task_order *is*
selection order in order to understand why this is "tasks by
selection".  Perhaps rename the field to match what it is
(i.e. "selection_order")?

>  static int maintenance_run(void)
>  {
>  	int i;
>  	int result = 0;
>  
> +	if (opts.tasks_selected)
> +		QSORT(tasks, num_tasks, compare_tasks_by_selection);
> +
>  	for (i = 0; !result && i < num_tasks; i++) {
> -		if (!tasks[i]->enabled)
> +		if (opts.tasks_selected && !tasks[i]->selected)
> +			continue;
> +
> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>  			continue;

I am not sure about this.  Even if the task <x> is disabled, if the
user says --task=<x>, it is run anyway?  Doesn't make an immediate
sense to me.

As I am bad at deciphering de Morgan, I'd find it easier to read if
the first were written more like so:

		if (!(!opts.tasks_selected || tasks[i]->selected))
			continue;

That is, "do not skip any when not limited, and do not skip the ones
that are selected when limited".  And that would easily extend to

		if (!tasks[i]->enabled ||
		    !(!opts.tasks_selected || tasks[i]->selected))
			continue;
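
To make the difference concrete, here is a small sketch that compares
the skip condition as posted with the combined form above over all
eight input combinations. The function names are hypothetical and the
three flags are passed as plain ints instead of being read from 'opts'
and 'tasks[i]'.

```c
/* Skip condition from the patch: two separate tests. */
static int skip_as_posted(int tasks_selected, int selected, int enabled)
{
	if (tasks_selected && !selected)
		return 1;
	if (!tasks_selected && !enabled)
		return 1;
	return 0;
}

/* Skip condition in the combined form suggested above. */
static int skip_as_suggested(int tasks_selected, int selected, int enabled)
{
	return !enabled || !(!tasks_selected || selected);
}

/* Count the input combinations on which the two forms disagree. */
static int count_disagreements(void)
{
	int t, s, e, n = 0;

	for (t = 0; t <= 1; t++)
		for (s = 0; s <= 1; s++)
			for (e = 0; e <= 1; e++)
				if (skip_as_posted(t, s, e) !=
				    skip_as_suggested(t, s, e))
					n++;
	return n;
}
```

The two forms disagree in exactly one case: a task that is selected
with --task=<x> but not enabled. The posted code runs it, while the
combined form skips it.
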
> +
>  		result = tasks[i]->fn();
>  	}

> @@ -842,6 +861,44 @@ static void initialize_tasks(void)
>  	num_tasks++;
>  }
>  
> +static int task_option_parse(const struct option *opt,
> +			     const char *arg, int unset)
> +{
> +	int i;
> +	struct maintenance_task *task = NULL;
> +
> +	BUG_ON_OPT_NEG(unset);
> +
> +	if (!arg || !strlen(arg)) {
> +		error(_("--task requires a value"));
> +		return 1;

There is no need to special case an empty string that was explicitly
given as the value---it will be caught as "'' is not a valid task".

> +	}
> +
> +	opts.tasks_selected++;
> +
> +	for (i = 0; i < MAX_NUM_TASKS; i++) {
> +		if (tasks[i] && !strcasecmp(tasks[i]->name, arg)) {
> +			task = tasks[i];
> +			break;
> +		}
> +	}
> +
> +	if (!task) {
> +		error(_("'%s' is not a valid task"), arg);
> +		return 1;
> +	}
> +
> +	if (task->selected) {
> +		error(_("task '%s' cannot be selected multiple times"), arg);
> +		return 1;
> +	}
> +
> +	task->selected = 1;
> +	task->task_order = opts.tasks_selected;
> +
> +	return 0;
> +}


* Re: [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-23 17:56   ` [PATCH v2 03/18] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
@ 2020-07-23 20:21     ` Junio C Hamano
  2020-07-25  1:33       ` Taylor Blau
  2020-07-30 13:29       ` Derrick Stolee
  0 siblings, 2 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 20:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +--[no-]maintenance::
>  --[no-]auto-gc::
> -	Run `git gc --auto` at the end to perform garbage collection
> -	if needed. This is enabled by default.
> +	Run `git maintenance run --auto` at the end to perform garbage
> +	collection if needed. This is enabled by default.

Shouldn't the new synonym be called --auto-maintenance or an
abbreviation thereof?  It is not like we will run the full
maintenance suite when "--no-maintenance" is omitted, which
certainly is not the impression we want to give our readers.

>  These objects may be removed by normal Git operations (such as `git commit`)
> -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
> -If these objects are removed and were referenced by the cloned repository,
> -then the cloned repository will become corrupt.
> +which automatically call `git maintenance run --auto` and `git gc --auto`.

Hmph.  Perhaps the picture may change in the end of the series but I
got an impression that "gc --auto" would eventually become just part
of "maintenance --auto" and the users won't have to be even aware of
its existence?  Wouldn't we rather want to say something like

	--[no-]auto-maintenance::
	--[no-]auto-gc::
                Run `git maintenance run --auto` at the end to perform
                garbage collection if needed (`--[no-]auto-gc` is a
                synonym).  This is enabled by default.
	
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 82ac4be8a5..49a4d727d4 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
>  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
>  			N_("report that we have only objects reachable from this object")),
>  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),

> +	OPT_BOOL(0, "maintenance", &enable_auto_gc,
> +		 N_("run 'maintenance --auto' after fetching")),
>  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
> +		 N_("run 'maintenance --auto' after fetching")),

OK, so this is truly a backward-compatible synonym at this point.

> diff --git a/run-command.c b/run-command.c
> index 9b3a57d1e3..82ad241638 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
>  	return result;
>  }
>  
> -int run_auto_gc(int quiet)
> +int run_auto_maintenance(int quiet)
>  {
>  	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
>  	int status;
>  
> -	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
> +	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
>  	if (quiet)
>  		argv_array_push(&argv_gc_auto, "--quiet");
> +	else
> +		argv_array_push(&argv_gc_auto, "--no-quiet");
> +
>  	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
>  	argv_array_clear(&argv_gc_auto);
>  	return status;

Don't we want to replace all _gc_ with _maintenance_ in this
function?  I think the first business before we can do so would be
to rethink if spelling out "maintenance" fully in code is a good
idea in the first place.  It would make names for variables,
structures and fields unnecessarily long without contributing to
ease of understanding an iota, and an easy-to-remember short-form or
an abbreviation may be needed.  Using a short-form/abbreviation
wouldn't worsen the end-user experience, nor the developer
experience for that matter.

If we choose "gc" as the short-hand, most of the change in this step
would become unnecessary.  I also do not mind if some other word
or word-fragment (perhaps "maint"???) is chosen.

> diff --git a/run-command.h b/run-command.h
> index 191dfcdafe..d9a800e700 100644
> --- a/run-command.h
> +++ b/run-command.h
> @@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
>  /*
>   * Trigger an auto-gc
>   */
> -int run_auto_gc(int quiet);
> +int run_auto_maintenance(int quiet);
>  
>  #define RUN_COMMAND_NO_STDIN 1
>  #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
> diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> index a66dbe0bde..9850ecde5d 100755
> --- a/t/t5510-fetch.sh
> +++ b/t/t5510-fetch.sh
> @@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
>  		git config fetch.unpackLimit 1 &&
>  		git config gc.autoPackLimit 1 &&
>  		git config gc.autoDetach false &&
> -		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
> +		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
>  		test_i18ngrep "Auto packing the repository" fetch.out &&
>  		! grep "Should I try again" fetch.out
>  	)


* Re: [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-23 17:56   ` [PATCH v2 05/18] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-23 20:22     ` Junio C Hamano
  2020-07-24 13:09       ` Derrick Stolee
  2020-07-29  0:22     ` Jeff King
  1 sibling, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 20:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static int maintenance_task_commit_graph(void)
> +{
> +	struct repository *r = the_repository;
> +	char *chain_path;
> +
> +	/* Skip commit-graph when --auto is specified. */
> +	if (opts.auto_flag)
> +		return 0;

Stepping back a bit, back in "git gc" days, "--auto" had two
distinct meanings rolled into one.  Check if it even needs to be
done, and perform only the lightweight variant if needed.

For this task, no "lightweight variant" is possible, so
returning without checking the need to do a lightweight one makes
perfect sense here.

But wouldn't it suggest perhaps we could name "auto" field of the
options struct in a more meaningful way?  Perhaps "quick" (i.e. only
the quicker-variant of the maintenance job) or something?

> +	close_object_store(r->objects);
> +	if (run_write_commit_graph()) {
> +		error(_("failed to write commit-graph"));
> +		return 1;
> +	}
> +
> +	if (!run_verify_commit_graph())
> +		return 0;
> +
> +	warning(_("commit-graph verify caught error, rewriting"));
> +
> +	chain_path = get_commit_graph_chain_filename(r->objects->odb);
> +	if (unlink(chain_path)) {
> +		UNLEAK(chain_path);
> +		die(_("failed to remove commit-graph at %s"), chain_path);

OK.

> +	}
> +	free(chain_path);
> +
> +	if (!run_write_commit_graph())
> +		return 0;
> +
> +	error(_("failed to rewrite commit-graph"));
> +	return 1;
> +}

Error convention is "positive for error, zero for success?"  That is
a bit unusual for our internal API.

>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -768,6 +836,10 @@ static void initialize_tasks(void)
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;
>  	num_tasks++;
> +
> +	tasks[num_tasks]->name = "commit-graph";
> +	tasks[num_tasks]->fn = maintenance_task_commit_graph;
> +	num_tasks++;

Again, I am not sure if we want to keep piling on this.

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index e4e4036e50..216ac0b19e 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -12,7 +12,7 @@ test_expect_success 'help text' '
>  	test_i18ngrep "usage: git maintenance run" err
>  '
>  
> -test_expect_success 'gc [--auto|--quiet]' '
> +test_expect_success 'run [--auto|--quiet]' '

It does not look like this change belongs here.  If "run" is
appropriate title for this test at this step, it must have been so
in the previous step.

>  	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
>  	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
>  	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&


* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-23 17:56   ` [PATCH v2 08/18] maintenance: add prefetch task Derrick Stolee via GitGitGadget
@ 2020-07-23 20:53     ` Junio C Hamano
  2020-07-24 14:25       ` Derrick Stolee
  2020-07-25  1:37     ` Đoàn Trần Công Danh
  1 sibling, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 20:53 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 9204762e21..0927643247 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +prefetch::
> +	The `fetch` task updates the object directory with the latest objects

s/fetch/prefetch/ most likely.

> +	from all registered remotes. For each remote, a `git fetch` command
> +	is run. The refmap is custom to avoid updating local or remote

s/remote/remote-tracking/ definitely.  Do not forget the hyphen
between the two words.

I think it made the above unnecessarily confusing that you ended a
sentence after "is run".  It gives a wrong impression that you'd be
doing a "real fetch", which you need to dispel with a follow up
description of the refmap.

	For each remote, a `git fetch` command is run with a refspec
	to fetch their branches (those in their `refs/heads`) into
	our `refs/prefetch/<remote>/` hierarchy and without auto
	following tags (the configured refspec in the repository is
	ignored).

> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
> +	not updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are

s/is notified/are notified/???

> +updated on the remote.

Often, when one needs to say "X.  This means Y.", X is a suboptimal
way to explain what needs to be conveyed to the readers.  But this
is not such a case.  Unlike the "This means" that is often an
attempt to rephrase a poor explanation given first, this gives an
implication.

But let's not start with a negative impression (i.e. even with
prefetch, I still have to do X?  What's the point???), but let them
feel why it is a good thing.  Perhaps (continuing my previous
rewrite):

	This is done to avoid disrupting the remote-tracking
	branches---the end users expect them to stay unmoved unless
	they initiate a fetch.  With the prefetch task, however, the
	objects necessary to complete a later real fetch would
	already be obtained, so the real fetch would go faster.  In
	the ideal case, it will just become an update to a bunch of
	remote-tracking branches without any object transfer.

or something like that?  

> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);

The command line looks somewhat fishy, but I think it is correct.
At first glance it looks like a mistake to pass "--refmap=" and the
refspec "+refs/heads/*:refs/prefetch/origin/*" as separate arguments,
but I think that is exactly what you want here, i.e.

 - defeat any refspec in the configuration with "--refmap=<empty>"

 - give explicit refspec "+refs/heads/*:...", with "--no-tags" to
   decline auto-following, to tell what exactly are to be fetched
   and stored where.

The description in the log message about refmap needs to be
clarified, though (I've already done so in the above suggested
rewrite).
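
The resulting argument list can be sketched outside of Git as follows.
This is only an illustration under assumptions: build_prefetch_args()
and the caller-supplied buffers are hypothetical, whereas the patch
itself uses argv_array and strbuf.

```c
#include <stdio.h>

/*
 * Hypothetical helper: fill argv[] with the arguments of the fetch
 * command for one remote.  Returns the number of arguments, or -1 if
 * the refspec does not fit in the buffer or argv[] is too small.
 */
static int build_prefetch_args(const char *remote, char *refspec,
			       size_t refspec_len, const char **argv,
			       size_t argv_len)
{
	int n;

	if (argv_len < 7)
		return -1;

	n = snprintf(refspec, refspec_len,
		     "+refs/heads/*:refs/prefetch/%s/*", remote);
	if (n < 0 || (size_t)n >= refspec_len)
		return -1;

	argv[0] = "fetch";
	argv[1] = remote;
	argv[2] = "--prune";
	argv[3] = "--no-tags";	/* decline auto-following of tags */
	argv[4] = "--refmap=";	/* empty refmap: ignore configured refspecs */
	argv[5] = refspec;	/* explicit refspec: what to fetch, stored where */
	argv[6] = NULL;
	return 6;
}
```

For a remote named "origin", the last real argument comes out as
"+refs/heads/*:refs/prefetch/origin/*", a separate argument from
"--refmap=", as described above.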

> +static int maintenance_task_prefetch(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */

The loop that runs different tasks aborts at the first failure,
though.  Perhaps that loop needs to be rethought as well?
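
One shape a rethought loop could take is to run every task and remember
whether any of them failed, instead of stopping at the first failure.
A minimal sketch under that assumption; run_all_tasks() and the stub
tasks are hypothetical and not taken from the series.

```c
#include <stddef.h>

typedef int task_fn(void);

static int calls;	/* counts how many stub tasks actually ran */

static int task_ok(void)
{
	calls++;
	return 0;
}

static int task_fail(void)
{
	calls++;
	return 1;
}

/* Run every task; report failure if at least one task failed. */
static int run_all_tasks(task_fn **fns, size_t nr)
{
	int result = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (fns[i]())
			result = 1;
	return result;
}
```

Whether a failed task should still prevent later ones that depend on it
is a separate question that this sketch does not try to answer.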

> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>  
> +	tasks[num_tasks]->name = "prefetch";
> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;

Two things.

 - As I said upfront, I do not see the point of preparing the table
   with code.

 - The reason why prefetch is placed in front is probably because
   you do not want to repack before you add more objects to the
   object store.  But doesn't that imply that there is an inherent
   ordering that we, as those who are more expert on Git than the
   end users, prefer?  Is it a wise decision to let the users affect
   the order of the tasks run by giving command line options in
   different order in the previous step?



* Re: [PATCH v2 09/18] maintenance: add loose-objects task
  2020-07-23 17:56   ` [PATCH v2 09/18] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
@ 2020-07-23 20:59     ` Junio C Hamano
  2020-07-24 14:50       ` Derrick Stolee
  2020-07-29 22:21     ` Emily Shaffer
  1 sibling, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 20:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Create a 'loose-objects' task for the 'git maintenance run' command.
> This helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:
>
> 1. Run 'git prune-packed' to delete any loose objects that exist
>    in a pack-file. Concurrent commands will prefer the packed
>    version of the object to the loose version. (Of course, there
>    are exceptions for commands that specifically care about the
>    location of an object. These are rare for a user to run on
>    purpose, and we hope a user that has selected background
>    maintenance will not be trying to do foreground maintenance.)

OK.  That would make sense.

> 2. Run 'git pack-objects' on a batch of loose objects. These
>    objects are grouped by scanning the loose object directories in
>    lexicographic order until listing all loose objects -or-
>    reaching 50,000 objects. This is more than enough if the loose
>    objects are created only by a user doing normal development.

I haven't seen this in action, but my gut feeling is that this would
result in horrible locality and deltification in the resulting
packfile.  The order you feed the objects to pack-objects and the
path hint you attach to each object matters quite a lot.

I do agree that it would be useful to have a task to deal with only
loose objects without touching existing packfiles.  I just am not
sure if 2. is a worthwhile thing to do.  A poorly constructed pack
will also contaminate later packfiles made without "-f" option to
"git repack".



* Re: [PATCH v2 10/18] maintenance: add incremental-repack task
  2020-07-23 17:56   ` [PATCH v2 10/18] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
@ 2020-07-23 22:00     ` Junio C Hamano
  2020-07-24 15:03       ` Derrick Stolee
  2020-07-29 22:22     ` Emily Shaffer
  1 sibling, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 22:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
>
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."

An obvious alternative is to favor the copy in the older pack,
right?  Is the expectation that over time, most of the objects that
are relevant would reappear in newer packs, so that eventually by
favoring the copies in the newer packs, we can retire and remove the
old pack, keeping only the newer ones?

But would that assumption hold?  The old packs hold objects that are
necessary for the older parts of the history, so unless you are
cauterizing away the old history, these objects in the older packs
are likely to stay with us longer than those used by the newer parts
of the history, some of which may not even have been pushed out yet
and can be rebased away?

> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updates the multi-pack-index to drop those pack-files from the
>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git
>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)

OK.

> 3. 'git multi-pack-index repack --batch-size=<size>' collects a set
>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those pack-files. The set of
>    pack-files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

approximates.

>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.
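
The selection rule in step 3 can be restated as a small sketch.  The
names (struct pack_info, expected_size(), select_batch()) are
hypothetical, the array is assumed to be sorted by modification time
already, and the multiplication is done before the division only to
avoid losing precision to integer division; the actual logic lives in
'git multi-pack-index repack'.

```c
#include <stddef.h>
#include <stdint.h>

struct pack_info {
	uint64_t pack_size;	/* size of the pack-file in bytes */
	uint64_t num_objects;	/* objects stored in the pack-file */
	uint64_t referenced;	/* objects the multi-pack-index points at here */
	int selected;		/* set to 1 when chosen for the batch */
};

/* Estimate how many bytes this pack contributes to the new pack. */
static uint64_t expected_size(const struct pack_info *p)
{
	if (!p->num_objects)
		return 0;
	return p->pack_size * p->referenced / p->num_objects;
}

/*
 * Greedy selection: take each pack whose expected size is below the
 * batch size, and stop once the running total reaches the batch size.
 * Returns the number of packs selected.
 */
static size_t select_batch(struct pack_info *packs, size_t nr,
			   uint64_t batch_size)
{
	uint64_t total = 0;
	size_t i, count = 0;

	for (i = 0; i < nr && total < batch_size; i++) {
		uint64_t size = expected_size(&packs[i]);

		if (size >= batch_size)
			continue;
		packs[i].selected = 1;
		total += size;
		count++;
	}
	return count;
}
```

Because the loop stops only after the total has reached the batch size,
the combined expected size can overshoot it by up to one pack's worth.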

> +static int maintenance_task_incremental_repack(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}

Hmph, I wonder if these warnings should come from each of the helper
functions that are static to this function anyway.

It also makes it easier to reason about this function by eliminating
the need for having a different pattern only for the verify helper.
Instead, verify could call rewrite internally when it notices a
breakage.  I.e.

	if (multi_pack_index_write())
		return 1;
	if (multi_pack_index_verify("after initial write"))
		return 1;
	if (multi_pack_index_expire())
		return 1;
	...
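
To make that shape concrete, here is a sketch in which the verify
helper owns both the warning and the rewrite.  Everything in it is
hypothetical: verify_or_rewrite() is an invented name, the stubs stand
in for the real subcommand invocations, and the verify_ok / rewrite_ok
flags exist only so that the sketch can be exercised.

```c
#include <stdio.h>

static int verify_ok = 1;	/* stub state: does verification pass? */
static int rewrite_ok = 1;	/* stub state: does a rewrite succeed? */
static int rewrites;		/* number of rewrites attempted */

static int run_verify_stub(void)
{
	return verify_ok ? 0 : 1;
}

static int rewrite_stub(void)
{
	rewrites++;
	return rewrite_ok ? 0 : 1;
}

/*
 * Verify; on breakage, warn and rewrite internally.  The caller sees
 * non-zero only when the rewrite fails as well.
 */
static int verify_or_rewrite(const char *when)
{
	if (!run_verify_stub())
		return 0;
	fprintf(stderr, "warning: multi-pack-index verify failed %s\n", when);
	return rewrite_stub();
}
```

Note that, unlike the code under review, a successful rewrite here lets
the caller continue with its remaining steps instead of returning early.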

Also, it feels odd, compared to our internal API convention, that
positive non-zero is used as an error here.

> +	return 0;
> +}
> +
>  typedef int maintenance_task_fn(void);
>  
>  struct maintenance_task {
> @@ -1037,6 +1152,10 @@ static void initialize_tasks(void)
>  	tasks[num_tasks]->fn = maintenance_task_loose_objects;
>  	num_tasks++;
>  
> +	tasks[num_tasks]->name = "incremental-repack";
> +	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;

Exactly the same comment as 08/18 about natural/inherent ordering
applies here as well.

Thanks.


* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 17:56   ` [PATCH v2 11/18] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
@ 2020-07-23 22:15     ` Junio C Hamano
  2020-07-23 23:09       ` Eric Sunshine
  2020-07-24 19:51       ` Derrick Stolee
  2020-07-29 22:23     ` Emily Shaffer
  1 sibling, 2 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 22:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When repacking during the 'incremental-repack' task, we use the
> --batch-size option in 'git multi-pack-index repack'. The initial setting
> used --batch-size=0 to repack everything into a single pack-file. This is
> not sustaintable for a large repository. The amount of work required is

sustainable.

> also likely to use too many system resources for a background job.
>
> Update the 'incremental-repack' task by dynamically computing a
> --batch-size option based on the current pack-file structure.

OK.

> The dynamic default size is computed with this idea in mind for a client
> repository that was cloned from a very large remote: there is likely one
> "big" pack-file that was created at clone time. Thus, do not try
> repacking it as it is likely packed efficiently by the server.
>
> Instead, we select the second-largest pack-file, and create a batch size
> that is one larger than that pack-file. If there are three or more
> pack-files, then this guarantees that at least two will be combined into
> a new pack-file.
>
> Of course, this means that the second-largest pack-file size is likely
> to grow over time and may eventually surpass the initially-cloned
> pack-file. Recall that the pack-file batch is selected in a greedy
> manner: the packs are considered from oldest to newest and are selected
> if they have size smaller than the batch size until the total selected
> size is larger than the batch size. Thus, that oldest "clone" pack will
> be first to repack after the new data creates a pack larger than that.
>
> We also want to place some limits on how large these pack-files become,
> in order to bound the amount of time spent repacking. A maximum
> batch-size of two gigabytes means that large repositories will never be
> packed into a single pack-file using this job, but also that repack is
> rather expensive. This is a trade-off that is valuable to have if the
> maintenance is being run automatically or in the background. Users who
> truly want to optimize for space and performance (and are willing to pay
> the upfront cost of a full repack) can use the 'gc' task to do so.

It might be too late to ask this now, but how is the quality of
the resulting combined pack ensured, wrt locality and deltification?

> +#define TWO_GIGABYTES (2147483647)
> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
> +
> +static off_t get_auto_pack_size(void)
> +{
> +	/*
> +	 * The "auto" value is special: we optimize for
> +	 * one large pack-file (i.e. from a clone) and
> +	 * expect the rest to be small and they can be
> +	 * repacked quickly.
> +	 *
> +	 * The strategy we select here is to select a
> +	 * size that is one more than the second largest
> +	 * pack-file. This ensures that we will repack
> +	 * at least two packs if there are three or more
> +	 * packs.
> +	 */
> +	off_t max_size = 0;
> +	off_t second_largest_size = 0;
> +	off_t result_size;
> +	struct packed_git *p;
> +	struct repository *r = the_repository;
> +
> +	reprepare_packed_git(r);
> +	for (p = get_all_packs(r); p; p = p->next) {
> +		if (p->pack_size > max_size) {
> +			second_largest_size = max_size;
> +			max_size = p->pack_size;
> +		} else if (p->pack_size > second_largest_size)
> +			second_largest_size = p->pack_size;
> +	}
> +
> +	result_size = second_largest_size + 1;

We won't worry about this addition wrapping around; I guess we
cannot do anything intelligent when it happens.

> +	/* But limit ourselves to a batch size of 2g */
> +	if (result_size > TWO_GIGABYTES)
> +		result_size = TWO_GIGABYTES;

Well, when it happens, we'd cap to 2G, which must be a reasonable
fallback value, so it would be OK.
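
As an aside on the wrap-around question, the cap can be checked before
the addition, so that the "+ 1" never overflows.  A standalone sketch,
assuming the pack sizes are supplied in a plain array; the name
auto_batch_size() and the BATCH_SIZE_CAP macro are hypothetical, and
the patch walks the packed_git list instead.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE_CAP INT64_C(2147483647)

/* One more than the second-largest size, capped at BATCH_SIZE_CAP. */
static int64_t auto_batch_size(const int64_t *sizes, size_t nr)
{
	int64_t max_size = 0, second_largest = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (sizes[i] > max_size) {
			second_largest = max_size;
			max_size = sizes[i];
		} else if (sizes[i] > second_largest) {
			second_largest = sizes[i];
		}
	}

	/* Compare before adding, so the increment cannot overflow. */
	if (second_largest >= BATCH_SIZE_CAP)
		return BATCH_SIZE_CAP;
	return second_largest + 1;
}
```

For any input, this yields the same value as adding first and capping
afterwards would when no overflow occurs.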

> +	return result_size;
> +}
> +
>  static int multi_pack_index_repack(void)
>  {
>  	int result;
>  	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf batch_arg = STRBUF_INIT;
> +
>  	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
>  
>  	if (opts.quiet)
>  		argv_array_push(&cmd, "--no-progress");
>  
> -	argv_array_push(&cmd, "--batch-size=0");
> +	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
> +		    (uintmax_t)get_auto_pack_size());
> +	argv_array_push(&cmd, batch_arg.buf);
>  
>  	close_object_store(the_repository->objects);
>  	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	strbuf_release(&batch_arg);

I think I saw a suggestion to use xstrfmt() with free()  instead of
the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
in a similar but different context.  Perhaps we should follow suit
here, too?


Thanks.  That's it for today from me.



* Re: [PATCH v2 06/18] maintenance: add --task option
  2020-07-23 20:21     ` Junio C Hamano
@ 2020-07-23 22:18       ` Junio C Hamano
  2020-07-24 13:36       ` Derrick Stolee
  1 sibling, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 22:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>>  	for (i = 0; !result && i < num_tasks; i++) {
>> -		if (!tasks[i]->enabled)
>> +		if (opts.tasks_selected && !tasks[i]->selected)
>> +			continue;
>> +
>> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>>  			continue;
>
> I am not sure about this.  Even if the task <x> is disabled, if the
> user says --task=<x>, it is run anyway?  Doesn't make an immediate
> sense to me.

OK, after seeing the title of 12/18, it does make sense.  Even if it
is disabled by default, you could still choose to run, which makes
sense.



* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 22:15     ` Junio C Hamano
@ 2020-07-23 23:09       ` Eric Sunshine
  2020-07-23 23:24         ` Junio C Hamano
  2020-07-24 19:51       ` Derrick Stolee
  1 sibling, 1 reply; 164+ messages in thread
From: Eric Sunshine @ 2020-07-23 23:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, Git List, Johannes Schindelin,
	brian m. carlson, steadmon, Jonathan Nieder, Jeff King,
	Doan Tran Cong Danh, Phillip Wood, Emily Shaffer, Son Luong Ngoc,
	Jonathan Tan, Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > +     struct strbuf batch_arg = STRBUF_INIT;
> > +
> > -     argv_array_push(&cmd, "--batch-size=0");
> > +     strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
> > +                 (uintmax_t)get_auto_pack_size());
> > +     argv_array_push(&cmd, batch_arg.buf);
> >
> > +     strbuf_release(&batch_arg);
>
> I think I saw a suggestion to use xstrfmt() with free()  instead of
> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
> in a similar but different context.  Perhaps we should follow suit
> here, too?

Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
be even simpler?

    argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
        (uintmax_t)get_auto_pack_size());


* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 23:09       ` Eric Sunshine
@ 2020-07-23 23:24         ` Junio C Hamano
  2020-07-24 16:09           ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-23 23:24 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Derrick Stolee via GitGitGadget, Git List, Johannes Schindelin,
	brian m. carlson, steadmon, Jonathan Nieder, Jeff King,
	Doan Tran Cong Danh, Phillip Wood, Emily Shaffer, Son Luong Ngoc,
	Jonathan Tan, Derrick Stolee, Derrick Stolee

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> > +     struct strbuf batch_arg = STRBUF_INIT;
>> > +
>> > -     argv_array_push(&cmd, "--batch-size=0");
>> > +     strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
>> > +                 (uintmax_t)get_auto_pack_size());
>> > +     argv_array_push(&cmd, batch_arg.buf);
>> >
>> > +     strbuf_release(&batch_arg);
>>
>> I think I saw a suggestion to use xstrfmt() with free()  instead of
>> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
>> in a similar but different context.  Perhaps we should follow suit
>> here, too?
>
> Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
> be even simpler?
>
>     argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
>         (uintmax_t)get_auto_pack_size());

No, it was me who was missing an obvious and better alternative.


* Re: [PATCH v2 04/18] maintenance: initialize task array
  2020-07-23 19:57     ` Junio C Hamano
@ 2020-07-24 12:23       ` Derrick Stolee
  2020-07-24 12:51         ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 12:23 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 3:57 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +static void initialize_tasks(void)
>> +{
>> +	int i;
>> +	num_tasks = 0;
>> +
>> +	for (i = 0; i < MAX_NUM_TASKS; i++)
>> +		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>> +
>> +	tasks[num_tasks]->name = "gc";
>> +	tasks[num_tasks]->fn = maintenance_task_gc;
>> +	tasks[num_tasks]->enabled = 1;
>> +	num_tasks++;
> 
> Are we going to have 47 different tasks initialized by code like
> this in the future?  I would have expected that you'd have a table
> of tasks that serves as the blueprint copy and copy it to the table
> to be used if there is some need to mutate the table-to-be-used.

Making it a table will likely make it easier to read. I hadn't
thought of it.

At the start, I thought that the diff would look awful as we add
members to the struct. However, the members that are not specified
are set to zero, so I should be able to craft this into something
not too terrible.

Thanks,
-Stolee


* Re: [PATCH v2 04/18] maintenance: initialize task array
  2020-07-24 12:23       ` Derrick Stolee
@ 2020-07-24 12:51         ` Derrick Stolee
  2020-07-24 19:39           ` Junio C Hamano
  2020-07-25  1:46           ` Taylor Blau
  0 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 12:51 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/24/2020 8:23 AM, Derrick Stolee wrote:
> On 7/23/2020 3:57 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> +static void initialize_tasks(void)
>>> +{
>>> +	int i;
>>> +	num_tasks = 0;
>>> +
>>> +	for (i = 0; i < MAX_NUM_TASKS; i++)
>>> +		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>>> +
>>> +	tasks[num_tasks]->name = "gc";
>>> +	tasks[num_tasks]->fn = maintenance_task_gc;
>>> +	tasks[num_tasks]->enabled = 1;
>>> +	num_tasks++;
>>
>> Are we going to have 47 different tasks initialized by code like
>> this in the future?  I would have expected that you'd have a table
>> of tasks that serves as the blueprint copy and copy it to the table
>> to be used if there is some need to mutate the table-to-be-used.
> 
> Making it a table will likely make it easier to read. I hadn't
> thought of it.
> 
> At the start, I thought that the diff would look awful as we add
> members to the struct. However, the members that are not specified
> are set to zero, so I should be able to craft this into something
> not too terrible.

OK, my attempt has led to this final table:

	const struct maintenance_task default_tasks[] = {
		{
			"prefetch",
			maintenance_task_prefetch,
		},
		{
			"loose-objects",
			maintenance_task_loose_objects,
			loose_object_auto_condition,
		},
		{
			"incremental-repack",
			maintenance_task_incremental_repack,
			incremental_repack_auto_condition,
		},
		{
			"gc",
			maintenance_task_gc,
			need_to_gc,
			1,
		},
		{
			"commit-graph",
			maintenance_task_commit_graph,
			should_write_commit_graph,
		}
	};
	num_tasks = sizeof(default_tasks) / sizeof(struct maintenance_task);

This is followed by allocating and copying the data to the
'tasks' array, allowing it to be sorted and modified according
to command-line arguments and config.
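
The "allocate and copy" step might look roughly like the following.
This is a sketch under assumptions: the reduced struct layout, the stub
task functions and copy_default_tasks() are all hypothetical, and
designated initializers are used only to make the omitted (zeroed)
members explicit.

```c
#include <stdlib.h>
#include <string.h>

typedef int maintenance_task_fn(void);

struct maintenance_task {
	const char *name;
	maintenance_task_fn *fn;
	unsigned enabled:1;
};

static int stub_prefetch(void)
{
	return 0;
}

static int stub_gc(void)
{
	return 0;
}

/* Read-only blueprint; members that are not named start out as zero. */
static const struct maintenance_task default_tasks[] = {
	{ .name = "prefetch", .fn = stub_prefetch },
	{ .name = "gc", .fn = stub_gc, .enabled = 1 },
};

#define NUM_DEFAULT_TASKS (sizeof(default_tasks) / sizeof(default_tasks[0]))

/* Return a heap-allocated, mutable copy of the blueprint (or NULL). */
static struct maintenance_task *copy_default_tasks(size_t *nr)
{
	struct maintenance_task *copy = malloc(sizeof(default_tasks));

	if (!copy)
		return NULL;
	memcpy(copy, default_tasks, sizeof(default_tasks));
	*nr = NUM_DEFAULT_TASKS;
	return copy;
}
```

The copy can then be sorted or have its enabled bits flipped without
touching the blueprint.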

Is this what you intended?

Thanks,
-Stolee


* Re: [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-23 20:22     ` Junio C Hamano
@ 2020-07-24 13:09       ` Derrick Stolee
  2020-07-24 19:47         ` Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 13:09 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 4:22 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +static int maintenance_task_commit_graph(void)
>> +{
>> +	struct repository *r = the_repository;
>> +	char *chain_path;
>> +
>> +	/* Skip commit-graph when --auto is specified. */
>> +	if (opts.auto_flag)
>> +		return 0;

Now that you point this out, this is actually a stray condition
from an earlier version. We now have the ".enabled" config and
the auto condition function pointer. That handles all of that
"should we run this when --auto is specified?" logic outside of
the task itself.
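
Put differently, the decision moves into the loop that drives the
tasks.  A rough sketch of that idea, assuming a per-task auto-condition
callback like the third member of the table posted earlier in this
thread.  The helper should_run() and the rule for a task without a
callback are assumptions made for this sketch, not the series' code.

```c
#include <stddef.h>

typedef int auto_condition_fn(void);

struct task {
	const char *name;
	auto_condition_fn *auto_condition;	/* may be NULL */
	unsigned enabled:1;
};

static int always(void)
{
	return 1;
}

static int never(void)
{
	return 0;
}

/*
 * Decide whether to run a task.  Assumption for this sketch: under
 * --auto, a task without a condition callback is skipped.
 */
static int should_run(const struct task *t, int auto_flag)
{
	if (!t->enabled)
		return 0;
	if (!auto_flag)
		return 1;
	return t->auto_condition && t->auto_condition();
}
```

The task function itself then never needs to look at the --auto flag.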
 
> Stepping back a bit, back in "git gc" days, "--auto" had two
> distinct meanings rolled into one.  Check if it even needs to be
> done, and perform only the lightweight variant if needed.
> 
> For this task, no "lightweight variant" is possible, so
> returning without checking the need to do a lightweight one makes
> perfect sense here.
> 
> But wouldn't it suggest perhaps we could name "auto" field of the
> options struct in a more meaningful way?  Perhaps "quick" (i.e. only
> the quicker-variant of the maintenance job) or something?

But you are discussing here how the _behavior_ can change when
--auto is specified. And specifically, "git gc --auto" really
meant "This is running after a foreground command, so only do
work if necessary and do it quickly to minimize blocking time."

I'd be happy to replace "--auto" with "--quick" in the
maintenance builtin.

This opens up some extra design space for how the individual
tasks perform depending on "--quick" being specified or not.
My intention was to create tasks that are already in "quick"
mode:

* loose-objects have a maximum batch size.
* incremental-repack is capped in size.
* commit-graph uses the --split option.

But this "quick" distinction might be important for some of
the tasks we intend to extract from the gc builtin.

>> +	close_object_store(r->objects);
>> +	if (run_write_commit_graph()) {
>> +		error(_("failed to write commit-graph"));
>> +		return 1;
>> +	}
>> +
>> +	if (!run_verify_commit_graph())
>> +		return 0;
>> +
>> +	warning(_("commit-graph verify caught error, rewriting"));
>> +
>> +	chain_path = get_commit_graph_chain_filename(r->objects->odb);
>> +	if (unlink(chain_path)) {
>> +		UNLEAK(chain_path);
>> +		die(_("failed to remove commit-graph at %s"), chain_path);
> 
> OK.
> 
>> +	}
>> +	free(chain_path);
>> +
>> +	if (!run_write_commit_graph())
>> +		return 0;
>> +
>> +	error(_("failed to rewrite commit-graph"));
>> +	return 1;
>> +}
> 
> Error convention is "positive for error, zero for success?"  That is
> a bit unusual for our internal API.

Since the tasks are frequently running subcommands, returning
0 for success and non-zero for error matches the error codes
returned by those subcommands.

Should I instead change the behavior and clearly document that
task functions matching maintenance_task_fn follow this error
pattern?

>> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
>> index e4e4036e50..216ac0b19e 100755
>> --- a/t/t7900-maintenance.sh
>> +++ b/t/t7900-maintenance.sh
>> @@ -12,7 +12,7 @@ test_expect_success 'help text' '
>>  	test_i18ngrep "usage: git maintenance run" err
>>  '
>>  
>> -test_expect_success 'gc [--auto|--quiet]' '
>> +test_expect_success 'run [--auto|--quiet]' '
> 
> It does not look like this change belongs here.  If "run" is
> appropriate title for this test at this step, it must have been so
> in the previous step.

Thanks. Will fix.

-Stolee


* Re: [PATCH v2 06/18] maintenance: add --task option
  2020-07-23 20:21     ` Junio C Hamano
  2020-07-23 22:18       ` Junio C Hamano
@ 2020-07-24 13:36       ` Derrick Stolee
  2020-07-24 19:50         ` Junio C Hamano
  1 sibling, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 13:36 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 4:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +static int compare_tasks_by_selection(const void *a_, const void *b_)
>> +{
>> +	const struct maintenance_task *a, *b;
>> +	a = (const struct maintenance_task *)a_;
>> +	b = (const struct maintenance_task *)b_;
>> +
>> +	return b->task_order - a->task_order;
>> +}
> 
> It forces the reader to know intimately that task_order *is*
> selection order in order to understand why this is "tasks by
> selection".  Perhaps rename the field to match what it is
> (i.e. "selection_order")?

Good idea. I made this fix locally.

>>  static int maintenance_run(void)
>>  {
>>  	int i;
>>  	int result = 0;
>>  
>> +	if (opts.tasks_selected)
>> +		QSORT(tasks, num_tasks, compare_tasks_by_selection);
>> +
>>  	for (i = 0; !result && i < num_tasks; i++) {
>> -		if (!tasks[i]->enabled)
>> +		if (opts.tasks_selected && !tasks[i]->selected)
>> +			continue;
>> +
>> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>>  			continue;
> 
> I am not sure about this.  Even if the task <x> is disabled, if the
> user says --task=<x>, it is run anyway?  Doesn't make an immediate
> sense to me.

You already replied that you figured this out. However, I could make
it easier by adding some foreshadowing in the commit message here.

> As I am bad at deciphering de Morgan, I'd find it easier to read if
> the first were written more like so:
> 
> 		if (!(!opts.tasks_selected || tasks[i]->selected))
> 			continue;
> 
> That is, "do not skip any when not limited, and do not skip the ones
> that are selected when limited".  And that would easily extend to
> 
> 		if (!tasks[i]->enabled ||
> 		    !(!opts.tasks_selected || tasks[i]->selected))
> 			continue;

This isn't quite right, due to the confusing nature of "enabled".
The condition here will _never_ allow selecting a disabled task.

Perhaps it would be better to rename 'enabled' to 'run_by_default'?
That would make it clear that it is choosing which tasks to run unless
specified otherwise with --task=<task> here, the config option
maintenance.<task>.enabled later, and the --auto conditions even later.
Looking even farther down the line (into the next series) there will be
similar checks for auto-conditions checking time-based schedules.

Since this loop becomes more complicated in the future, I specifically
wanted to group the "skip this task" conditions into their own if
blocks:

	1. If the user didn't specify --task=<task> explicitly and this
 	   task is disabled, then skip this task.

	2. If the user _did_ specify --task=<task> explicitly and this
	   task was not on the list, then skip this task.

	3. If the user specified --auto and the auto condition fails,
	   then skip this task.

	4. (Later) If the user specified --scheduled and the time since
	   the last run is too soon, then skip this task.

With this being the planned future, I'd prefer these be split out as
separate if conditions instead of a giant combined if. And since that
is the plan, then I won't work too hard to combine conditions 1 and 2
into a single condition.
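
To make the grouping concrete, here is a minimal sketch of the first
three skip rules, one per if block. The struct layouts and the name
should_skip() are hypothetical simplifications for illustration, not
the code in the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the series' types. */
struct task {
	const char *name;
	int enabled;                 /* "run by default" */
	int selected;                /* named via --task=<task> */
	int (*auto_condition)(void); /* NULL: always run under --auto */
	int (*fn)(void);             /* 0 on success, non-zero on error */
};

struct run_opts {
	int tasks_selected; /* at least one --task=<task> was given */
	int auto_flag;      /* --auto was given */
};

/* Returns 1 if the task should be skipped; one rule per if block. */
int should_skip(const struct task *t, const struct run_opts *o)
{
	assert(t && o);

	/* 1. No explicit selection: skip tasks that are off by default. */
	if (!o->tasks_selected && !t->enabled)
		return 1;

	/* 2. Explicit selection: skip tasks that were not named. */
	if (o->tasks_selected && !t->selected)
		return 1;

	/* 3. --auto: skip tasks whose auto condition reports no work. */
	if (o->auto_flag && t->auto_condition && !t->auto_condition())
		return 1;

	return 0;
}
```

A later time-based rule would become a fourth block in the same
function without touching the first three.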

>> +
>>  		result = tasks[i]->fn();
>>  	}
> 
>> @@ -842,6 +861,44 @@ static void initialize_tasks(void)
>>  	num_tasks++;
>>  }
>>  
>> +static int task_option_parse(const struct option *opt,
>> +			     const char *arg, int unset)
>> +{
>> +	int i;
>> +	struct maintenance_task *task = NULL;
>> +
>> +	BUG_ON_OPT_NEG(unset);
>> +
>> +	if (!arg || !strlen(arg)) {
>> +		error(_("--task requires a value"));
>> +		return 1;
> 
> There is no need to special case an empty string that was explicitly
> given as the value---it will be caught as "'' is not a valid task".

Sounds good. No need for this extra message.

Thanks,
-Stolee


* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-23 20:53     ` Junio C Hamano
@ 2020-07-24 14:25       ` Derrick Stolee
  2020-07-24 20:47         ` Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 14:25 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 4:53 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
>> index 9204762e21..0927643247 100644
>> --- a/Documentation/git-maintenance.txt
>> +++ b/Documentation/git-maintenance.txt
>> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>>  `commit-graph-chain` file. They will be deleted by a later run based on
>>  the expiration delay.
>>  
>> +prefetch::
>> +	The `fetch` task updates the object directory with the latest objects
> 
> s/fetch/prefetch/ most likely.
> 
>> +	from all registered remotes. For each remote, a `git fetch` command
>> +	is run. The refmap is custom to avoid updating local or remote
> 
> s/remote/remote-tracking/ definitely.  Do not forget the hyphen
> between the two words.
> 
> I think it made the above unnecessarily confusing that you ended a
> sentence after "is run".  It gives a wrong impression that you'd be
> doing a "real fetch", which you need to dispel with a follow up
> description of the refmap.
> 
> 	For each remote, a `git fetch` command is run with a refspec
> 	to fetch their branches (those in their `refs/heads`) into
> 	our `refs/prefetch/<remote>/` hierarchy and without auto
> 	following tags (the configured refspec in the repository is
> 	ignored).
> 
>> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
>> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
>> +	not updated.
>> ++
>> +This means that foreground fetches are still required to update the
>> +remote refs, but the users is notified when the branches and tags are
> 
> s/is notified/are notified/???
> 
>> +updated on the remote.
> 
> Often, when one needs to say "X.  This means Y.", X is a suboptimal
> way to explain what needs to be conveyed to the readers.  But this
> is not such a case.  Unlike the "This means" that is often an
> attempt to rephrasing a poor explanation given first, this gives an
> implication.
> 
> But let's not start with a negative impression (i.e. even with
> prefetch, I still have to do X?  What's the point???), but let them
> feel why it is a good thing.  Perhaps (continuing my previous
> rewrite):
> 
> 	This is done to avoid disrupting the remote-tracking
> 	branches---the end users expect them to stay unmoved unless
> 	they initiate a fetch.  With prefetch task, however, the
> 	objects necessary to complete a later real fetch would
> 	already be obtained, so the real fetch would go faster.  In
> 	the ideal case, it will just become an update to bunch of
> 	remote-tracking branches without any object transfer.
> 
> or something like that?  

I like this clarification and have adapted it with minimal edits.

>> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
>> +			 "--no-tags", "--refmap=", NULL);
>> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
>> +	argv_array_push(&cmd, refmap.buf);
> 
> The command line looks somewhat fishy, but I think it is correct.
> At first glance it looks like a mistake to pass "--refmap=" and the
> refspec "+refs/heads/*:refs/prefetch/origin/*" as separate arguments,
> but I think that is exactly what you want here, i.e.
> 
>  - defeat any refspec in the configuration with "--refmap=<empty>"
> 
>  - give explicit refspec "+refs/heads/*:...", with "--no-tags" to
>    decline auto-following, to tell what exactly are to be fetched
>    and stored where.
> 
> The description in the log message about refmap needs to be
> clarified, though (I've already done so in the above suggested
> rewrite).

I could have made your life easier by referring to b40a50264ac
(fetch: document and test --refmap="", 2020-01-21) in my commit
message. It includes this sentence in Documentation/fetch-options.txt:

  Providing an empty `<refspec>` to the `--refmap` option causes
  Git to ignore the configured refspecs and rely entirely on the
  refspecs supplied as command-line arguments.

>> +static int maintenance_task_prefetch(void)
>> +{
>> +	int result = 0;
>> +	struct string_list_item *item;
>> +	struct string_list remotes = STRING_LIST_INIT_DUP;
>> +
>> +	if (for_each_remote(fill_each_remote, &remotes)) {
>> +		error(_("failed to fill remotes"));
>> +		result = 1;
>> +		goto cleanup;
>> +	}
>> +
>> +	/*
>> +	 * Do not modify the result based on the success of the 'fetch'
>> +	 * operation, as a loss of network could cause 'fetch' to fail
>> +	 * quickly. We do not want that to stop the rest of our
>> +	 * background operations.
>> +	 */
> 
> The loop that runs different tasks aborts at the first failure,
> though.  Perhaps that loop needs to be rethought as well?

You're right. These maintenance tasks are intended to be
independent of each other, so let's try all of them and
report a failure after all have been given an opportunity
to run. That makes this failure behavior unnecessary.
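
As a rough sketch of that loop shape (the names run_all_tasks, task_ok,
task_fail and tasks_run are made up for this example and are not from
the series):

```c
#include <assert.h>
#include <stddef.h>

typedef int (*task_fn)(void); /* 0 on success, non-zero on error */

/*
 * Run every task even if an earlier one failed. Return 1 if any
 * task reported an error, 0 otherwise.
 */
int run_all_tasks(const task_fn *fns, size_t nr)
{
	int result = 0;
	size_t i;

	assert(fns || !nr);

	for (i = 0; i < nr; i++)
		if (fns[i]())
			result = 1;

	return result;
}

/* Trivial tasks and a call counter, for demonstration only. */
int tasks_run;
int task_ok(void) { tasks_run++; return 0; }
int task_fail(void) { tasks_run++; return 1; }
```

With { task_fail, task_ok } as input, both tasks run and the overall
result is 1, so a failed fetch no longer hides the later tasks.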

>> +	for (item = remotes.items;
>> +	     item && item < remotes.items + remotes.nr;
>> +	     item++)
>> +		fetch_remote(item->string);
>> +
>> +cleanup:
>> +	string_list_clear(&remotes, 0);
>> +	return result;
>> +}
>> +
>>  static int maintenance_task_gc(void)
>>  {
>>  	int result;
>> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>>  
>> +	tasks[num_tasks]->name = "prefetch";
>> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
>> +	num_tasks++;
>> +
>>  	tasks[num_tasks]->name = "gc";
>>  	tasks[num_tasks]->fn = maintenance_task_gc;
>>  	tasks[num_tasks]->enabled = 1;
> 
> Two things.
> 
>  - As I said upfront, I do not see the point of preparing the table
>    with code.
> 
>  - The reason why prefetch is placed in front is probably because
>    you do not want to repack before you add more objects to the
>    object store.  But doesn't that imply that there is an inherent
>    ordering that we, as those who are more expert on Git than the
>    end users, prefer?  Is it a wise decision to let the users affect
>    the order of the tasks run by giving command line options in
>    different order in the previous step?

I don't anticipate users specifying --task=<task> very often, as
it requires deep knowledge of the tasks. If a user _does_ use the
option, then we should trust their order as they might have a
good reason to choose that order.

Generally, my philosophy is to provide expert users with flexible
choices while creating sensible defaults for non-expert users.

That said, if this sort is more of a problem than the value it
provides, then I can drop the sort and the duplicate --task=<task>
check.

Thanks,
-Stolee


* Re: [PATCH v2 09/18] maintenance: add loose-objects task
  2020-07-23 20:59     ` Junio C Hamano
@ 2020-07-24 14:50       ` Derrick Stolee
  2020-07-24 19:57         ` Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 14:50 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 4:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>> 1. Run 'git prune-packed' to delete any loose objects that exist
>>    in a pack-file. Concurrent commands will prefer the packed
>>    version of the object to the loose version. (Of course, there
>>    are exceptions for commands that specifically care about the
>>    location of an object. These are rare for a user to run on
>>    purpose, and we hope a user that has selected background
>>    maintenance will not be trying to do foreground maintenance.)
> 
> OK.  That would make sense.
> 
>> 2. Run 'git pack-objects' on a batch of loose objects. These
>>    objects are grouped by scanning the loose object directories in
>>    lexicographic order until listing all loose objects -or-
>>    reaching 50,000 objects. This is more than enough if the loose
>>    objects are created only by a user doing normal development.
> 
> I haven't seen this in action, but my gut feeling is that this would
> result in horrible locality and deltification in the resulting
> packfile.  The order you feed the objects to pack-objects and the
> path hint you attach to each object matters quite a lot.
> 
> I do agree that it would be useful to have a task to deal with only
> loose objects without touching existing packfiles.  I just am not
> sure if 2. is a worthwhile thing to do.  A poorly constructed pack
> will also contaminate later packfiles made without "-f" option to
> "git repack".

There are several factors going on here:

 * In a partial clone, it is likely that we get loose objects only
   due to a command like "git log -p" that downloads blobs
   one-by-one. In such a case, this step coming in later and picking
   up those blobs _will_ find good deltas because they are present
   in the same batch.

 * (I know this case isn't important to core Git, but please indulge
   me) In a VFS for Git repo, the loose objects correspond to blobs
   that were faulted in by a virtual filesystem read. In this case,
   the blobs are usually from a single commit in history, so good
   deltas between the blobs don't actually exist!

 * My experience indicates that the packs created by the
   loose-objects task are rather small (when created daily). This
   means that they get selected by the incremental-repack task to
   repack into a new pack-file where deltas are recomputed with modest
   success. As mentioned in that task, we saw a significant compression
   factor using that step for users of the Windows OS repo, mostly due
   to recomputing tree deltas.

 * Some amount of "extra" space is expected with this incremental
   repacking scheme. The most space-efficient thing to do is a full
   repack along with a tree walk that detects the paths used for each
   blob, allowing better hints for delta compression. However, that
   operation is very _time_ consuming. The trade-off here is something
   I should make more explicit. In my experience, disk space is cheap
   but CPU time is expensive. Most repositories could probably do a
   daily repack without being a disruption to the user. These steps
   enable maintenance for repositories where a full repack is too
   disruptive.

I hope this adds some context. I would love it if someone who knows more
about delta compression could challenge my assumptions. Sharing that
expertise can help create better maintenance strategies. Junio's
initial concern here is a good first step in that direction.

Thanks,
-Stolee


* Re: [PATCH v2 10/18] maintenance: add incremental-repack task
  2020-07-23 22:00     ` Junio C Hamano
@ 2020-07-24 15:03       ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 15:03 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 6:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>>    one did not exist, and otherwise will update the multi-pack-index
>>    with any new pack-files that appeared since the last write. This
>>    is particularly relevant with the background fetch job.
>>
>>    When the multi-pack-index sees two copies of the same object, it
>>    stores the offset data into the newer pack-file. This means that
>>    some old pack-files could become "unreferenced" which I will use
>>    to mean "a pack-file that is in the pack-file list of the
>>    multi-pack-index but none of the objects in the multi-pack-index
>>    reference a location inside that pack-file."
> 
> An obvious alternative is to favor the copy in the older pack,
> right?  Is the expectation that over time, most of the objects that
> are relevant would reappear in newer packs, so that eventually by
> favoring the copies in the newer packs, we can retire and remove the
> old pack, keeping only the newer ones?
> 
> But would that assumption hold?  The old packs hold objects that are
> necessary for the older parts of the history, so unless you are
> cauterizing away the old history, these objects in the older packs
> are likely to stay with us longer than those used by the newer parts
> of the history, some of which may not even have been pushed out yet
> and can be rebased away?

If we created a new pack-file containing an already-packed object,
then shouldn't we assume that the new pack-file does a _better_ job
of compressing that object? Or at least, doesn't make it worse?

For example, if we use 'git multi-pack-index repack --batch-size=0',
then this creates a new pack-file containing every previously-packed
object. This new pack-file should have better delta compression than
the previous setup across multiple pack-files. We want this new
pack-file to be used, not the old ones.

This "pick the newest pack" strategy is also what allows us to safely
use the 'expire' option to drop old pack-files. If we always keep the
old copies, then when we try to switch to the new pack-files, we cannot
delete the old packs safely because a concurrent Git process could
be trying to reference it. One race is as follows:

 * Process A opens the multi-pack-index referring to old pack P. It
   doesn't open the pack-file as it hasn't tried to parse objects yet.

 * Process B is 'git multi-pack-index expire'. It sees that pack P can
   be dropped because all objects appear in newer pack-files. It deletes
   P.

 * Process A tries to read from P, and this fails. A must reprepare its
   representation of the pack-files.

This is the less disruptive race since A can recover with a small cost
to its performance. The worse race (on Windows) is this:

 * Process A loads the multi-pack-index and tries to parse an object by
   loading "old" pack P.

 * Process B tries to delete P. However, on Windows the handle to P by
   A prevents the deletion.

At this point, there could be two resolutions. The first is to have
the 'expire' fail because we can't delete P. This means we might never
delete P in a busy repository. The second is that the 'expire' command
continues and drops P from the list in the multi-pack-index. However,
now all Git processes add P to the packed_git list because it isn't
referenced by the multi-pack-index.

In summary: the decision to pick the newer copies is a fundamental
part of how the write->expire->repack loop was designed.

>> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>>    and updates the multi-pack-index to drop those pack-files from the
>>    list. This is safe to do as concurrent Git processes will see the
>>    multi-pack-index and not open those packs when looking for object
>>    contents. (Similar to the 'loose-objects' job, there are some Git
>>    commands that open pack-files regardless of the multi-pack-index,
>>    but they are rarely used. Further, a user that self-selects to
>>    use background operations would likely refrain from using those
>>    commands.)
> 
> OK.
> 
>> 3. 'git multi-pack-index repack --batch-size=<size>' collects a set
>>    of pack-files that are listed in the multi-pack-index and creates
>>    a new pack-file containing the objects whose offsets are listed
>>    by the multi-pack-index to be in those pack-files. The set of pack-
>>    files is selected greedily by sorting the pack-files by modified
>>    time and adding a pack-file to the set if its "expected size" is
>>    smaller than the batch size until the total expected size of the
>>    selected pack-files is at least the batch size. The "expected
>>    size" is calculated by taking the size of the pack-file divided
>>    by the number of objects in the pack-file and multiplied by the
>>    number of objects from the multi-pack-index with offset in that
>>    pack-file. The expected size approximats how much data from that
> 
> approximates.
> 
>>    pack-file will contribute to the resulting pack-file size. The
>>    intention is that the resulting pack-file will be close in size
>>    to the provided batch size.
> 
>> +static int maintenance_task_incremental_repack(void)
>> +{
>> +	if (multi_pack_index_write()) {
>> +		error(_("failed to write multi-pack-index"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after initial write"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +
>> +	if (multi_pack_index_expire()) {
>> +		error(_("multi-pack-index expire failed"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after expire"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +	if (multi_pack_index_repack()) {
>> +		error(_("multi-pack-index repack failed"));
>> +		return 1;
>> +	}
> 
> Hmph, I wonder if these warnings should come from the helper
> functions that are static to this function anyway.
> 
> It also makes it easier to reason about this function by eliminating
> the need for having a different pattern only for the verify helper.
> Instead, verify could call rewrite internally when it notices a
> breakage.  I.e.
> 
> 	if (multi_pack_index_write())
> 		return 1;
> 	if (multi_pack_index_verify("after initial write"))
> 		return 1;
> 	if (multi_pack_index_expire())
> 		return 1;
> 	...

This is a cleaner model. I'll work on that.
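
One possible shape for such a verify helper is sketched below. It
assumes that a successful rewrite counts as success so the caller can
continue; struct midx_ops, verify_or_rewrite() and the two stub
functions are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real subcommand wrappers. */
struct midx_ops {
	int (*verify)(void);  /* 0 if the multi-pack-index is valid */
	int (*rewrite)(void); /* 0 if a from-scratch rewrite succeeded */
};

/*
 * Verify and, on failure, warn and fall back to a rewrite. The
 * caller only sees 0 (usable index) or 1 (unrecoverable).
 */
int verify_or_rewrite(const struct midx_ops *ops, const char *when)
{
	assert(ops && ops->verify && ops->rewrite);

	if (!ops->verify())
		return 0;

	fprintf(stderr, "warning: multi-pack-index verify failed %s\n",
		when);
	return ops->rewrite() ? 1 : 0;
}

/* Stubs used only to exercise the helper. */
int always_ok(void) { return 0; }
int always_bad(void) { return 1; }
```

The task function then reduces to a flat sequence of
"if (step()) return 1;" checks, with the warning text owned by the
helper.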

> Also, it feels odd, compared to our internal API convention, that
> positive non-zero is used as an error here.
...
> Exactly the same comment as 08/18 about natural/inherent ordering
> applies here as well.

I'll leave these to be resolved in the earlier messages.

Thanks,
-Stolee



* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 23:24         ` Junio C Hamano
@ 2020-07-24 16:09           ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 16:09 UTC (permalink / raw)
  To: Junio C Hamano, Eric Sunshine
  Cc: Derrick Stolee via GitGitGadget, Git List, Johannes Schindelin,
	brian m. carlson, steadmon, Jonathan Nieder, Jeff King,
	Doan Tran Cong Danh, Phillip Wood, Emily Shaffer, Son Luong Ngoc,
	Jonathan Tan, Derrick Stolee, Derrick Stolee

On 7/23/2020 7:24 PM, Junio C Hamano wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> 
>> On Thu, Jul 23, 2020 at 6:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>> +     struct strbuf batch_arg = STRBUF_INIT;
>>>> +
>>>> -     argv_array_push(&cmd, "--batch-size=0");
>>>> +     strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
>>>> +                 (uintmax_t)get_auto_pack_size());
>>>> +     argv_array_push(&cmd, batch_arg.buf);
>>>>
>>>> +     strbuf_release(&batch_arg);
>>>
>>> I think I saw a suggestion to use xstrfmt() with free()  instead of
>>> the sequence of strbuf_init(), strbuf_addf(), and strbuf_release()
>>> in a similar but different context.  Perhaps we should follow suit
>>> here, too?
>>
>> Perhaps I'm missing something obvious, but wouldn't argv_array_pushf()
>> be even simpler?
>>
>>     argv_array_pushf(&cmd, "--batch-size=%"PRIuMAX,
>>         (uintmax_t)get_auto_pack_size());
> 
> No, it was me who was missing an obvious and better alternative.

Today I learned about argv_array_pushf(). Thanks!

-Stolee



* Re: [PATCH v2 04/18] maintenance: initialize task array
  2020-07-24 12:51         ` Derrick Stolee
@ 2020-07-24 19:39           ` Junio C Hamano
  2020-07-25  1:46           ` Taylor Blau
  1 sibling, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-24 19:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> OK, my attempt has led to this final table:
>
> 	const struct maintenance_task default_tasks[] = {
> 		{
> 			"prefetch",
> 			maintenance_task_prefetch,
> 		},
>...
> 		{
> 			"commit-graph",
> 			maintenance_task_commit_graph,
> 			should_write_commit_graph,
> 		}
> 	};
> 	num_tasks = sizeof(default_tasks) / sizeof(struct maintenance_task);
>
> This is followed by allocating and copying the data to the
> 'tasks' array, allowing it to be sorted and modified according
> to command-line arguments and config.
>
> Is this what you intended?

I do not know how important it is for your overall design to keep
the blueprint/master-copy table that is separate from the working
copy of the table that gets sorted, enabled/chosen bit set, etc.
IIUC, you were modifying the entries' fields at runtime, so perhaps
a pristine copy is not all that important (in which case you can
just lose "const" and do without extra copying)?  I dunno.
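
A minimal sketch of the "lose const, no copying" variant might look
like this. The struct name, the noop() callback and the enabled values
for tasks other than "gc" are illustrative assumptions, not taken from
the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down task record. */
struct maintenance_task_sketch {
	const char *name;
	int (*fn)(void);
	int enabled;
};

int noop(void) { return 0; }

/*
 * Non-const on purpose: option and config parsing can flip fields
 * in place, so no separate working copy is allocated.
 */
struct maintenance_task_sketch tasks[] = {
	{ "prefetch", noop, 0 },
	{ "gc", noop, 1 },
	{ "commit-graph", noop, 0 },
};
const size_t num_tasks = sizeof(tasks) / sizeof(tasks[0]);

/* Look up a task by name; returns NULL if there is no such task. */
struct maintenance_task_sketch *find_task(const char *name)
{
	size_t i;

	assert(name);
	for (i = 0; i < num_tasks; i++)
		if (!strcmp(tasks[i].name, name))
			return &tasks[i];
	return NULL;
}
```

The cost is that the built-in defaults are no longer recoverable once
a field has been changed, which only matters if something needs them
later.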


* Re: [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-24 13:09       ` Derrick Stolee
@ 2020-07-24 19:47         ` Junio C Hamano
  2020-07-25  1:52           ` Taylor Blau
  0 siblings, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-24 19:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> But you are discussing here how the _behavior_ can change when
> --auto is specified. And specifically, "git gc --auto" really
> meant "This is running after a foreground command, so only do
> work if necessary and do it quickly to minimize blocking time."
>
> I'd be happy to replace "--auto" with "--quick" in the
> maintenance builtin.
>
> This opens up some extra design space for how the individual
> tasks perform depending on "--quick" being specified or not.
> My intention was to create tasks that are already in "quick"
> mode:
>
> * loose-objects have a maximum batch size.
> * incremental-repack is capped in size.
> * commit-graph uses the --split option.
>
> But this "quick" distinction might be important for some of
> the tasks we intend to extract from the gc builtin.

Yup.  To be honest, I came to this topic from a completely different
direction.  The field name "auto" alone (and no other field name)
had to have an extra cruft (i.e. "_flag") attached to it, which is
understandable but ugly.  Then I started thinking if 'auto(matic)'
is really the right word to describe what we want out of the option,
and came to the realization that there may be better words.

> Since the tasks are frequently running subcommands, returning
> 0 for success and non-zero for error matches the error codes
> returned by those subcommands.

As long as these will _never_ be called from other helper functions
but from the cmd_foo() top-level and their return values are only
used directly as the top-level's return value, I do not mind too
much.

But whenever I am writing such a code, I find myself not brave
enough to make such a bold promise (I saw other people call the
helpers I wrote in unintended ways and had to adjust the semantics
of them to accommodate the new callers too many times), so I'd rather
see the caller do "return !!helper_fn()" to allow helper_fn() to be
written more naturally (e.g. letting them return error(...)).
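
A small sketch of that split, with report_error() as a simplified
stand-in for error() (which prints a message and returns -1) and
hypothetical names for the helper and the top-level function:

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for error(): report and return -1. */
int report_error(const char *msg)
{
	assert(msg);
	fprintf(stderr, "error: %s\n", msg);
	return -1;
}

/* Helper written "naturally": 0 on success, negative on failure. */
int helper_fn(int should_fail)
{
	if (should_fail)
		return report_error("helper failed");
	return 0;
}

/* The top-level caller normalizes any non-zero value to 1. */
int cmd_example(int should_fail)
{
	return !!helper_fn(should_fail);
}
```

Here helper_fn() stays usable by callers that test for a negative
return, while the command's exit status is still 0 or 1.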

Thanks.


* Re: [PATCH v2 06/18] maintenance: add --task option
  2020-07-24 13:36       ` Derrick Stolee
@ 2020-07-24 19:50         ` Junio C Hamano
  0 siblings, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-24 19:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>>>  	for (i = 0; !result && i < num_tasks; i++) {
>>> -		if (!tasks[i]->enabled)
>>> +		if (opts.tasks_selected && !tasks[i]->selected)
>>> +			continue;
>>> +
>>> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>>>  			continue;
>> 
>> I am not sure about this.  Even if the task <x> is disabled, if the
>> user says --task=<x>, it is run anyway?  Doesn't make an immediate
>> sense to me.
>>  ...
>> 		if (!tasks[i]->enabled ||
>> 		    !(!opts.tasks_selected || tasks[i]->selected))
>> 			continue;
>
> This isn't quite right, due to the confusing nature of "enabled".

Yes, in the message you are responding to, I was still assuming that
--task=foo that defeats task.foo.enabled=no would be a bug.  If we
want to run disabled tasks by selection, of course the condition
would need to change.



* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 22:15     ` Junio C Hamano
  2020-07-23 23:09       ` Eric Sunshine
@ 2020-07-24 19:51       ` Derrick Stolee
  2020-07-24 20:17         ` Junio C Hamano
  1 sibling, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-24 19:51 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 6:15 PM, Junio C Hamano wrote:
> It might be too late to ask this now, but how is the quality of
> the resulting combined pack ensured, wrt locality and deltification?

There are two questions here, really.

The first is: given the set of objects to pack, are we packing
them as efficiently as possible?

Since e11d86de139 (midx: teach "git multi-pack-index repack" honor
"git repack" configurations, 2020-05-10), the 'repack' subcommand
honors the configured recommendations for deltas. This includes:

 (requires updating the arguments to pack-objects)
 * repack.useDeltaBaseOffset
 * repack.useDeltaIslands

 (automatically respected by pack-objects)
 * repack.packKeptObjects
 * pack.threads
 * pack.depth
 * pack.window
 * pack.windowMemory
 * pack.deltaCacheSize
 * pack.deltaCacheLimit

All of these config settings allow the user to specify how hard
to try for delta compression. If they know something about their
data or their tolerance for extra CPU time during pack-objects,
then they can get better deltas by changing these values.

The second question is "how well do the deltas compress when
only packing incrementally versus packing the entire repo?"

One important way to consider these things is how the pack-
files are created. If we expect most pack-files coming from
'git fetch' calls, then there are some interesting patterns
that arise.

I started measuring by creating a local clone of the Linux
kernel repo starting at v5.0 and then fetching an increment
of ten commits from the first-parent history of later tags.
Each fetch created a pack-file of ~300 MB relative to the
base pack-file of ~1.6 GB. Collecting ten of these in a row
leads to almost 2 GB of "fetched" packs.

However, keep in mind that we didn't fetch 2 GB of data
"across the wire" but instead expanded the thin pack into
a full pack by copying the base objects. After running the
incremental-repack step, that ~2 GB of data re-compresses
back down to one pack-file of size ~300 MB.

_Why_ did 10 pack-files all around 300 MB get repacked at
once? It's because there were duplicate objects across those
pack-files! Recall that the multi-pack-index repack computes
batch sizes by computing an "estimated pack size" by counting
how many objects in that pack-file are referenced by the
multi-pack-index, then computing

  expected size = actual size * num referenced objects
                              / num objects

In this case, the "base" objects that are copied between the
fetches already exist in these smaller pack-files. Thus, when
the batch-size is ~300 MB it still repacks all 10 "small"
packs into a new pack that is still ~300 MB.
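
The estimate described above can be sketched in C. This is an
illustration only: the function name and parameters are invented here,
and this is not the code from midx.c. It assumes the estimate scales a
pack's on-disk size by the fraction of its objects that the
multi-pack-index references, which is what makes packs full of
duplicate objects look small.

```c
#include <stdint.h>

/*
 * Estimate how much of a pack-file is still "live": scale its on-disk
 * size by the fraction of its objects that the multi-pack-index
 * references. Returns 0 for an empty pack to avoid dividing by zero.
 * (The multiplication is assumed not to overflow 64 bits.)
 */
static uint64_t estimated_pack_size(uint64_t pack_size,
				    uint64_t num_objects,
				    uint64_t referenced_objects)
{
	if (!num_objects)
		return 0;
	return pack_size * referenced_objects / num_objects;
}
```

With made-up numbers: a 300 MB pack in which only 10 of every 100
objects are referenced is estimated at 30 MB, so ten such packs fit in
a ~300 MB batch.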

Now, this is still a little wasteful. That second pack has
a significant "extra space" cost. However, it came at a bonus
of writing much less data.

Perhaps the Linux kernel repository is just too small to care
about this version of maintenance? In such a case, I can work
to introduce a 'full-repack' task that is more aggressive with
repacking all pack-files. This could use the multi-pack-index
repack with --batch-size=0 to still benefit from the
repack/expire trick for safe concurrency.

Other ideas are to try repacking in other ways, such as by
object type, to maximize easy wins. For example, perhaps we
repack all of the commits and trees every time, but leave the
blobs to be repacked when we are ready to spend time on
removing deltas?

I think the incremental-repack has value, and perhaps it is
isolated to super-huge repositories. That can be controlled
by limiting its use to those when an expert user configures
Git to use it.

I remain open to recommendations from others with more
experience with delta compression to recommend alternatives.

tl;dr: the incremental-repack isn't the most space-efficient
thing we can do, and that's by design.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 09/18] maintenance: add loose-objects task
  2020-07-24 14:50       ` Derrick Stolee
@ 2020-07-24 19:57         ` Junio C Hamano
  0 siblings, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-24 19:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>  * In a partial clone, it is likely that we get loose objects only
>    due to a command like "git log -p" that downloads blobs
>    one-by-one. In such a case, this step coming in later and picking
>    up those blobs _will_ find good deltas because they are present
>    in the same batch.
>
>  * (I know this case isn't important to core Git, but please indulge
>    me) In a VFS for Git repo, the loose objects correspond to blobs
>    that were faulted in by a virtual filesystem read. In this case,
>    the blobs are usually from a single commit in history, so good
>    deltas between the blobs don't actually exist!

Let me stop here by saying that I am now starting to worry about
overfitting the repacking strategy to lazy clone repositories.  I am
perfectly fine with the plan to start with just one strategy overfit
for partially cloned repositories, as long as we make sure that we
can extend it to suit other access/object acquisition patterns.




^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-24 19:51       ` Derrick Stolee
@ 2020-07-24 20:17         ` Junio C Hamano
  0 siblings, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-24 20:17 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> tl;dr: the incremental-repack isn't the most space-efficient
> thing we can do, and that's by design.

I am not very surprised by the fact that many packfiles that were
obtained by thin pack transfer have many duplicate objects (due to
having to include the delta bases), and it is natural to expect that
deduplication would save many bytes.  It's not all that interesting.

I am more interested in making sure that we can assure that in the
combined single pack, (1) objects are ordered for good locality of
access and (2) objects are getting good delta compression.


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-24 14:25       ` Derrick Stolee
@ 2020-07-24 20:47         ` Junio C Hamano
  0 siblings, 0 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-24 20:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> I don't anticipate users specifying --task=<task> very often, as
> it requires deep knowledge of the tasks. If a user _does_ use the
> option, then we should trust their order as they might have a
> good reason to choose that order.
>
> Generally, my philosophy is to provide expert users with flexible
> choices while creating sensible defaults for non-expert users.

Sounds sensible.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 01/18] maintenance: create basic maintenance runner
  2020-07-23 17:56   ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
@ 2020-07-25  1:26     ` Taylor Blau
  2020-07-25  1:47     ` Đoàn Trần Công Danh
  2020-07-29 22:19     ` Jonathan Nieder
  2 siblings, 0 replies; 164+ messages in thread
From: Taylor Blau @ 2020-07-25  1:26 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>

Everything in this patch looks very sensible to me, and implementing
'git maintenance' in the same file as the current 'git gc' builtin
makes a lot of sense.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-23 20:21     ` Junio C Hamano
@ 2020-07-25  1:33       ` Taylor Blau
  2020-07-30 13:29       ` Derrick Stolee
  1 sibling, 0 replies; 164+ messages in thread
From: Taylor Blau @ 2020-07-25  1:33 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

On Thu, Jul 23, 2020 at 01:21:55PM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +--[no-]maintenance::
> >  --[no-]auto-gc::
> > -	Run `git gc --auto` at the end to perform garbage collection
> > -	if needed. This is enabled by default.
> > +	Run `git maintenance run --auto` at the end to perform garbage
> > +	collection if needed. This is enabled by default.
>
> Shouldn't the new synonym be called --auto-maintenance or an
> abbreviation thereof?  It is not like we will run the full
> maintenance suite when "--no-maintenance" is omitted, which
> certainly is not the impression we want to give our readers.
>
> >  These objects may be removed by normal Git operations (such as `git commit`)
> > -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
> > -If these objects are removed and were referenced by the cloned repository,
> > -then the cloned repository will become corrupt.
> > +which automatically call `git maintenance run --auto` and `git gc --auto`.
>
> Hmph.  Perhaps the picture may change in the end of the series but I
> got an impression that "gc --auto" would eventually become just part
> of "maintenance --auto" and the users won't have to be even aware of
> its existence?  Wouldn't we rather want to say something like
>
> 	--[no-]auto-maintenance::
> 	--[no-]auto-gc::
>                 Run `git maintenance run --auto` at the end to perform
>                 garbage collection if needed (`--[no-]auto-gc` is a
>                 synonym).  This is enabled by default.
>
> > diff --git a/builtin/fetch.c b/builtin/fetch.c
> > index 82ac4be8a5..49a4d727d4 100644
> > --- a/builtin/fetch.c
> > +++ b/builtin/fetch.c
> > @@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
> >  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
> >  			N_("report that we have only objects reachable from this object")),
> >  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
>
> > +	OPT_BOOL(0, "maintenance", &enable_auto_gc,
> > +		 N_("run 'maintenance --auto' after fetching")),
> >  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
> > +		 N_("run 'maintenance --auto' after fetching")),
>
> OK, so this is truly a backward-compatible synonym at this point.

I wouldn't be opposed to making the 'auto-gc' option an
'OPT_HIDDEN_BOOL', but I realize that users may not want to move as
quickly as that. Perhaps we should visit this in a couple of releases
(or perhaps you are getting to it in a later patch that I haven't read
yet).

> > diff --git a/run-command.c b/run-command.c
> > index 9b3a57d1e3..82ad241638 100644
> > --- a/run-command.c
> > +++ b/run-command.c
> > @@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
> >  	return result;
> >  }
> >
> > -int run_auto_gc(int quiet)
> > +int run_auto_maintenance(int quiet)
> >  {
> >  	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
> >  	int status;
> >
> > -	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
> > +	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
> >  	if (quiet)
> >  		argv_array_push(&argv_gc_auto, "--quiet");
> > +	else
> > +		argv_array_push(&argv_gc_auto, "--no-quiet");
> > +
> >  	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
> >  	argv_array_clear(&argv_gc_auto);
> >  	return status;
>
> Don't we want to replace all _gc_ with _maintenance_ in this
> function?  I think the first business before we can do so would be
> to rethink if spelling out "maintenance" fully in code is a good
> idea in the first place.  It would make names for variables,
> structures and fields unnecessarily long without contributing to
> ease of understanding an iota, and an easy-to-remember short-form or
> an abbreviation may be needed.  Using a short-form/abbreviation
> wouldn't worsen the end-user experience, and not the developer
> experience for that matter.
>
> If we choose "gc" as the short-hand, most of the change in this step
> would become unnecessary.  I also do not mind if some other word
> or word-fragment (perhaps "maint"???) is chosen.

Yeah, writing out 'maintenance' every time in the code and in
command-line arguments is kind of a mouthful. I'm more willing to accept
that '--maintenance' is something that users would write or script
around, but 'maint' makes sense to me as a shorthand in the code.

I could go either way on calling the command-line flag '--maint',
though.

>
> > diff --git a/run-command.h b/run-command.h
> > index 191dfcdafe..d9a800e700 100644
> > --- a/run-command.h
> > +++ b/run-command.h
> > @@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
> >  /*
> >   * Trigger an auto-gc
> >   */
> > -int run_auto_gc(int quiet);
> > +int run_auto_maintenance(int quiet);
> >
> >  #define RUN_COMMAND_NO_STDIN 1
> >  #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
> > diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> > index a66dbe0bde..9850ecde5d 100755
> > --- a/t/t5510-fetch.sh
> > +++ b/t/t5510-fetch.sh
> > @@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
> >  		git config fetch.unpackLimit 1 &&
> >  		git config gc.autoPackLimit 1 &&
> >  		git config gc.autoDetach false &&
> > -		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
> > +		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
> >  		test_i18ngrep "Auto packing the repository" fetch.out &&
> >  		! grep "Should I try again" fetch.out
> >  	)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-23 17:56   ` [PATCH v2 08/18] maintenance: add prefetch task Derrick Stolee via GitGitGadget
  2020-07-23 20:53     ` Junio C Hamano
@ 2020-07-25  1:37     ` Đoàn Trần Công Danh
  2020-07-25  1:48       ` Junio C Hamano
  1 sibling, 1 reply; 164+ messages in thread
From: Đoàn Trần Công Danh @ 2020-07-25  1:37 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	phillip.wood123, emilyshaffer, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On 2020-07-23 17:56:30+0000, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> The task is called "prefetch" because it is work done in advance
> of a foreground fetch to make that 'git fetch' command much faster.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foreground 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
> 
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:

Does this job interfere with FETCH_HEAD?
From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
it looks like yes.

I (and some other people, probably) rely on FETCH_HEAD for our scripts.
Hence, it would be nice to not touch FETCH_HEAD with prefetch job.

Thanks,
-Danh

> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>    the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>    refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/prefetch/<remote> refs that no
>    longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paired with --no-show-forced-updates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-maintenance.txt | 12 ++++++
>  builtin/gc.c                      | 64 ++++++++++++++++++++++++++++++-
>  t/t7900-maintenance.sh            | 24 ++++++++++++
>  3 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 9204762e21..0927643247 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +prefetch::
> +	The `prefetch` task updates the object directory with the latest objects
> +	from all registered remotes. For each remote, a `git fetch` command
> +	is run. The refmap is custom to avoid updating local or remote
> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
> +	not updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the user is notified when the branches and tags are
> +updated on the remote.
> +
>  gc::
>  	Cleanup unnecessary files and optimize the local repository. "GC"
>  	stands for "garbage collection," but this task performs many
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 5d99b4b805..969c127877 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -28,6 +28,7 @@
>  #include "blob.h"
>  #include "tree.h"
>  #include "promisor-remote.h"
> +#include "remote.h"
>  
>  #define FAILED_RUN "failed to run %s"
>  
> @@ -700,7 +701,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	return 0;
>  }
>  
> -#define MAX_NUM_TASKS 2
> +#define MAX_NUM_TASKS 3
>  
>  static const char * const builtin_maintenance_usage[] = {
>  	N_("git maintenance run [<options>]"),
> @@ -781,6 +782,63 @@ static int maintenance_task_commit_graph(void)
>  	return 1;
>  }
>  
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	if (opts.quiet)
> +		argv_array_push(&cmd, "--quiet");
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_each_remote(struct remote *remote, void *cbdata)
> +{
> +	struct string_list *remotes = (struct string_list *)cbdata;
> +
> +	string_list_append(remotes, remote->name);
> +	return 0;
> +}
> +
> +static int maintenance_task_prefetch(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>  
> +	tasks[num_tasks]->name = "prefetch";
> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c09a9eb90b..8b04a04c79 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -44,4 +44,28 @@ test_expect_success 'run --task duplicate' '
>  	test_i18ngrep "cannot be selected multiple times" err
>  '
>  
> +test_expect_success 'run --task=prefetch with no remotes' '
> +	git maintenance run --task=prefetch 2>err &&
> +	test_must_be_empty err
> +'
> +
> +test_expect_success 'prefetch multiple remotes' '
> +	git clone . clone1 &&
> +	git clone . clone2 &&
> +	git remote add remote1 "file://$(pwd)/clone1" &&
> +	git remote add remote2 "file://$(pwd)/clone2" &&
> +	git -C clone1 switch -c one &&
> +	git -C clone2 switch -c two &&
> +	test_commit -C clone1 one &&
> +	test_commit -C clone2 two &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch &&
> +	grep ",\"fetch\",\"remote1\"" run-prefetch.txt &&
> +	grep ",\"fetch\",\"remote2\"" run-prefetch.txt &&
> +	test_path_is_missing .git/refs/remotes &&
> +	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
> +	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
> +	git log prefetch/remote1/one &&
> +	git log prefetch/remote2/two
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

-- 
Danh

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 04/18] maintenance: initialize task array
  2020-07-24 12:51         ` Derrick Stolee
  2020-07-24 19:39           ` Junio C Hamano
@ 2020-07-25  1:46           ` Taylor Blau
  1 sibling, 0 replies; 164+ messages in thread
From: Taylor Blau @ 2020-07-25  1:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On Fri, Jul 24, 2020 at 08:51:32AM -0400, Derrick Stolee wrote:
> On 7/24/2020 8:23 AM, Derrick Stolee wrote:
> > On 7/23/2020 3:57 PM, Junio C Hamano wrote:
> >> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>
> >>> +static void initialize_tasks(void)
> >>> +{
> >>> +	int i;
> >>> +	num_tasks = 0;
> >>> +
> >>> +	for (i = 0; i < MAX_NUM_TASKS; i++)
> >>> +		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
> >>> +
> >>> +	tasks[num_tasks]->name = "gc";
> >>> +	tasks[num_tasks]->fn = maintenance_task_gc;
> >>> +	tasks[num_tasks]->enabled = 1;
> >>> +	num_tasks++;
> >>
> >> Are we going to have 47 different tasks initialized by code like
> >> this in the future?  I would have expected that you'd have a table
> >> of tasks that serves as the blueprint copy and copy it to the table
> >> to be used if there is some need to mutate the table-to-be-used.
> >
> > Making it a table will likely make it easier to read. I hadn't
> > thought of it.
> >
> > At the start, I thought that the diff would look awful as we add
> > members to the struct. However, the members that are not specified
> > are set to zero, so I should be able to craft this into something
> > not too terrible.
>
> OK, my attempt has led to this final table:
>
> 	const struct maintenance_task default_tasks[] = {
> 		{
> 			"prefetch",
> 			maintenance_task_prefetch,
> 		},
> 		{
> 			"loose-objects",
> 			maintenance_task_loose_objects,
> 			loose_object_auto_condition,
> 		},
> 		{
> 			"incremental-repack",
> 			maintenance_task_incremental_repack,
> 			incremental_repack_auto_condition,
> 		},
> 		{
> 			"gc",
> 			maintenance_task_gc,
> 			need_to_gc,
> 			1,
> 		},
> 		{
> 			"commit-graph",
> 			maintenance_task_commit_graph,
> 			should_write_commit_graph,
> 		}
> 	};
> 	num_tasks = sizeof(default_tasks) / sizeof(struct maintenance_task);
>
> This is followed by allocating and copying the data to the
> 'tasks' array, allowing it to be sorted and modified according
> to command-line arguments and config.
>
> Is this what you intended?

I'm not sure if Junio intended what I'm going to suggest, but I think
that you could make looking up these "blueprint" tasks a little easier
by using the designated index initializer. For what it's worth, I wasn't
sure if we allow this in the codebase, but some quick perusing through
Documentation/CodingGuidelines turns up 512f41cfac (clean.c: use
designated initializer, 2017-07-14), which does use this style.

Maybe something like:

  enum maintenance_task_kind {
    TASK_PREFETCH = 0,
    TASK_LOOSE_OBJECTS,
    /* ... */
    TASK__COUNT
  };

  const struct maintenance_task default_tasks[TASK__COUNT] = {
    [TASK_PREFETCH] = {
      "prefetch",
      maintenance_task_prefetch,
    },
    [...] = ...
  };

and then you should be able to pick those out with
'default_tasks[TASK_PREFETCH]'. I'm not sure whether you can rely on
those tasks appearing in a fixed order; if you cannot, feel free to
discard this suggestion.

If nothing else, I'm glad that we can use the '[...] = '-style
initializers :-).
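
For reference, a compilable version of that pattern might look like the
following. It is a sketch under stated assumptions: the struct is a
simplified, hypothetical cut of the one in the series, only two tasks
are shown, and the task bodies are placeholders.

```c
/* Simplified, hypothetical version of the task struct from the series. */
struct maintenance_task {
	const char *name;
	int (*fn)(void);
	unsigned enabled:1;
};

/* Placeholder task bodies; the real ones run Git subcommands. */
static int task_prefetch(void) { return 0; }
static int task_gc(void) { return 0; }

enum maintenance_task_kind {
	TASK_PREFETCH = 0,
	TASK_GC,
	TASK__COUNT
};

/*
 * Each entry is addressed by its enum value, so the order of the
 * initializers in the source no longer matters. Members that are not
 * named (here, "enabled" for prefetch) are zero-initialized.
 */
static const struct maintenance_task default_tasks[TASK__COUNT] = {
	[TASK_GC] = {
		.name = "gc",
		.fn = task_gc,
		.enabled = 1,
	},
	[TASK_PREFETCH] = {
		.name = "prefetch",
		.fn = task_prefetch,
	},
};
```

Looking up 'default_tasks[TASK_GC]' then works regardless of where the
"gc" entry sits in the initializer list.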

> Thanks,
> -Stolee

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 01/18] maintenance: create basic maintenance runner
  2020-07-23 17:56   ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
  2020-07-25  1:26     ` Taylor Blau
@ 2020-07-25  1:47     ` Đoàn Trần Công Danh
  2020-07-29 22:19     ` Jonathan Nieder
  2 siblings, 0 replies; 164+ messages in thread
From: Đoàn Trần Công Danh @ 2020-07-25  1:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	phillip.wood123, emilyshaffer, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On 2020-07-23 17:56:23+0000, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> new file mode 100755
> index 0000000000..d00641c4dd
> --- /dev/null
> +++ b/t/t7900-maintenance.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +
> +test_description='git maintenance builtin'
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'help text' '
> +	test_must_fail git maintenance -h 2>err &&
> +	test_i18ngrep "usage: git maintenance run" err
> +'

I think this is already tested better in t0012.
Anyway, if we would like to check for the string "git maintenance run"
explicitly here, it would be clearer to use

s/test_must_fail/test_expect_code 129/

> +
> +test_expect_success 'gc [--auto]' '
> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
> +	grep ",\"gc\"]" run-no-auto.txt  &&
> +	grep ",\"gc\",\"--auto\"]" run-auto.txt
> +'
> +
> +test_done
> -- 
> gitgitgadget
> 

-- 
Danh

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-25  1:37     ` Đoàn Trần Công Danh
@ 2020-07-25  1:48       ` Junio C Hamano
  2020-07-27 14:07         ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-25  1:48 UTC (permalink / raw)
  To: Đoàn Trần Công Danh
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, phillip.wood123, emilyshaffer,
	sluongng, jonathantanmy, Derrick Stolee, Derrick Stolee

Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

>> When running 'git fetch <remote> <options>' in the background, use
>> the following options for careful updating:
>
> Does this job interfere with FETCH_HEAD?
> From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
> it looks like yes.
>
> I (and some other people, probably) rely on FETCH_HEAD for our scripts.
> Hence, it would be nice to not touch FETCH_HEAD with prefetch job.

Very good point.  For that, Derrick may want to swallow the single
patch from 'jc/no-update-fetch-head' topic into this series and
pass the new command line option.





^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-24 19:47         ` Junio C Hamano
@ 2020-07-25  1:52           ` Taylor Blau
  2020-07-30 13:59             ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Taylor Blau @ 2020-07-25  1:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On Fri, Jul 24, 2020 at 12:47:00PM -0700, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
> > But you are discussing here how the _behavior_ can change when
> > --auto is specified. And specifically, "git gc --auto" really
> > meant "This is running after a foreground command, so only do
> > work if necessary and do it quickly to minimize blocking time."
> >
> > I'd be happy to replace "--auto" with "--quick" in the
> > maintenance builtin.
> >
> > This opens up some extra design space for how the individual
> > tasks perform depending on "--quick" being specified or not.
> > My intention was to create tasks that are already in "quick"
> > mode:
> >
> > * loose-objects have a maximum batch size.
> > * incremental-repack is capped in size.
> > * commit-graph uses the --split option.
> >
> > But this "quick" distinction might be important for some of
> > the tasks we intend to extract from the gc builtin.
>
> Yup.  To be honest, I came to this topic from a completely different
> direction.  The field name "auto" alone (and no other field name)
> had to have an extra cruft (i.e. "_flag") attached to it, which is
> understandable but ugly.  Then I started thinking if 'auto(matic)'
> is really the right word to describe what we want out of the option,
> and came to the realization that there may be better words.

I wonder what the quick and slow paths are here. For the commit-graph
code, what you wrote here seems to match what I'd expect with passing
'--auto' in the sense of running 'git gc'. That is, I'm leaving it up to
the commit-graph machinery's idea of the normal '--split' rules to
figure out when to roll up layers of a commit-graph, as opposed to
creating a new layer and extending the chain.

So, I think that makes sense if the caller gave '--auto'. But, I'm not
sure that it makes sense if they didn't, in which case I'd imagine
something quicker to happen. There, I'd expect something more like:

  1. Run 'git commit-graph write --reachable --split=no-merge'.
  2. Run 'git commit-graph verify'.
  3. If 'git commit-graph verify' failed, drop the existing commit graph
     and rebuild it with 'git commit-graph write --reachable
     --split=replace'.
  4. Otherwise, do nothing.
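
The steps above could be sketched roughly as follows. This is only an
illustration, not code from the series: run_git_fn, the cg_* argument
lists, commit_graph_fast_path() and fake_run_git() are hypothetical
names, and the callback stands in for whatever actually spawns 'git'
(inside Git that would likely be run_command()). The sketch assumes
step 3 is spelled 'git commit-graph write --reachable --split=replace'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: runs "git <argv...>" and returns its exit code. */
typedef int (*run_git_fn)(const char *const *argv);

static const char *const cg_write_layer[] = {
	"commit-graph", "write", "--reachable", "--split=no-merge", NULL
};
static const char *const cg_verify[] = { "commit-graph", "verify", NULL };
static const char *const cg_rebuild[] = {
	"commit-graph", "write", "--reachable", "--split=replace", NULL
};

/* Returns 0 on success, or the exit code of the step that failed. */
static int commit_graph_fast_path(run_git_fn run_git)
{
	int ret = run_git(cg_write_layer);	/* step 1: add a new layer */

	if (ret)
		return ret;
	if (!run_git(cg_verify))		/* step 2 passed, so ... */
		return 0;			/* ... step 4: nothing to do */
	return run_git(cg_rebuild);		/* step 3: replace the chain */
}

/* Fake runner for experiments: "verify" fails, everything else succeeds. */
static int fake_calls;

static int fake_run_git(const char *const *argv)
{
	fake_calls++;
	return strcmp(argv[1], "verify") == 0;
}
```

Passing the runner in as a callback keeps the decision logic separate
from process spawning, so the three-command sequence can be exercised
without touching a real repository.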

I'm biased, of course, but I think that that matches roughly what I'd
expect to happen in the fast/slow path. Granted, the steps to rebuild
the commit graph are going to be slow no matter what (depending on the
size of the repository), and so in that case maybe the commit-graph
should just be dropped. I'm not really sure what to do about that...

> > Since the tasks are frequently running subcommands, returning
> > 0 for success and non-zero for error matches the error codes
> > returned by those subcommands.
>
> As long as these will _never_ be called from other helper functions
> but from the cmd_foo() top-level and their return values are only
> used directly as the top-level's return value, I do not mind too
> much.
>
> But whenever I am writing such code, I find myself not brave
> enough to make such a bold promise (I saw other people call the
> helpers I wrote in unintended ways and had to adjust the semantics
> of them to accommodate the new callers too many times), so I'd rather
> see the caller do "return !!helper_fn()" to allow helper_fn() to be
> written more naturally (e.g. letting them return error(...)).
>
> Thanks.

Thanks,
Taylor
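
To illustrate the "return !!helper_fn()" suggestion quoted above: a
helper can return -1 the way Git's error() does, and the top-level
command normalizes that to an exit status of 0 or 1. This is a minimal
sketch; report_error(), helper_fn() and cmd_example() are hypothetical
stand-ins, not functions from the series.

```c
#include <stdio.h>

/* Stand-in for Git's error(): print a message and return -1. */
static int report_error(const char *msg)
{
	fprintf(stderr, "error: %s\n", msg);
	return -1;
}

/* Helper written "naturally": 0 on success, negative on failure. */
static int helper_fn(int should_fail)
{
	if (should_fail)
		return report_error("task failed");
	return 0;
}

/* Top-level entry point: collapse any non-zero result to exit status 1. */
static int cmd_example(int should_fail)
{
	return !!helper_fn(should_fail);
}
```

With this split, other callers can still test for a negative return
from helper_fn(), while only the outermost function deals with process
exit codes.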

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-25  1:48       ` Junio C Hamano
@ 2020-07-27 14:07         ` Derrick Stolee
  2020-07-27 16:13           ` Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-27 14:07 UTC (permalink / raw)
  To: Junio C Hamano, Đoàn Trần Công Danh
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, phillip.wood123, emilyshaffer,
	sluongng, jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/24/2020 9:48 PM, Junio C Hamano wrote:
> Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:
> 
>>> When running 'git fetch <remote> <options>' in the background, use
>>> the following options for careful updating:
>>
>> Does this job interfere with FETCH_HEAD?
>> From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
>> it looks like yes.
>>
>> I (and some other people, probably) rely on FETCH_HEAD for our scripts.
>> Hence, it would be nice to not touch FETCH_HEAD with prefetch job.
> 
> Very good point.  For that, Derrick may want to swallow the single
> patch from 'jc/no-update-fetch-head' topic into this series and
> pass the new command line option.

Thanks for the pointer! I appreciate the attention to detail here.

I'll rebase onto jc/no-update-fetch-head for the next version, since
that branch is based on v2.28.0-rc0, which is recent enough.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-27 14:07         ` Derrick Stolee
@ 2020-07-27 16:13           ` Junio C Hamano
  2020-07-27 18:27             ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Junio C Hamano @ 2020-07-27 16:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Đoàn Trần Công Danh,
	Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, phillip.wood123, emilyshaffer,
	sluongng, jonathantanmy, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> I'll rebase onto jc/no-update-fetch-head for the next version, since
> that branch is based on v2.28.0-rc0, which is recent enough.

I do not think it is wise to base work on top of an unfinished "you
could do it this way, perhaps?" demonstration patch that the original
author does not have much inclination to finish, though.

When I am really bored, I may go back to the topic to finish it, but
I wouldn't mind if you took ownership of it at all.

Thanks.




^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 08/18] maintenance: add prefetch task
  2020-07-27 16:13           ` Junio C Hamano
@ 2020-07-27 18:27             ` Derrick Stolee
  2020-07-28 16:37               ` [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-27 18:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Đoàn Trần Công Danh,
	Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, phillip.wood123, emilyshaffer,
	sluongng, jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/27/2020 12:13 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>> that branch is based on v2.28.0-rc0, which is recent enough.
> 
> I do not think it is wise to base a work on top of unfinished "you
> could do it this way, perhaps?" demonstration patch the original
> author does not have much inclination to finish, though.
> 
> When I am really bored, I may go back to the topic to finish it, but
> I wouldn't mind if you took ownership of it at all.

Ah. I didn't understand the status of that branch. I'll pull it in
to this topic.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update
  2020-07-27 18:27             ` Derrick Stolee
@ 2020-07-28 16:37               ` Junio C Hamano
  2020-07-29  9:12                 ` Phillip Wood
  2020-07-30 15:17                 ` Derrick Stolee
  0 siblings, 2 replies; 164+ messages in thread
From: Junio C Hamano @ 2020-07-28 16:37 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Đoàn Trần Công Danh,
	Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, phillip.wood123, emilyshaffer,
	sluongng, jonathantanmy, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>> 
>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>> that branch is based on v2.28.0-rc0, which is recent enough.
>> 
>> I do not think it is wise to base a work on top of unfinished "you
>> could do it this way, perhaps?" demonstration patch the original
>> author does not have much inclination to finish, though.
>> 
>> When I am really bored, I may go back to the topic to finish it, but
>> I wouldn't mind if you took ownership of it at all.
>
> Ah. I didn't understand the status of that branch. I'll pull it in
> to this topic.

So here is a version with one of the two things that I found missing
in the first iteration of the patch: documentation.

The other thing that I found iffy (and still missing from this
version) was what should be done when "git pull" is explicitly given
the "--no-write-fetch-head" option.

I think (but didn't check the recent code) that 'git pull' would
pass only known-to-make-sense command line options to underlying
'git fetch', so it probably will barf with "unknown option", which
is the best case.  We might want to make sure of that with a new test in
5521.  On the other hand, if we get anything other than "no such
option", we may want to think if we want to "fix" it or just leave
it inside "if it hurts, don't do it" territory.

Thanks.  

The patch without doc was Reviewed-by: Taylor Blau <me@ttaylorr.com>
but this round has not been.

-- >8 --

If you run fetch but record the result in remote-tracking branches,
and either if you do nothing with the fetched refs (e.g. you are
merely mirroring) or if you always work from the remote-tracking
refs (e.g. you fetch and then merge origin/branchname separately),
you can get away with having no FETCH_HEAD at all.

Teach "git fetch" a command line option "--[no-]write-fetch-head"
and "fetch.writeFetchHEAD" configuration variable.  Without either,
the default is to write FETCH_HEAD, and the usual rule that the
command line option defeats configured default applies.

Note that under "--dry-run" mode, FETCH_HEAD is never written;
otherwise you'd see a list of objects in the file that you do not
actually have.  Passing `--fetch-write-head` does not force `git
fetch` to write the file.

Also note that this option is explicitly passed when "git pull"
internally invokes "git fetch", so that those who configured their
"git fetch" not to write FETCH_HEAD would not be able to break the
cooperation between these two commands.  "git pull" must see what
"git fetch" got recorded in FETCH_HEAD to work correctly.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/fetch.txt  |  7 ++++++
 Documentation/fetch-options.txt | 10 +++++++++
 builtin/fetch.c                 | 19 +++++++++++++---
 builtin/pull.c                  |  3 ++-
 t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
 5 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
index b20394038d..0aaa05e8c0 100644
--- a/Documentation/config/fetch.txt
+++ b/Documentation/config/fetch.txt
@@ -91,3 +91,10 @@ fetch.writeCommitGraph::
 	merge and the write may take longer. Having an updated commit-graph
 	file helps performance of many Git commands, including `git merge-base`,
 	`git push -f`, and `git log --graph`. Defaults to false.
+
+fetch.writeFetchHEAD::
+	Setting it to false tells `git fetch` not to write the list
+	of remote refs fetched in the `FETCH_HEAD` file directly
+	under `$GIT_DIR`.  Can be countermanded from the command
+	line with the `--[no-]write-fetch-head` option.  Defaults to
+	true.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 6e2a160a47..6775e8499f 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
 --dry-run::
 	Show what would be done, without making any changes.
 
+ifndef::git-pull[]
+--[no-]write-fetch-head::
+	Write the list of remote refs fetched in the `FETCH_HEAD`
+	file directly under `$GIT_DIR`.  This is the default unless
+	the configuration variable `fetch.writeFetchHEAD` is set to
+	false.  Passing `--no-write-fetch-head` from the command
+	line tells Git not to write the file.  Under `--dry-run`
+	option, the file is never written.
+endif::git-pull[]
+
 -f::
 --force::
 	When 'git fetch' is used with `<src>:<dst>` refspec it may
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 82ac4be8a5..3ccf69753f 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok;
+static int write_fetch_head = 1;
 static int verbosity, deepen_relative, set_upstream;
 static int progress = -1;
 static int enable_auto_gc = 1;
@@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.writefetchhead")) {
+		write_fetch_head = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
 		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
+	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
+		 N_("write fetched references to the FETCH_HEAD file")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
 	OPT_BOOL('u', "update-head-ok", &update_head_ok,
 		    N_("allow updating of HEAD ref")),
@@ -893,7 +900,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	const char *what, *kind;
 	struct ref *rm;
 	char *url;
-	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
+	const char *filename = (!write_fetch_head
+				? "/dev/null"
+				: git_path_fetch_head(the_repository));
 	int want_status;
 	int summary_width = transport_summary_width(ref_map);
 
@@ -1327,7 +1336,7 @@ static int do_fetch(struct transport *transport,
 	}
 
 	/* if not appending, truncate FETCH_HEAD */
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		retcode = truncate_fetch_head();
 		if (retcode)
 			goto cleanup;
@@ -1594,7 +1603,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
 	int i, result = 0;
 	struct argv_array argv = ARGV_ARRAY_INIT;
 
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		int errcode = truncate_fetch_head();
 		if (errcode)
 			return errcode;
@@ -1795,6 +1804,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (depth || deepen_since || deepen_not.nr)
 		deepen = 1;
 
+	/* FETCH_HEAD never gets updated in --dry-run mode */
+	if (dry_run)
+		write_fetch_head = 0;
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/builtin/pull.c b/builtin/pull.c
index 8159c5d7c9..e988d92b53 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
 	struct argv_array args = ARGV_ARRAY_INIT;
 	int ret;
 
-	argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
+	argv_array_pushl(&args, "fetch", "--update-head-ok",
+			 "--write-fetch-head", NULL);
 
 	/* Shared options */
 	argv_push_verbosity(&args);
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index a66dbe0bde..3052c2d8d5 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
 
 '
 
-test_expect_success 'fetch --dry-run' '
-
+test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
 	rm -f .git/FETCH_HEAD &&
 	git fetch --dry-run . &&
 	! test -f .git/FETCH_HEAD
 '
 
+test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success '--write-fetch-head gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --dry-run --write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
+	test -f .git/FETCH_HEAD
+'
+
 test_expect_success "should be able to fetch with duplicate refspecs" '
 	mkdir dups &&
 	(
-- 
2.28.0
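
The precedence described in the commit message above (built-in default,
then fetch.writeFetchHEAD, then the command line option, with --dry-run
overriding everything) can be summarized in a small model. This is a
sketch for illustration only: resolve_write_fetch_head() is a
hypothetical function, and the "-1 means not given" encoding is an
assumption of the sketch. The patch itself uses a single static
variable initialized to 1 that the config callback and option parser
overwrite in turn.

```c
/*
 * Hypothetical model of the precedence rules; pass -1 for a value
 * that was not given.  Returns 1 if FETCH_HEAD would be written.
 */
static int resolve_write_fetch_head(int config_value, int cli_value,
				    int dry_run)
{
	int write_fetch_head = 1;		/* built-in default */

	if (config_value >= 0)			/* fetch.writeFetchHEAD */
		write_fetch_head = config_value;
	if (cli_value >= 0)			/* --[no-]write-fetch-head */
		write_fetch_head = cli_value;
	if (dry_run)				/* --dry-run never writes */
		write_fetch_head = 0;
	return write_fetch_head;
}
```

The cases covered by the new tests in t5510 map directly onto this
function; for example, config "no" combined with --write-fetch-head
yields 1, while any combination with --dry-run yields 0.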


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-23 17:56   ` [PATCH v2 05/18] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
  2020-07-23 20:22     ` Junio C Hamano
@ 2020-07-29  0:22     ` Jeff King
  1 sibling, 0 replies; 164+ messages in thread
From: Jeff King @ 2020-07-29  0:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:27PM +0000, Derrick Stolee via GitGitGadget wrote:

> +static int run_write_commit_graph(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "write",
> +			 "--split", "--reachable", NULL);
> +
> +	if (opts.quiet)
> +		argv_array_push(&cmd, "--no-progress");
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	argv_array_clear(&cmd);
> +
> +	return result;
> +}

This is a pretty minor nit, but since I happened to be looking at the
merge of all of the recent argv_array callers today... :)

You can write this a bit more succinctly by reusing the argv_array
provided by the child_process:

  struct child_process cmd = CHILD_PROCESS_INIT;

  cmd.git_cmd = 1;
  argv_array_pushl(&cmd.args, "commit-graph", "write",
                   "--split", "--reachable", NULL);

  if (opts.quiet)
          argv_array_push(&cmd.args, "--no-progress");

  return run_command(&cmd);

Then you don't have to worry about freeing the argv memory, because it's
handled automatically.

Like I said, quite minor, but it looks like this pattern appears in a
few places, so it might be worth tweaking. And it would still work with
the "pushf" people recommended to avoid extra strbufs (I saw another one
in fetch_remote(), too).

-Peff

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update
  2020-07-28 16:37               ` [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano
@ 2020-07-29  9:12                 ` Phillip Wood
  2020-07-29  9:17                   ` Phillip Wood
  2020-07-30 15:17                 ` Derrick Stolee
  1 sibling, 1 reply; 164+ messages in thread
From: Phillip Wood @ 2020-07-29  9:12 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: Taylor Blau, Đoàn Trần Công Danh,
	Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 28/07/2020 17:37, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>
>>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>>> that branch is based on v2.28.0-rc0, which is recent enough.
>>>
>>> I do not think it is wise to base a work on top of unfinished "you
>>> could do it this way, perhaps?" demonstration patch the original
>>> author does not have much inclination to finish, though.
>>>
>>> When I am really bored, I may go back to the topic to finish it, but
>>> I wouldn't mind if you took ownership of it at all.
>>
>> Ah. I didn't understand the status of that branch. I'll pull it in
>> to this topic.
> 
> So here is with one of the two things that I found missing in the
> first iteration of the patch: documentation.
> 
> The other thing that I found iffy (and still missing from this
> version) was what should be done when "git pull" is explicitly given
> the "--no-write-fetch-head" option.
> 
> I think (but didn't check the recent code) that 'git pull' would
> pass only known-to-make-sense command line options to underlying
> 'git fetch', so it probably will barf with "unknown option", which
> is the best case.  We might want to make it sure with a new test in
> 5521.  On the other hand, if we get anything other than "no such
> option", we may want to think if we want to "fix" it or just leave
> it inside "if it hurts, don't do it" territory.
> 
> Thanks.
> 
> The patch without doc was Reviewed-by: Taylor Blau <me@ttaylorr.com>
> but this round has not been.
> 
> -- >8 --
> 
> If you run fetch but record the result in remote-tracking branches,
> and either if you do nothing with the fetched refs (e.g. you are
> merely mirroring) or if you always work from the remote-tracking
> refs (e.g. you fetch and then merge origin/branchname separately),
> you can get away with having no FETCH_HEAD at all.
> 
> Teach "git fetch" a command line option "--[no-]write-fetch-head"
> and "fetch.writeFetchHEAD" configuration variable.  Without either,
> the default is to write FETCH_HEAD, and the usual rule that the
> command line option defeats configured default applies.
> 
> Note that under "--dry-run" mode, FETCH_HEAD is never written;
> otherwise you'd see list of objects in the file that you do not
> actually have.  Passing `--fetch-write-head` 

Typo, it should be `--write-fetch-head`

>does not force `git
> fetch` to write the file.
> 
> Also note that this option is explicitly passed when "git pull"
> internally invokes "git fetch", so that those who configured their
> "git fetch" not to write FETCH_HEAD would not be able to break the
> cooperation between these two commands.  "git pull" must see what
> "git fetch" got recorded in FETCH_HEAD to work correctly.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>   Documentation/config/fetch.txt  |  7 ++++++
>   Documentation/fetch-options.txt | 10 +++++++++
>   builtin/fetch.c                 | 19 +++++++++++++---
>   builtin/pull.c                  |  3 ++-
>   t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
>   5 files changed, 72 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
> index b20394038d..0aaa05e8c0 100644
> --- a/Documentation/config/fetch.txt
> +++ b/Documentation/config/fetch.txt
> @@ -91,3 +91,10 @@ fetch.writeCommitGraph::
>   	merge and the write may take longer. Having an updated commit-graph
>   	file helps performance of many Git commands, including `git merge-base`,
>   	`git push -f`, and `git log --graph`. Defaults to false.
> +
> +fetch.writeFetchHEAD::
> +	Setting it to false tells `git fetch` not to write the list
> +	of remote refs fetched in the `FETCH_HEAD` file directly
> +	under `$GIT_DIR`.  Can be countermanded from the command
> +	line with the `--[no-]write-fetch-head` option.  Defaults to
> +	true.
> diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
> index 6e2a160a47..6775e8499f 100644
> --- a/Documentation/fetch-options.txt
> +++ b/Documentation/fetch-options.txt
> @@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
>   --dry-run::
>   	Show what would be done, without making any changes.
>   
> +ifndef::git-pull[]
> +--[no-]write-fetch-head::
> +	Write the list of remote refs fetched in the `FETCH_HEAD`
> +	file directly under `$GIT_DIR`.  This is the default unless
> +	the configuration variable `fetch.writeFetchHEAD` is set to
> +	false.  Passing `--no-write-fetch-head` from the command
> +	line tells Git not to write the file.  Under `--dry-run`
> +	option, the file is never written.
> +endif::git-pull[]
> +
>   -f::
>   --force::
>   	When 'git fetch' is used with `<src>:<dst>` refspec it may
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 82ac4be8a5..3ccf69753f 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
>   #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
>   
>   static int all, append, dry_run, force, keep, multiple, update_head_ok;
> +static int write_fetch_head = 1;
>   static int verbosity, deepen_relative, set_upstream;
>   static int progress = -1;
>   static int enable_auto_gc = 1;
> @@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
>   		return 0;
>   	}
>   
> +	if (!strcmp(k, "fetch.writefetchhead")) {
> +		write_fetch_head = git_config_bool(k, v);
> +		return 0;
> +	}
>   	return git_default_config(k, v, cb);
>   }
>   
> @@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
>   		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
>   	OPT_BOOL(0, "dry-run", &dry_run,
>   		 N_("dry run")),
> +	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
> +		 N_("write fetched references to the FETCH_HEAD file")),
>   	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
>   	OPT_BOOL('u', "update-head-ok", &update_head_ok,
>   		    N_("allow updating of HEAD ref")),
> @@ -893,7 +900,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
>   	const char *what, *kind;
>   	struct ref *rm;
>   	char *url;
> -	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
> +	const char *filename = (!write_fetch_head
> +				? "/dev/null"
> +				: git_path_fetch_head(the_repository));

I was suspicious of this as we haven't cleared write_fetch_head in the 
--dry-run case yet but the test below seems to show that we still don't 
write FETCH_HEAD if --dry-run is given. That makes me wonder what the
point of setting the filename based on the value of write_fetch_head is.

The rest looks good to me, though it might be worth having a test to 
check that pull does indeed reject --no-write-fetch-head

Best Wishes

Phillip

>   	int want_status;
>   	int summary_width = transport_summary_width(ref_map);
>   
> @@ -1327,7 +1336,7 @@ static int do_fetch(struct transport *transport,
>   	}
>   
>   	/* if not appending, truncate FETCH_HEAD */
> -	if (!append && !dry_run) {
> +	if (!append && write_fetch_head) {
>   		retcode = truncate_fetch_head();
>   		if (retcode)
>   			goto cleanup;
> @@ -1594,7 +1603,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
>   	int i, result = 0;
>   	struct argv_array argv = ARGV_ARRAY_INIT;
>   
> -	if (!append && !dry_run) {
> +	if (!append && write_fetch_head) {
>   		int errcode = truncate_fetch_head();
>   		if (errcode)
>   			return errcode;
> @@ -1795,6 +1804,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>   	if (depth || deepen_since || deepen_not.nr)
>   		deepen = 1;
>   
> +	/* FETCH_HEAD never gets updated in --dry-run mode */
> +	if (dry_run)
> +		write_fetch_head = 0;
> +
>   	if (all) {
>   		if (argc == 1)
>   			die(_("fetch --all does not take a repository argument"));
> diff --git a/builtin/pull.c b/builtin/pull.c
> index 8159c5d7c9..e988d92b53 100644
> --- a/builtin/pull.c
> +++ b/builtin/pull.c
> @@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
>   	struct argv_array args = ARGV_ARRAY_INIT;
>   	int ret;
>   
> -	argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
> +	argv_array_pushl(&args, "fetch", "--update-head-ok",
> +			 "--write-fetch-head", NULL);
>   
>   	/* Shared options */
>   	argv_push_verbosity(&args);
> diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> index a66dbe0bde..3052c2d8d5 100755
> --- a/t/t5510-fetch.sh
> +++ b/t/t5510-fetch.sh
> @@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
>   
>   '
>   
> -test_expect_success 'fetch --dry-run' '
> -
> +test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
>   	rm -f .git/FETCH_HEAD &&
>   	git fetch --dry-run . &&
>   	! test -f .git/FETCH_HEAD
>   '
>   
> +test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
> +	rm -f .git/FETCH_HEAD &&
> +	git fetch --no-write-fetch-head . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success '--write-fetch-head gets defeated by --dry-run' '
> +	rm -f .git/FETCH_HEAD &&
> +	git fetch --dry-run --write-fetch-head . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=no fetch . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
> +	test -f .git/FETCH_HEAD
> +'
> +
>   test_expect_success "should be able to fetch with duplicate refspecs" '
>   	mkdir dups &&
>   	(
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update
  2020-07-29  9:12                 ` Phillip Wood
@ 2020-07-29  9:17                   ` Phillip Wood
  0 siblings, 0 replies; 164+ messages in thread
From: Phillip Wood @ 2020-07-29  9:17 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: Taylor Blau, Đoàn Trần Công Danh,
	Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 29/07/2020 10:12, Phillip Wood wrote:
> On 28/07/2020 17:37, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>>>> Derrick Stolee <stolee@gmail.com> writes:
>>>>
>>>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>>>> that branch is based on v2.28.0-rc0, which is recent enough.
>>>>
>>>> I do not think it is wise to base a work on top of unfinished "you
>>>> could do it this way, perhaps?" demonstration patch the original
>>>> author does not have much inclination to finish, though.
>>>>
>>>> When I am really bored, I may go back to the topic to finish it, but
>>>> I wouldn't mind if you took ownership of it at all.
>>>
>>> Ah. I didn't understand the status of that branch. I'll pull it in
>>> to this topic.
>>
>> So here is with one of the two things that I found missing in the
>> first iteration of the patch: documentation.
>>
>> The other thing that I found iffy (and still missing from this
>> version) was what should be done when "git pull" is explicitly given
>> the "--no-write-fetch-head" option.
>>
>> I think (but didn't check the recent code) that 'git pull' would
>> pass only known-to-make-sense command line options to underlying
>> 'git fetch', so it probably will barf with "unknown option", which
>> is the best case.  We might want to make it sure with a new test in
>> 5521.  On the other hand, if we get anything other than "no such
>> option", we may want to think if we want to "fix" it or just leave
>> it inside "if it hurts, don't do it" territory.
>>
>> Thanks.
>>
>> The patch without doc was Reviewed-by: Taylor Blau <me@ttaylorr.com>
>> but this round has not been.
>>
>> -- >8 --
>>
>> If you run fetch but record the result in remote-tracking branches,
>> and either if you do nothing with the fetched refs (e.g. you are
>> merely mirroring) or if you always work from the remote-tracking
>> refs (e.g. you fetch and then merge origin/branchname separately),
>> you can get away with having no FETCH_HEAD at all.
>>
>> Teach "git fetch" a command line option "--[no-]write-fetch-head"
>> and "fetch.writeFetchHEAD" configuration variable.  Without either,
>> the default is to write FETCH_HEAD, and the usual rule that the
>> command line option defeats configured default applies.
>>
>> Note that under "--dry-run" mode, FETCH_HEAD is never written;
>> otherwise you'd see list of objects in the file that you do not
>> actually have.  Passing `--fetch-write-head` 
> 
> Typo, it should be `--write-fetch-head`
> 
>> does not force `git
>> fetch` to write the file.
>>
>> Also note that this option is explicitly passed when "git pull"
>> internally invokes "git fetch", so that those who configured their
>> "git fetch" not to write FETCH_HEAD would not be able to break the
>> cooperation between these two commands.  "git pull" must see what
>> "git fetch" got recorded in FETCH_HEAD to work correctly.
>>
>> Signed-off-by: Junio C Hamano <gitster@pobox.com>
>> ---
>>   Documentation/config/fetch.txt  |  7 ++++++
>>   Documentation/fetch-options.txt | 10 +++++++++
>>   builtin/fetch.c                 | 19 +++++++++++++---
>>   builtin/pull.c                  |  3 ++-
>>   t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
>>   5 files changed, 72 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/config/fetch.txt 
>> b/Documentation/config/fetch.txt
>> index b20394038d..0aaa05e8c0 100644
>> --- a/Documentation/config/fetch.txt
>> +++ b/Documentation/config/fetch.txt
>> @@ -91,3 +91,10 @@ fetch.writeCommitGraph::
>>       merge and the write may take longer. Having an updated commit-graph
>>       file helps performance of many Git commands, including `git 
>> merge-base`,
>>       `git push -f`, and `git log --graph`. Defaults to false.
>> +
>> +fetch.writeFetchHEAD::
>> +    Setting it to false tells `git fetch` not to write the list
>> +    of remote refs fetched in the `FETCH_HEAD` file directly
>> +    under `$GIT_DIR`.  Can be countermanded from the command
>> +    line with the `--[no-]write-fetch-head` option.  Defaults to
>> +    true.
>> diff --git a/Documentation/fetch-options.txt 
>> b/Documentation/fetch-options.txt
>> index 6e2a160a47..6775e8499f 100644
>> --- a/Documentation/fetch-options.txt
>> +++ b/Documentation/fetch-options.txt
>> @@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
>>   --dry-run::
>>       Show what would be done, without making any changes.
>> +ifndef::git-pull[]
>> +--[no-]write-fetch-head::
>> +    Write the list of remote refs fetched in the `FETCH_HEAD`
>> +    file directly under `$GIT_DIR`.  This is the default unless
>> +    the configuration variable `fetch.writeFetchHEAD` is set to
>> +    false.  Passing `--no-write-fetch-head` from the command
>> +    line tells Git not to write the file.  Under `--dry-run`
>> +    option, the file is never written.
>> +endif::git-pull[]
>> +
>>   -f::
>>   --force::
>>       When 'git fetch' is used with `<src>:<dst>` refspec it may
>> diff --git a/builtin/fetch.c b/builtin/fetch.c
>> index 82ac4be8a5..3ccf69753f 100644
>> --- a/builtin/fetch.c
>> +++ b/builtin/fetch.c
>> @@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
>>   #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
>>   static int all, append, dry_run, force, keep, multiple, update_head_ok;
>> +static int write_fetch_head = 1;
>>   static int verbosity, deepen_relative, set_upstream;
>>   static int progress = -1;
>>   static int enable_auto_gc = 1;
>> @@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const 
>> char *v, void *cb)
>>           return 0;
>>       }
>> +    if (!strcmp(k, "fetch.writefetchhead")) {
>> +        write_fetch_head = git_config_bool(k, v);
>> +        return 0;
>> +    }
>>       return git_default_config(k, v, cb);
>>   }
>> @@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
>>               PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
>>       OPT_BOOL(0, "dry-run", &dry_run,
>>            N_("dry run")),
>> +    OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
>> +         N_("write fetched references to the FETCH_HEAD file")),
>>       OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
>>       OPT_BOOL('u', "update-head-ok", &update_head_ok,
>>               N_("allow updating of HEAD ref")),
>> @@ -893,7 +900,9 @@ static int store_updated_refs(const char *raw_url, 
>> const char *remote_name,
>>       const char *what, *kind;
>>       struct ref *rm;
>>       char *url;
>> -    const char *filename = dry_run ? "/dev/null" : 
>> git_path_fetch_head(the_repository);
>> +    const char *filename = (!write_fetch_head
>> +                ? "/dev/null"
>> +                : git_path_fetch_head(the_repository));
> 
> I was suspicious of this as we haven't cleared write_fetch_head in the 
> --dry-run case yet but the test below seems to show that we still don't 
> write FETCH_HEAD if --dry-run is given. That makes me wonder what the 
> point of setting the filename based on the value of write_fetch_head is.

Sorry ignore that. I misread the hunk header - we have in fact cleared 
write_fetch_head in the --dry-run case by the time we get here.

> The rest looks good to me, though it might be worth having a test to 
> check that pull does indeed reject --no-write-fetch-head
> 
> Best Wishes
> 
> Phillip
> 
>>       int want_status;
>>       int summary_width = transport_summary_width(ref_map);
>> @@ -1327,7 +1336,7 @@ static int do_fetch(struct transport *transport,
>>       }
>>       /* if not appending, truncate FETCH_HEAD */
>> -    if (!append && !dry_run) {
>> +    if (!append && write_fetch_head) {
>>           retcode = truncate_fetch_head();
>>           if (retcode)
>>               goto cleanup;
>> @@ -1594,7 +1603,7 @@ static int fetch_multiple(struct string_list 
>> *list, int max_children)
>>       int i, result = 0;
>>       struct argv_array argv = ARGV_ARRAY_INIT;
>> -    if (!append && !dry_run) {
>> +    if (!append && write_fetch_head) {
>>           int errcode = truncate_fetch_head();
>>           if (errcode)
>>               return errcode;
>> @@ -1795,6 +1804,10 @@ int cmd_fetch(int argc, const char **argv, 
>> const char *prefix)
>>       if (depth || deepen_since || deepen_not.nr)
>>           deepen = 1;
>> +    /* FETCH_HEAD never gets updated in --dry-run mode */
>> +    if (dry_run)
>> +        write_fetch_head = 0;
>> +
>>       if (all) {
>>           if (argc == 1)
>>               die(_("fetch --all does not take a repository argument"));
>> diff --git a/builtin/pull.c b/builtin/pull.c
>> index 8159c5d7c9..e988d92b53 100644
>> --- a/builtin/pull.c
>> +++ b/builtin/pull.c
>> @@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char 
>> **refspecs)
>>       struct argv_array args = ARGV_ARRAY_INIT;
>>       int ret;
>> -    argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
>> +    argv_array_pushl(&args, "fetch", "--update-head-ok",
>> +             "--write-fetch-head", NULL);
>>       /* Shared options */
>>       argv_push_verbosity(&args);
>> diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
>> index a66dbe0bde..3052c2d8d5 100755
>> --- a/t/t5510-fetch.sh
>> +++ b/t/t5510-fetch.sh
>> @@ -539,13 +539,48 @@ test_expect_success 'fetch into the current 
>> branch with --update-head-ok' '
>>   '
>> -test_expect_success 'fetch --dry-run' '
>> -
>> +test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
>>       rm -f .git/FETCH_HEAD &&
>>       git fetch --dry-run . &&
>>       ! test -f .git/FETCH_HEAD
>>   '
>> +test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git fetch --no-write-fetch-head . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success '--write-fetch-head gets defeated by --dry-run' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git fetch --dry-run --write-fetch-head . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=no fetch . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
>> +    test -f .git/FETCH_HEAD
>> +'
>> +
>>   test_expect_success "should be able to fetch with duplicate refspecs" '
>>       mkdir dups &&
>>       (
>>
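
The precedence described in the commit message above (built-in default,
then configuration, then command line, with --dry-run overriding
everything) can be sketched as a small standalone C function. This is
only an illustration, not the code in builtin/fetch.c: the helper name
resolve_write_fetch_head() and the "-1 means not given" convention are
assumptions made for the example.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): decide whether
 * FETCH_HEAD gets written.
 *
 *   config_value: -1 if fetch.writeFetchHEAD is unset, else 0 or 1
 *   cli_value:    -1 if neither --write-fetch-head nor
 *                 --no-write-fetch-head was given, else 0 or 1
 *   dry_run:      true under --dry-run
 */
static bool resolve_write_fetch_head(int config_value, int cli_value,
				     bool dry_run)
{
	int result = 1;			/* built-in default: write */

	if (config_value >= 0)
		result = config_value;	/* configuration beats the default */
	if (cli_value >= 0)
		result = cli_value;	/* command line beats configuration */
	if (dry_run)
		result = 0;		/* --dry-run defeats everything */

	return result != 0;
}
```

Each of the new tests in t5510 corresponds to one combination of these
three inputs; for instance "fetch.writeFetchHEAD=no" together with
"--write-fetch-head" maps to resolve_write_fetch_head(0, 1, false).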

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 00/18] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (17 preceding siblings ...)
  2020-07-23 17:56   ` [PATCH v2 18/18] maintenance: add trace2 regions for task execution Derrick Stolee via GitGitGadget
@ 2020-07-29 22:03   ` Emily Shaffer
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
  19 siblings, 0 replies; 164+ messages in thread
From: Emily Shaffer @ 2020-07-29 22:03 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:22PM +0000, Derrick Stolee via GitGitGadget wrote:
> UPDATES in V2
> =============
> 
> I'm sending this between v2.28.0-rc2 and v2.28.0 as the release things have
> become a bit quiet.
> 
>  * The biggest disruption to the range-diff is that I removed the premature
>    use of struct repository *r and instead continue to rely on 
>    the_repository. This means several patches were dropped that did prep
>    work in builtin/gc.c.
>    
>    
>  * I dropped the task hashmap and opted for a linear scan. This task list
>    will always be too small to justify the extra complication of the
>    hashmap.
>    
>    
>  * struct maintenance_opts is properly static now.
>    
>    
>  * Some tasks are renamed: fetch -> prefetch, pack-files ->
>    incremental-repack.
>    
>    
>  * With the rename, the prefetch task uses refs/prefetch/ instead of 
>    refs/hidden/.
>    
>    
>  * A trace2 region around the task executions is added.
>    
>    
> 
> Thanks, -Stolee

FYI: We covered this series in review club (me, Jonathan Tan, Jonathan
Nieder) so I'll send some mails with comments from all three of us.

We noticed reviewing that it looks like it's time to focus on the
details because the big picture of this series looks good. Since we were
able to review each patch independently, the approach you took to
separate the commits seems effective.

One concern I (Emily) had was about the use of midx without necessarily
checking the config, where users can opt out of using it. What are the
consequences of running maintenance that creates a multipack index
without the config turned on for the rest of Git?

 - Emily

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 01/18] maintenance: create basic maintenance runner
  2020-07-23 17:56   ` [PATCH v2 01/18] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
  2020-07-25  1:26     ` Taylor Blau
  2020-07-25  1:47     ` Đoàn Trần Công Danh
@ 2020-07-29 22:19     ` Jonathan Nieder
  2020-07-30 13:12       ` Derrick Stolee
  2 siblings, 1 reply; 164+ messages in thread
From: Jonathan Nieder @ 2020-07-29 22:19 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, peff, congdanhqx,
	phillip.wood123, emilyshaffer, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

Hi,

Derrick Stolee wrote:

> [Subject: maintenance: create basic maintenance runner]

Seems sensible, and a good way to set up for the later patches.  Let's
take a look at how it does that.

[...]
> --- /dev/null
> +++ b/Documentation/git-maintenance.txt
> @@ -0,0 +1,57 @@
[...]
> +SUBCOMMANDS
> +-----------
> +
> +run::
> +	Run one or more maintenance tasks.

[jrnieder] How do I supply the tasks on the command line?  Are they
parameters to this subcommand?  If so, it could make sense for this to
say something like

	run <task>...::

What is the exit code convention for "git maintenance run"?  (Many Git
commands don't document this, but since we're making a new command
it seems worth building the habit.)

[...]
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
[...]
> +static struct maintenance_opts {
> +	int auto_flag;
> +} opts;

Packing this in a struct feels a bit unusual.  Is the struct going to
be passed somewhere, or could these be individual locals in
cmd_maintenance?

[...]
> +
> +static int maintenance_task_gc(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "gc", NULL);
> +
> +	if (opts.auto_flag)
> +		argv_array_pushl(&cmd, "--auto", NULL);

These are both pushing single strings, so they can use argv_array_push.

[...]
> --- /dev/null
> +++ b/t/t7900-maintenance.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +
> +test_description='git maintenance builtin'
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0

Why does this disable commit graph and multipack index?  Is that setting
up for something that comes later?

[...]
> +test_expect_success 'gc [--auto]' '
> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
> +	grep ",\"gc\"]" run-no-auto.txt  &&
> +	grep ",\"gc\",\"--auto\"]" run-auto.txt

This feels a little odd in two ways:

- the use of "git gc" isn't a user-facing detail, so this is testing
  implementation instead of the user-facing behavior.  That's okay,
  though --- it can be useful to test internals sometimes.

- the way that this tests for "git gc" feels brittle: if the trace
  emitter changes some day to include a space after the comma, for
  example, then the test would start failing.  I thought that in the
  spirit of fakes, it could make sense to write a custom "git gc"
  command using test_write_script, but that isn't likely to work
  either since gc is a builtin.

Perhaps this is suggesting that we need some central test helper for
parsing traces so we can do this reliably in one place.  What do you
think?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 04/18] maintenance: initialize task array
  2020-07-23 17:56   ` [PATCH v2 04/18] maintenance: initialize task array Derrick Stolee via GitGitGadget
  2020-07-23 19:57     ` Junio C Hamano
@ 2020-07-29 22:19     ` Emily Shaffer
  1 sibling, 0 replies; 164+ messages in thread
From: Emily Shaffer @ 2020-07-29 22:19 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:26PM +0000, Derrick Stolee via GitGitGadget wrote:
> +static void initialize_tasks(void)
> +{
> +	int i;
> +	num_tasks = 0;
> +
> +	for (i = 0; i < MAX_NUM_TASKS; i++)
> +		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));

[jonathan tan] I wonder why this is an array of pointers, not of
objects? If they were objects, we could use an initializer. But in a
later commit I see the array is sorted, OK.
> +
> +	tasks[num_tasks]->name = "gc";
> +	tasks[num_tasks]->fn = maintenance_task_gc;
> +	tasks[num_tasks]->enabled = 1;
> +	num_tasks++;

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 09/18] maintenance: add loose-objects task
  2020-07-23 17:56   ` [PATCH v2 09/18] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
  2020-07-23 20:59     ` Junio C Hamano
@ 2020-07-29 22:21     ` Emily Shaffer
  2020-07-30 15:38       ` Derrick Stolee
  1 sibling, 1 reply; 164+ messages in thread
From: Emily Shaffer @ 2020-07-29 22:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:31PM +0000, Derrick Stolee via GitGitGadget wrote:
> +loose-objects::
> +	The `loose-objects` job cleans up loose objects and places them into
> +	pack-files. In order to prevent race conditions with concurrent Git
> +	commands, it follows a two-step process. First, it deletes any loose
> +	objects that already exist in a pack-file; concurrent Git processes
> +	will examine the pack-file for the object data instead of the loose
> +	object. Second, it creates a new pack-file (starting with "loose-")

[jonathan tan + jonathan nieder] If you are going to document this,
probably it should also be tested, so the documentation does not become
stale. Or, just don't document it.

> +static int pack_loose(void)
> +{
> +	struct repository *r = the_repository;
> +	int result = 0;
> +	struct write_loose_object_data data;
> +	struct strbuf prefix = STRBUF_INIT;
> +	struct child_process *pack_proc;
> +
> +	/*
> +	 * Do not start pack-objects process
> +	 * if there are no loose objects.
> +	 */
> +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
> +					   loose_object_exists,
> +					   NULL, NULL, NULL))

[emily] To me, this is unintuitive - but upon inspection, it's exiting
the foreach early if any loose object is found, so this is cheaper than
actually counting. Maybe a comment would help to understand? Or we could
name the function differently, like "bail_if_loose()" or something?
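
The early-exit pattern discussed above can be shown in isolation. The
sketch below assumes an iterator that stops as soon as its callback
returns nonzero and passes that value back to the caller, which is how
the quoted check reads. All names here (for_each_item, item_exists,
has_any_item) are hypothetical stand-ins, not Git's object-store API.

```c
#include <stddef.h>

typedef int (*each_item_fn)(const char *item, void *data);

/*
 * Call fn for each item; stop at the first nonzero return value
 * and hand it back to the caller. Returns 0 if every call
 * returned 0 or if there were no items at all.
 */
static int for_each_item(const char **items, size_t nr,
			 each_item_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(items[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Callback that aborts the iteration on the very first item. */
static int item_exists(const char *item, void *data)
{
	size_t *visited = data;

	(void)item;		/* the item itself is irrelevant */
	if (visited)
		(*visited)++;	/* count callback invocations */
	return 1;
}

/* "Is there at least one item?" without visiting them all. */
static int has_any_item(const char **items, size_t nr, size_t *visited)
{
	return for_each_item(items, nr, item_exists, visited) != 0;
}
```

With three items the callback runs once and has_any_item() reports
true; with zero items it never runs and the result is false. That is
the sense in which a "!for_each_...()" test means "nothing found", and
a wrapper with a descriptive name may read more clearly than the bare
negation.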

> +test_expect_success 'loose-objects task' '
> +	# Repack everything so we know the state of the object dir
> +	git repack -adk &&
> +
> +	# Hack to stop maintenance from running during "git commit"
> +	echo in use >.git/objects/maintenance.lock &&
> +	test_commit create-loose-object &&

[jonathan nieder] Does it make sense to use a different git command
which is guaranteed to make a loose object? Is 'git commit' futureproof,
if we decide commits should directly create packs in the future? 'git
unpack-objects' is guaranteed to make a loose object, although it is
clumsy because it needs a packfile to begin with...

[jonathan tan] But, using 'git commit' is easier to understand in this
context. Maybe commenting to say that we assume 'git commit' makes 1 or
more loose objects will be enough to futureproof - then we have a signal
to whoever made a change to make this fail, and that person knows how to
fix this test.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 10/18] maintenance: add incremental-repack task
  2020-07-23 17:56   ` [PATCH v2 10/18] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
  2020-07-23 22:00     ` Junio C Hamano
@ 2020-07-29 22:22     ` Emily Shaffer
  1 sibling, 0 replies; 164+ messages in thread
From: Emily Shaffer @ 2020-07-29 22:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:32PM +0000, Derrick Stolee via GitGitGadget wrote:
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.

[emily shaffer] Will this use midx even if the user has disabled it in
their config?

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index 94bb493733..3ec813979a 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -103,4 +103,41 @@ test_expect_success 'loose-objects task' '
>  	test_cmp packs-between packs-after
>  '

[emily shaffer] Can we include a test to prove that this task is or is
not run if core.multipackindex is set or unset? That behavior is hard to
deduce from the code... we might want to be cautious.

>  
> +test_expect_success 'incremental-repack task' '
> +	packDir=.git/objects/pack &&
> +	for i in $(test_seq 1 5)
> +	do
> +		test_commit $i || return 1
> +	done &&
> +
> +	# Create three disjoint pack-files with size BIG, small, small.
> +	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
> +	test_tick &&
> +	git pack-objects --revs $packDir/test-2 <<-\EOF &&
> +	HEAD~1
> +	^HEAD~2
> +	EOF
> +	test_tick &&
> +	git pack-objects --revs $packDir/test-3 <<-\EOF &&
> +	HEAD
> +	^HEAD~1
> +	EOF
> +	rm -f $packDir/pack-* &&
> +	rm -f $packDir/loose-* &&
> +	ls $packDir/*.pack >packs-before &&
> +	test_line_count = 3 packs-before &&
> +
> +	# the job repacks the two into a new pack, but does not
> +	# delete the old ones.
> +	git maintenance run --task=incremental-repack &&
> +	ls $packDir/*.pack >packs-between &&
> +	test_line_count = 4 packs-between &&
> +
> +	# the job deletes the two old packs, and does not write
> +	# a new one because only one pack remains.
> +	git maintenance run --task=incremental-repack &&
> +	ls .git/objects/pack/*.pack >packs-after &&
> +	test_line_count = 1 packs-after
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-23 17:56   ` [PATCH v2 11/18] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
  2020-07-23 22:15     ` Junio C Hamano
@ 2020-07-29 22:23     ` Emily Shaffer
  2020-07-30 16:57       ` Derrick Stolee
  1 sibling, 1 reply; 164+ messages in thread
From: Emily Shaffer @ 2020-07-29 22:23 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/builtin/gc.c b/builtin/gc.c
> index eb4b01c104..889d97afe7 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>  	return result;
>  }
>  
> +#define TWO_GIGABYTES (2147483647)

[jonathan tan] This would be easier to understand if it was expressed
with bitshift, e.g. 1 << 31

> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
[jonathan tan] This looks like it's never used... and vulnerable to
cross-platform size changes because it's referring to an implicitly
sized int, and could behave differently if it was put into a larger
size, e.g. you wouldn't get 0xFFFF.. if you assigned this into a long
long.
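
Both remarks about the constants can be checked with a short sketch. It
models a platform with a 32-bit unsigned long by using uint32_t (an
assumption for illustration; the function names are hypothetical). It
also uses "1UL << 31" rather than a plain "1 << 31", since shifting a
1 into the sign bit of a 32-bit int is undefined behavior in C.

```c
#include <stdint.h>

/*
 * Models "(unsigned long)-1" on a platform where unsigned long
 * is 32 bits wide, then stores it in a 64-bit variable.
 */
static uint64_t widen_unset_marker_32(void)
{
	uint32_t unset = (uint32_t)-1;	/* all 32 bits set */

	return unset;	/* value is preserved: 0x00000000ffffffff */
}

/* 2 GiB written as a shift; unsigned long has at least 32 bits. */
static unsigned long two_gib(void)
{
	return 1UL << 31;
}
```

The widened marker is 0xffffffff rather than an all-ones 64-bit value,
so a comparison against it only works at the original width. Note also
that 2147483647, the value in the quoted #define, is one less than
1UL << 31.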

> +	for (p = get_all_packs(r); p; p = p->next) {
> +		if (p->pack_size > max_size) {
> +			second_largest_size = max_size;
> +			max_size = p->pack_size;
> +		} else if (p->pack_size > second_largest_size)
> +			second_largest_size = p->pack_size;
> +	}
> +
> +	result_size = second_largest_size + 1;
[jonathan tan] What happens when there's only one packfile, and when
there are two packfiles? Can we write tests to illustrate the behavior?
The edge case here (result_size=1) is hard to understand by reading the
code.
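
To make the edge cases concrete, here is a standalone sketch of the
same selection logic over a plain array of sizes. It is not the code
from the patch: auto_batch_size() and BATCH_SIZE_CAP are hypothetical
names, and the clamp at the end is an assumption based on the "limit
ourselves" comment quoted below.

```c
#include <stddef.h>
#include <stdint.h>

/* Cap taken from the value of the quoted TWO_GIGABYTES define. */
#define BATCH_SIZE_CAP UINT64_C(2147483647)

/*
 * Return one more than the second-largest size, clamped to the
 * cap. With zero or one pack, the second-largest size stays 0,
 * so the result is 1.
 */
static uint64_t auto_batch_size(const uint64_t *pack_sizes, size_t nr)
{
	uint64_t max_size = 0, second_largest_size = 0, result_size;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pack_sizes[i] > max_size) {
			second_largest_size = max_size;
			max_size = pack_sizes[i];
		} else if (pack_sizes[i] > second_largest_size) {
			second_largest_size = pack_sizes[i];
		}
	}

	result_size = second_largest_size + 1;
	if (result_size > BATCH_SIZE_CAP)
		result_size = BATCH_SIZE_CAP;	/* assumed clamp */
	return result_size;
}
```

Under these assumptions, no packs and a single pack both give 1, packs
of sizes 100 and 40 give 41, and two packs of equal size 50 give 51.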

> +
> +	/* But limit ourselves to a batch size of 2g */
[emily shaffer] nit: can you please capitalize G, GB, whatever :)

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 01/18] maintenance: create basic maintenance runner
  2020-07-29 22:19     ` Jonathan Nieder
@ 2020-07-30 13:12       ` Derrick Stolee
  2020-07-31  0:30         ` Jonathan Nieder
                           ` (2 more replies)
  0 siblings, 3 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 13:12 UTC (permalink / raw)
  To: Jonathan Nieder, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, peff, congdanhqx,
	phillip.wood123, emilyshaffer, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> Hi,
> 
> Derrick Stolee wrote:
> 
>> [Subject: maintenance: create basic maintenance runner]
> 
> Seems sensible, and a good way to set up for the later patches.  Let's
> take a look at how it does that.
> 
> [...]
>> --- /dev/null
>> +++ b/Documentation/git-maintenance.txt
>> @@ -0,0 +1,57 @@
> [...]
>> +SUBCOMMANDS
>> +-----------
>> +
>> +run::
>> +	Run one or more maintenance tasks.
> 
> [jrnieder] How do I supply the tasks on the command line?  Are they
> parameters to this subcommand?  If so, it could make sense for this to
> say something like
> 
> 	run <task>...::

Hopefully this is documented to your satisfaction when the ability
to customize the tasks is implemented.

> What is the exit code convention for "git maintenance run"?  (Many Git
> commands don't document this, but since we're making a new command
> it seems worth building the habit.)

Is this worth doing? Do we really want every command to document
that "0 means success, everything else means failure, and some of
those exit codes mean a specific kind of failure that is global to
Git"?

> [...]
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
> [...]
>> +static struct maintenance_opts {
>> +	int auto_flag;
>> +} opts;
> 
> Packing this in a struct feels a bit unusual.  Is the struct going to
> be passed somewhere, or could these be individual locals in
> cmd_maintenance?

This will grow, and I'd rather have one global struct than many
individual global items. It makes it clearer when I use
"opts.auto_flag" that this corresponds to whether "--auto" was
provided as a command-line option.

> [...]
>> +
>> +static int maintenance_task_gc(void)
>> +{
>> +	int result;
>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>> +
>> +	argv_array_pushl(&cmd, "gc", NULL);
>> +
>> +	if (opts.auto_flag)
>> +		argv_array_pushl(&cmd, "--auto", NULL);
> 
> These are both pushing single strings, so they can use argv_array_push.

Thanks. I noticed a few of these myself. Luckily, I'll be going
through all of these invocations and replacing them with new
methods soon ;)

[1] https://lore.kernel.org/git/30933a71-3130-5478-cbfd-0ca5bb308cf2@gmail.com/

> [...]
>> --- /dev/null
>> +++ b/t/t7900-maintenance.sh
>> @@ -0,0 +1,22 @@
>> +#!/bin/sh
>> +
>> +test_description='git maintenance builtin'
>> +
>> +GIT_TEST_COMMIT_GRAPH=0
>> +GIT_TEST_MULTI_PACK_INDEX=0
> 
> Why does this disable commit graph and multipack index?  Is that setting
> up for something that comes later?

Yes, these don't need to be here yet.

> [...]
>> +test_expect_success 'gc [--auto]' '
>> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
>> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
>> +	grep ",\"gc\"]" run-no-auto.txt  &&
>> +	grep ",\"gc\",\"--auto\"]" run-auto.txt
> 
> This feels a little odd in two ways:
> 
> - the use of "git gc" isn't a user-facing detail, so this is testing
>   implementation instead of the user-facing behavior.  That's okay,
>   though --- it can be useful to test internals sometimes.

Consider this a "unit test" of the builtin. I'm testing whether the
command-line arguments had an effect on the child process.

> - the way that this tests for "git gc" feels brittle: if the trace
>   emitter changes some day to include a space after the comma, for
>   example, then the test would start failing.  I thought that in the
>   spirit of fakes, it could make sense to write a custom "git gc"
>   command using test_write_script, but that isn't likely to work
>   either since gc is a builtin.
> 
> Perhaps this is suggesting that we need some central test helper for
> parsing traces so we can do this reliably in one place.  What do you
> think?

I'm specifically using GIT_TRACE2_EVENT, which is intended for
consumption by computer, not humans. Adding whitespace or otherwise
changing the output format would be an unnecessary disruption that
is more likely to have downstream effects with external tools.

In some way, adding extra dependencies like this in our own test
suite can help ensure the output format is more stable.

If there is a better way to ask "Did my command call 'git gc' (with
no arguments|with these arguments)?" then I'm happy to consider it.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-23 20:21     ` Junio C Hamano
  2020-07-25  1:33       ` Taylor Blau
@ 2020-07-30 13:29       ` Derrick Stolee
  2020-07-30 13:31         ` Derrick Stolee
  1 sibling, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 13:29 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/23/2020 4:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +--[no-]maintenance::
>>  --[no-]auto-gc::
>> -	Run `git gc --auto` at the end to perform garbage collection
>> -	if needed. This is enabled by default.
>> +	Run `git maintenance run --auto` at the end to perform garbage
>> +	collection if needed. This is enabled by default.
> 
> Shouldn't the new synonym be called --auto-maintenance or an
> abbreviation thereof?  It is not like we will run the full
> maintenance suite when "--no-maintenance" is omitted, which
> certainly is not the impression we want to give our readers.

Adding "auto-" to the argument can be informative.

I would think that abbreviating the option may make understanding
the argument more difficult for users where English is not their
first language. Of course, I'm a bad judge of that.

I also don't think this option is called often by users at the
command-line and instead is done by scripts and third-party tools.
Abbreviating the word may have less value there.

(This reminds me that it would be good to have a config option that
prevents running this process _at all_. Sure, gc.auto=0 prevents the
command from doing anything, but there is still an extra cost of
starting a process. This is more of a problem on Windows. Making a
note to self here.)

>>  These objects may be removed by normal Git operations (such as `git commit`)
>> -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
>> -If these objects are removed and were referenced by the cloned repository,
>> -then the cloned repository will become corrupt.
>> +which automatically call `git maintenance run --auto` and `git gc --auto`.
> 
> Hmph.  Perhaps the picture may change in the end of the series but I
> got an impression that "gc --auto" would eventually become just part
> of "maintenance --auto" and the users won't have to be even aware of
> its existence?  Wouldn't we rather want to say something like
> 
> 	--[no-]auto-maintenance::
> 	--[no-]auto-gc::
>                 Run `git maintenance run --auto` at the end to perform
>                 garbage collection if needed (`--[no-]auto-gc` is a
>                 synonym).  This is enabled by default.

I don't completely eliminate 'git gc' at the end of this series, but
instead hope that we can peel it apart slowly in follow-up series.
However, you are correct that I should be more careful to obliterate
it from the documentation instead of keeping references around.

>> diff --git a/builtin/fetch.c b/builtin/fetch.c
>> index 82ac4be8a5..49a4d727d4 100644
>> --- a/builtin/fetch.c
>> +++ b/builtin/fetch.c
>> @@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
>>  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
>>  			N_("report that we have only objects reachable from this object")),
>>  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
> 
>> +	OPT_BOOL(0, "maintenance", &enable_auto_gc,
>> +		 N_("run 'maintenance --auto' after fetching")),
>>  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
>> +		 N_("run 'maintenance --auto' after fetching")),
> 
> OK, so this is truly a backward-compatible synonym at this point.
> 
>> diff --git a/run-command.c b/run-command.c
>> index 9b3a57d1e3..82ad241638 100644
>> --- a/run-command.c
>> +++ b/run-command.c
>> @@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
>>  	return result;
>>  }
>>  
>> -int run_auto_gc(int quiet)
>> +int run_auto_maintenance(int quiet)
>>  {
>>  	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
>>  	int status;
>>  
>> -	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
>> +	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
>>  	if (quiet)
>>  		argv_array_push(&argv_gc_auto, "--quiet");
>> +	else
>> +		argv_array_push(&argv_gc_auto, "--no-quiet");
>> +
>>  	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
>>  	argv_array_clear(&argv_gc_auto);
>>  	return status;
> 
> Don't we want to replace all _gc_ with _maintenance_ in this
> function?  I think the first business before we can do so would be
> to rethink if spelling out "maintenance" fully in code is a good
> idea in the first place.  It would make names for variables,
> structures and fields unnecessarily long without contributing to
> ease of understanding an iota, and an easy-to-remember short-form or
> an abbreviation may be needed.  Using a short-form/abbreviation
> wouldn't worsen the end-user experience, and not the developer
> experience for that matter.
> 
> If we choose "gc" as the short-hand, most of the change in this step
> would become unnecessary.  I also do not mind if some other word
> or word-fragment (perhaps "maint"???) is chosen.

Yes, I should have noticed that. Also, with Peff's feedback from
another thread, the method can look a bit simpler this way:

int run_auto_maintenance(int quiet)
{
	struct child_process maint = CHILD_PROCESS_INIT;
	maint.git_cmd = 1;

	argv_array_pushl(&maint.argv, "maintenance", "run", "--auto", NULL);
	if (quiet)
		argv_array_push(&maint.argv, "--quiet");
	else
		argv_array_push(&maint.argv, "--no-quiet");

	return run_command(&maint.argv);
}

(I will update it again to work on Peff's argv_array work, but
hopefully it is clear how this is simpler.)

Thanks,
-Stolee


* Re: [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-30 13:29       ` Derrick Stolee
@ 2020-07-30 13:31         ` Derrick Stolee
  2020-07-30 19:00           ` Eric Sunshine
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 13:31 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/30/2020 9:29 AM, Derrick Stolee wrote:
> On 7/23/2020 4:21 PM, Junio C Hamano wrote:
>> Don't we want to replace all _gc_ with _maintenance_ in this
>> function?  I think the first business before we can do so would be
>> to rethink if spelling out "maintenance" fully in code is a good
>> idea in the first place.  It would make names for variables,
>> structures and fields unnecessarily long without contributing to
>> ease of understanding an iota, and an easy-to-remember short-form or
>> an abbreviation may be needed.  Using a short-form/abbreviation
>> wouldn't worsen the end-user experience, and not the developer
>> experience for that matter.
>>
>> If we choose "gc" as the short-hand, most of the change in this step
>> would become unnecessary.  I also do not mind if some other word
>> or word-fragment (perhaps "maint"???) is chosen.
> 
> Yes, I should have noticed that. Also, with Peff's feedback from
> another thread, the method can look a bit simpler this way:

It would help if I actually _compile_ code before sending it.
Here is the fixed version:

int run_auto_maintenance(int quiet)
{
	struct child_process maint = CHILD_PROCESS_INIT;
	maint.git_cmd = 1;

	argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
	if (quiet)
		argv_array_push(&maint.args, "--quiet");
	else
		argv_array_push(&maint.args, "--no-quiet");

	return run_command(&maint);
}



* Re: [PATCH v2 05/18] maintenance: add commit-graph task
  2020-07-25  1:52           ` Taylor Blau
@ 2020-07-30 13:59             ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 13:59 UTC (permalink / raw)
  To: Taylor Blau, Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, congdanhqx, phillip.wood123,
	emilyshaffer, sluongng, jonathantanmy, Derrick Stolee,
	Derrick Stolee

On 7/24/2020 9:52 PM, Taylor Blau wrote:
> On Fri, Jul 24, 2020 at 12:47:00PM -0700, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> But you are discussing here how the _behavior_ can change when
>>> --auto is specified. And specifically, "git gc --auto" really
>>> meant "This is running after a foreground command, so only do
>>> work if necessary and do it quickly to minimize blocking time."
>>>
>>> I'd be happy to replace "--auto" with "--quick" in the
>>> maintenance builtin.
>>>
>>> This opens up some extra design space for how the individual
>>> tasks perform depending on "--quick" being specified or not.
>>> My intention was to create tasks that are already in "quick"
>>> mode:
>>>
>>> * loose-objects have a maximum batch size.
>>> * incremental-repack is capped in size.
>>> * commit-graph uses the --split option.
>>>
>>> But this "quick" distinction might be important for some of
>>> the tasks we intend to extract from the gc builtin.
>>
>> Yup.  To be honest, I came to this topic from a completely different
>> direction.  The field name "auto" alone (and no other field name)
>> had to have an extra cruft (i.e. "_flag") attached to it, which is
>> understandable but ugly.  Then I started thinking if 'auto(matic)'
>> is really the right word to describe what we want out of the option,
>> and came to the realization that there may be better words.
> 
> I wonder what the quick and slow paths are here. For the commit-graph
> code, what you wrote here seems to match what I'd expect with passing
> '--auto' in the sense of running 'git gc'. That is, I'm leaving it up to
> the commit-graph machinery's idea of the normal '--split' rules to
> figure out when to roll up layers of a commit-graph, as opposed to
> creating a new layer and extending the chain.

I had intended all of my new tasks to be the "quick" version of their
operations. The "slow" version would give up on doing only a small
amount of work and instead aim for the best possible state of the
repository. This would include:

 * The commit-graph would collapse all layers into one file.
 * The multi-pack-index repack would rewrite all object data into one
   pack-file.
 * The loose-objects task would not stop at a maximum number of loose
   objects (and would probably want to repack everything, anyway).

I'm open to making this distinction more explicit by renaming "--auto"
and simply translating the new option to 'git gc --auto'. So, what
should the name be? Here are a few options to consider:

 --quick
 --fast
 --limited
 --incremental
 -O[0|1|2...] (think GCC optimization flags, exposing granularity)
 --[non-]aggressive

Regardless, this makes me think again that the --[no-]maintenance option
from PATCH 03/18 is better than --[no-]auto-maintenance, since we are
really saying "run _some_ maintenance or _no_ maintenance" and the "how"
of the maintenance is left intentionally vague. I've already made the
change locally to add "auto-", so I'll wait for confirmation before
reverting that change.

> So, I think that makes sense if the caller gave '--auto'. But, I'm not
> sure that it makes sense if they didn't, in which case I'd imagine
> something quicker to happen. There, I'd expect something more like:
> 
>   1. Run 'git commit-graph write --reachable --split=no-merge'.
>   2. Run 'git commit-graph verify'.
>   3. If 'git commit-graph verify' failed, drop the existing commit graph
>      and rebuild it with 'git commit-graph write --reachable --split=replace'.
>   4. Otherwise, do nothing.
> 
> I'm biased, of course, but I think that that matches roughly what I'd
> expect to happen in the fast/slow path. Granted, the steps to rebuild
> the commit graph are going to be slow no matter what (depending on the
> size of the repository), and so in that case maybe the commit-graph
> should just be dropped. I'm not really sure what to do about that...

I think this approach is the best we can do given the current behavior
inside the commit-graph builtin. Perhaps in the future we could change
the commit-graph builtin to include a "--verify" option so it could do
the "git commit-graph verify --shallow" on the new layer before committing
the new commit-graph-chain file and expiring old layers. That way, we
would not need to delete and rewrite the whole thing when there is a
problem writing the top layer.

>>> Since the tasks are frequently running subcommands, returning
>>> 0 for success and non-zero for error matches the error codes
>>> returned by those subcommands.
>>
>> As long as these will _never_ be called from other helper functions
>> but from the cmd_foo() top-level and their return values are only
>> used directly as the top-level's return value, I do not mind too
>> much.
>>
>> But whenever I am writing such a code, I find myself not brave
>> enough to make such a bold promise (I saw other people call the
>> helpers I wrote in unintended ways and had to adjust the semantics
>> of them to accommodate the new callers too many times), so I'd rather
>> see the caller do "return !!helper_fn()" to allow helper_fn() to be
>> written more naturally (e.g. letting them return error(...)).

I will try to be consistent here with the behavior:

 * 0 is success
 * 1 is failure

Which is what I think you are implying by "return !!helper_fn()".
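
To illustrate the convention, here is a minimal stand-alone sketch (not
Git code; helper_fn() and cmd_example() are hypothetical names). The
helper is free to return any non-zero value on failure, for example the
-1 that an error()-style function returns, and the top-level caller
collapses that to 0 or 1 with "!!":

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 0 on success and a negative value on
 * failure, in the style of a function that ends with "return error(...)".
 */
static int helper_fn(int should_fail)
{
	if (should_fail) {
		fprintf(stderr, "error: helper failed\n");
		return -1;
	}
	return 0;
}

/*
 * Hypothetical top-level command: "!!" maps 0 to 0 and any non-zero
 * value (here -1) to 1, so the exit status is always 0 or 1.
 */
static int cmd_example(int should_fail)
{
	return !!helper_fn(should_fail);
}
```

With this shape, cmd_example(0) yields 0 and cmd_example(1) yields 1,
regardless of which non-zero value the helper picked.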

Thanks,
-Stolee


* Re: [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update
  2020-07-28 16:37               ` [PATCH v2] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano
  2020-07-29  9:12                 ` Phillip Wood
@ 2020-07-30 15:17                 ` Derrick Stolee
  1 sibling, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 15:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Đoàn Trần Công Danh,
	Derrick Stolee via GitGitGadget, git, Johannes.Schindelin,
	sandals, steadmon, jrnieder, peff, phillip.wood123, emilyshaffer,
	sluongng, jonathantanmy, Derrick Stolee, Derrick Stolee

On 7/28/2020 12:37 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>
>>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>>> that branch is based on v2.28.0-rc0, which is recent enough.
>>>
>>> I do not think it is wise to base a work on top of unfinished "you
>>> could do it this way, perhaps?" demonstration patch the original
>>> author does not have much inclination to finish, though.
>>>
>>> When I am really bored, I may go back to the topic to finish it, but
>>> I wouldn't mind if you took ownership of it at all.
>>
>> Ah. I didn't understand the status of that branch. I'll pull it in
>> to this topic.
> 
> So here is a version with one of the two things that I found missing in the
> first iteration of the patch: documentation.
> 
> The other thing that I found iffy (and still missing from this
> version) was what should be done when "git pull" is explicitly given
> the "--no-write-fetch-head" option.
> 
> I think (but didn't check the recent code) that 'git pull' would
> pass only known-to-make-sense command line options to underlying
> 'git fetch', so it probably will barf with "unknown option", which
> is the best case.  We might want to make it sure with a new test in
> 5521.  On the other hand, if we get anything other than "no such
> option", we may want to think if we want to "fix" it or just leave
> it inside "if it hurts, don't do it" territory.

Here is the version I applied and updated. Please let me know what
you think.

-->8--

From 3f60a0f0fd388447aa9c2b805b5646039df98d0b Mon Sep 17 00:00:00 2001
From: Junio C Hamano <gitster@pobox.com>
Date: Tue, 28 Jul 2020 09:37:59 -0700
Subject: [PATCH] fetch: optionally allow disabling FETCH_HEAD update

If you run fetch but record the result in remote-tracking branches,
and either if you do nothing with the fetched refs (e.g. you are
merely mirroring) or if you always work from the remote-tracking
refs (e.g. you fetch and then merge origin/branchname separately),
you can get away with having no FETCH_HEAD at all.

Teach "git fetch" a command line option "--[no-]write-fetch-head"
and "fetch.writeFetchHEAD" configuration variable.  Without either,
the default is to write FETCH_HEAD, and the usual rule that the
command line option defeats configured default applies.

Note that under "--dry-run" mode, FETCH_HEAD is never written;
otherwise you'd see list of objects in the file that you do not
actually have.  Passing `--write-fetch-head` does not force `git
fetch` to write the file.

Also note that this option is explicitly passed when "git pull"
internally invokes "git fetch", so that those who configured their
"git fetch" not to write FETCH_HEAD would not be able to break the
cooperation between these two commands.  "git pull" must see what
"git fetch" got recorded in FETCH_HEAD to work correctly.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/fetch.txt  |  7 ++++++
 Documentation/fetch-options.txt | 10 +++++++++
 builtin/fetch.c                 | 19 +++++++++++++---
 builtin/pull.c                  |  3 ++-
 t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
 t/t5521-pull-options.sh         | 16 ++++++++++++++
 6 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
index b20394038d1..0aaa05e8c0e 100644
--- a/Documentation/config/fetch.txt
+++ b/Documentation/config/fetch.txt
@@ -91,3 +91,10 @@ fetch.writeCommitGraph::
 	merge and the write may take longer. Having an updated commit-graph
 	file helps performance of many Git commands, including `git merge-base`,
 	`git push -f`, and `git log --graph`. Defaults to false.
+
+fetch.writeFetchHEAD::
+	Setting it to false tells `git fetch` not to write the list
+	of remote refs fetched in the `FETCH_HEAD` file directly
+	under `$GIT_DIR`.  Can be countermanded from the command
+	line with the `--[no-]write-fetch-head` option.  Defaults to
+	true.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 495bc8ab5a1..6972ad2522c 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
 --dry-run::
 	Show what would be done, without making any changes.
 
+ifndef::git-pull[]
+--[no-]write-fetch-head::
+	Write the list of remote refs fetched in the `FETCH_HEAD`
+	file directly under `$GIT_DIR`.  This is the default unless
+	the configuration variable `fetch.writeFetchHEAD` is set to
+	false.  Passing `--no-write-fetch-head` from the command
+	line tells Git not to write the file.  Under `--dry-run`
+	option, the file is never written.
+endif::git-pull[]
+
 -f::
 --force::
 	When 'git fetch' is used with `<src>:<dst>` refspec it may
diff --git a/builtin/fetch.c b/builtin/fetch.c
index d8253f66e4c..1d7194aac26 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok;
+static int write_fetch_head = 1;
 static int verbosity, deepen_relative, set_upstream;
 static int progress = -1;
 static int enable_auto_gc = 1;
@@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.writefetchhead")) {
+		write_fetch_head = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
 		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
+	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
+		 N_("write fetched references to the FETCH_HEAD file")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
 	OPT_BOOL('u', "update-head-ok", &update_head_ok,
 		    N_("allow updating of HEAD ref")),
@@ -895,7 +902,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	const char *what, *kind;
 	struct ref *rm;
 	char *url;
-	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
+	const char *filename = (!write_fetch_head
+				? "/dev/null"
+				: git_path_fetch_head(the_repository));
 	int want_status;
 	int summary_width = transport_summary_width(ref_map);
 
@@ -1329,7 +1338,7 @@ static int do_fetch(struct transport *transport,
 	}
 
 	/* if not appending, truncate FETCH_HEAD */
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		retcode = truncate_fetch_head();
 		if (retcode)
 			goto cleanup;
@@ -1596,7 +1605,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
 	int i, result = 0;
 	struct argv_array argv = ARGV_ARRAY_INIT;
 
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		int errcode = truncate_fetch_head();
 		if (errcode)
 			return errcode;
@@ -1797,6 +1806,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (depth || deepen_since || deepen_not.nr)
 		deepen = 1;
 
+	/* FETCH_HEAD never gets updated in --dry-run mode */
+	if (dry_run)
+		write_fetch_head = 0;
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/builtin/pull.c b/builtin/pull.c
index 8159c5d7c96..e988d92b535 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
 	struct argv_array args = ARGV_ARRAY_INIT;
 	int ret;
 
-	argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
+	argv_array_pushl(&args, "fetch", "--update-head-ok",
+			 "--write-fetch-head", NULL);
 
 	/* Shared options */
 	argv_push_verbosity(&args);
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index 9850ecde5df..31c91d0ed2e 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
 
 '
 
-test_expect_success 'fetch --dry-run' '
-
+test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
 	rm -f .git/FETCH_HEAD &&
 	git fetch --dry-run . &&
 	! test -f .git/FETCH_HEAD
 '
 
+test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success '--write-fetch-head gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --dry-run --write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
+	test -f .git/FETCH_HEAD
+'
+
 test_expect_success "should be able to fetch with duplicate refspecs" '
 	mkdir dups &&
 	(
diff --git a/t/t5521-pull-options.sh b/t/t5521-pull-options.sh
index 159afa7ac81..1acae3b9a4f 100755
--- a/t/t5521-pull-options.sh
+++ b/t/t5521-pull-options.sh
@@ -77,6 +77,7 @@ test_expect_success 'git pull -q -v --no-rebase' '
 	test_must_be_empty out &&
 	test -s err)
 '
+
 test_expect_success 'git pull --cleanup errors early on invalid argument' '
 	mkdir clonedcleanup &&
 	(cd clonedcleanup && git init &&
@@ -85,6 +86,21 @@ test_expect_success 'git pull --cleanup errors early on invalid argument' '
 	test -s err)
 '
 
+test_expect_success 'git pull --no-write-fetch-head fails' '
+	mkdir clonedwfh &&
+	(cd clonedwfh && git init &&
+	test_must_fail git pull --no-write-fetch-head "../parent" >out 2>err &&
+	test_must_be_empty out &&
+	test_i18ngrep "no-write-fetch-head" err)
+'
+
+test_expect_success 'git pull succeeds with fetch.writeFetchHEAD=false' '
+	mkdir clonedwfhconfig &&
+	(cd clonedwfhconfig && git init &&
+	git config fetch.writeFetchHEAD false &&
+	git pull "../parent" >out 2>err &&
+	grep FETCH_HEAD err)
+'
 
 test_expect_success 'git pull --force' '
 	mkdir clonedoldstyle &&
-- 
2.27.0.366.g34746c1d11e.dirty
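
For readers skimming the patch above, the precedence it implements can
be sketched as a single function. This is only an illustration under
stated assumptions: resolve_write_fetch_head() is a hypothetical name,
and the use of -1 for "not given" is an assumption of the sketch (the
patch itself uses one static variable that the config parser and the
option parser overwrite in turn).

```c
/*
 * Hypothetical model of the precedence in the patch:
 *   default (write) < fetch.writeFetchHEAD < --[no-]write-fetch-head,
 * and --dry-run always disables the write.
 * config_value and cmdline_value are -1 when not given, else 0 or 1.
 */
static int resolve_write_fetch_head(int config_value, int cmdline_value,
				    int dry_run)
{
	int write_fetch_head = 1; /* default: write FETCH_HEAD */

	if (config_value >= 0)
		write_fetch_head = config_value;
	if (cmdline_value >= 0)
		write_fetch_head = cmdline_value;

	/* FETCH_HEAD never gets updated in --dry-run mode */
	if (dry_run)
		write_fetch_head = 0;

	return write_fetch_head;
}
```

The new t5510 cases map onto this directly: config "no" plus
--write-fetch-head resolves to 1, while any combination with --dry-run
resolves to 0.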





* Re: [PATCH v2 09/18] maintenance: add loose-objects task
  2020-07-29 22:21     ` Emily Shaffer
@ 2020-07-30 15:38       ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 15:38 UTC (permalink / raw)
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On 7/29/2020 6:21 PM, Emily Shaffer wrote:
> On Thu, Jul 23, 2020 at 05:56:31PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +loose-objects::
>> +	The `loose-objects` job cleans up loose objects and places them into
>> +	pack-files. In order to prevent race conditions with concurrent Git
>> +	commands, it follows a two-step process. First, it deletes any loose
>> +	objects that already exist in a pack-file; concurrent Git processes
>> +	will examine the pack-file for the object data instead of the loose
>> +	object. Second, it creates a new pack-file (starting with "loose-")
> 
> [jonathan tan + jonathan nieder] If you are going to document this,
> probably it should also be tested, so the documentation does not become
> stale. Or, just don't document it.

Adding a condition to the test is a good idea.

>> +static int pack_loose(void)
>> +{
>> +	struct repository *r = the_repository;
>> +	int result = 0;
>> +	struct write_loose_object_data data;
>> +	struct strbuf prefix = STRBUF_INIT;
>> +	struct child_process *pack_proc;
>> +
>> +	/*
>> +	 * Do not start pack-objects process
>> +	 * if there are no loose objects.
>> +	 */
>> +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
>> +					   loose_object_exists,
>> +					   NULL, NULL, NULL))
> 
> [emily] To me, this is unintuitive - but upon inspection, it's exiting
> the foreach early if any loose object is found, so this is cheaper than
> actually counting. Maybe a comment would help to understand? Or we could
> name the function differently, like "bail_if_loose()" or something?

Sure. "bail_on_loose()" makes more sense to me.
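
The early-exit pattern Emily describes can be shown with a small
stand-alone sketch. Everything here is hypothetical (for_each_item()
and has_loose_objects() are stand-ins, not Git's
for_each_loose_file_in_objdir()); the only assumption carried over is
that the iterator stops and returns as soon as a callback returns
non-zero.

```c
#include <stddef.h>

typedef int (*each_item_fn)(const char *path, void *data);

/*
 * Hypothetical iterator: calls fn for each path and stops at the first
 * non-zero return value, which it passes back to the caller.
 */
static int for_each_item(const char **paths, size_t nr,
			 each_item_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(paths[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Callback: any item at all is enough, so stop the iteration at once. */
static int bail_on_loose(const char *path, void *data)
{
	size_t *visited = data;

	(void)path;
	(*visited)++;
	return 1;
}

/* Returns 1 if at least one item exists; *visited counts callback calls. */
static int has_loose_objects(const char **paths, size_t nr, size_t *visited)
{
	*visited = 0;
	return for_each_item(paths, nr, bail_on_loose, visited) != 0;
}
```

However many items there are, the callback runs at most once, which is
why this is cheaper than counting.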

>> +test_expect_success 'loose-objects task' '
>> +	# Repack everything so we know the state of the object dir
>> +	git repack -adk &&
>> +
>> +	# Hack to stop maintenance from running during "git commit"
>> +	echo in use >.git/objects/maintenance.lock &&
>> +	test_commit create-loose-object &&
> 
> [jonathan nieder] Does it make sense to use a different git command
> which is guaranteed to make a loose object? Is 'git commit' futureproof,
> if we decide commits should directly create packs in the future? 'git
> unpack-objects' is guaranteed to make a loose object, although it is
> clumsy because it needs a packfile to begin with...

"unpack-objects" also has the problem that it won't write the loose
object if it exists as a packed object, so it requires moving the
pack-file out of the object directory first. (That's also required
for testing that the loose objects get packed by the maintenance
task, but I'd rather not need to remove the multi-pack-index, too.)

> [jonathan tan] But, using 'git commit' is easier to understand in this
> context. Maybe commenting to say that we assume 'git commit' makes 1 or
> more loose objects will be enough to futureproof - then we have a signal
> to whoever made a change to make this fail, and that person knows how to
> fix this test.

I'll add the comment for now. Thanks.

-Stolee


* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-29 22:23     ` Emily Shaffer
@ 2020-07-30 16:57       ` Derrick Stolee
  2020-07-30 19:02         ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 16:57 UTC (permalink / raw)
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On 7/29/2020 6:23 PM, Emily Shaffer wrote:
> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index eb4b01c104..889d97afe7 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>  	return result;
>>  }
>>  
>> +#define TWO_GIGABYTES (2147483647)
> 
> [jonathan tan] This would be easier to understand if it was expressed
> with bitshift, e.g. 1 << 31

This is actually ((1 << 31) - 1) because "unsigned long" is 32 bits
wide on Windows. But it's better to not use magic numbers and instead
use operations like this.

>> +#define UNSET_BATCH_SIZE ((unsigned long)-1)
> [jonathan tan] This looks like it's never used... and vulnerable to
> cross-platform size changes because it's referring to an implicitly
> sized int, and could behave differently if it was put into a larger
> size, e.g. you wouldn't get 0xFFFF.. if you assigned this into a long
> long.

Thanks for catching this cruft.

>> +	for (p = get_all_packs(r); p; p = p->next) {
>> +		if (p->pack_size > max_size) {
>> +			second_largest_size = max_size;
>> +			max_size = p->pack_size;
>> +		} else if (p->pack_size > second_largest_size)
>> +			second_largest_size = p->pack_size;
>> +	}
>> +
>> +	result_size = second_largest_size + 1;
> [jonathan tan] What happens when there's only one packfile, and when
> there are two packfiles? Can we write tests to illustrate the behavior?
> The edge case here (result_size=1) is hard to understand by reading the
> code.
> 
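
To make the edge cases concrete, here is a stand-alone sketch of the
quoted loop. It is an illustration only: batch_size_from_packs() is a
hypothetical name, an array stands in for the list of packs, both
running values are assumed to start at zero, and the 2g cap is left
out.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the loop in the quoted patch: track the
 * largest and second-largest sizes, then return second-largest + 1.
 */
static unsigned long batch_size_from_packs(const unsigned long *sizes,
					   size_t nr)
{
	unsigned long max_size = 0, second_largest_size = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (sizes[i] > max_size) {
			second_largest_size = max_size;
			max_size = sizes[i];
		} else if (sizes[i] > second_largest_size)
			second_largest_size = sizes[i];
	}

	return second_largest_size + 1;
}
```

Under those assumptions, zero packs or a single pack both give a result
of 1, two packs of sizes 100 and 40 give 41, and two packs of equal size
100 give 101.
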
>> +
>> +	/* But limit ourselves to a batch size of 2g */
> [emily shaffer] nit: can you please capitalize G, GB, whatever :)

I suppose I could (get_unit_factor() performs case-insensitive matches
on the size suffixes), except that would be inconsistent with the
following error message in parse-options.c:

	return error(_("%s expects a non-negative integer value"
		       " with an optional k/m/g suffix"),

or these options in Documentation/git-pack-objects.txt:

--window-memory=<n>::
	This option provides an additional limit on top of `--window`;
	the window size will dynamically scale down so as to not take
	up more than '<n>' bytes in memory.  This is useful in
	repositories with a mix of large and small objects to not run
	out of memory with a large window, but still be able to take
	advantage of the large window for the smaller objects.  The
	size can be suffixed with "k", "m", or "g".
	`--window-memory=0` makes memory usage unlimited.  The default
	is taken from the `pack.windowMemory` configuration variable.

--max-pack-size=<n>::
	In unusual scenarios, you may not be able to create files
	larger than a certain size on your filesystem, and this option
	can be used to tell the command to split the output packfile
	into multiple independent packfiles, each not larger than the
	given size. The size can be suffixed with
	"k", "m", or "g". The minimum size allowed is limited to 1 MiB.
	This option
	prevents the creation of a bitmap index.
	The default is unlimited, unless the config variable
	`pack.packSizeLimit` is set.

Perhaps that's not really a good reason, but upper vs lower case
seems to be arbitrary. Any tie breaker seems valid.
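
For reference, case-insensitive suffix matching of the kind mentioned
above can be sketched as follows. This is not the get_unit_factor()
implementation; unit_factor() is a hypothetical stand-alone function
that only assumes the k/m/g suffixes and powers of 1024.

```c
#include <ctype.h>

/*
 * Hypothetical sketch: map an optional one-letter size suffix to its
 * multiplier, ignoring case. Returns 1 for an empty suffix and 0 for
 * anything unrecognized.
 */
static unsigned long long unit_factor(const char *suffix)
{
	if (!suffix[0])
		return 1;
	if (suffix[1])
		return 0; /* more than one character is not accepted here */

	switch (tolower((unsigned char)suffix[0])) {
	case 'k':
		return 1024ULL;
	case 'm':
		return 1024ULL * 1024;
	case 'g':
		return 1024ULL * 1024 * 1024;
	}
	return 0;
}
```

With matching like this, "2g" and "2G" are the same value to the
parser, so the choice of case in comments and messages is purely a
matter of consistency.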

Thanks,
-Stolee


* Re: [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-30 13:31         ` Derrick Stolee
@ 2020-07-30 19:00           ` Eric Sunshine
  2020-07-30 20:21             ` Derrick Stolee
  0 siblings, 1 reply; 164+ messages in thread
From: Eric Sunshine @ 2020-07-30 19:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, Git List,
	Johannes Schindelin, brian m. carlson, Josh Steadmon,
	Jonathan Nieder, Jeff King, Doan Tran Cong Danh, Phillip Wood,
	Emily Shaffer, Son Luong Ngoc, Jonathan Tan, Derrick Stolee,
	Derrick Stolee

On Thu, Jul 30, 2020 at 9:31 AM Derrick Stolee <stolee@gmail.com> wrote:
> int run_auto_maintenance(int quiet)
> {
>         struct child_process maint = CHILD_PROCESS_INIT;
>         maint.git_cmd = 1;
>
>         argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
>         if (quiet)
>                 argv_array_push(&maint.args, "--quiet");
>         else
>                 argv_array_push(&maint.args, "--no-quiet");

It's subjective, but this might be a good fit for the ternary operator:

    argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
    argv_array_push(&maint.args, quiet ? "--quiet" : "--no-quiet");

Or even:

    argv_array_pushl(&maint.args, "maintenance", "run", "--auto",
        quiet ? "--quiet" : "--no-quiet", NULL);

The latter is a bit less readable to me.


* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-30 16:57       ` Derrick Stolee
@ 2020-07-30 19:02         ` Derrick Stolee
  2020-07-30 19:24           ` Chris Torek
  2020-08-05 12:37           ` Đoàn Trần Công Danh
  0 siblings, 2 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 19:02 UTC (permalink / raw)
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On 7/30/2020 12:57 PM, Derrick Stolee wrote:
> On 7/29/2020 6:23 PM, Emily Shaffer wrote:
>> On Thu, Jul 23, 2020 at 05:56:33PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> diff --git a/builtin/gc.c b/builtin/gc.c
>>> index eb4b01c104..889d97afe7 100644
>>> --- a/builtin/gc.c
>>> +++ b/builtin/gc.c
>>> @@ -1021,19 +1021,65 @@ static int multi_pack_index_expire(void)
>>>  	return result;
>>>  }
>>>  
>>> +#define TWO_GIGABYTES (2147483647)
>>
>> [jonathan tan] This would be easier to understand if it was expressed
>> with bitshift, e.g. 1 << 31
> 
> This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
> on Windows. But it's better to not use magic numbers and instead use
> operations like this.

Nevermind. This breaks the build on 32-bit machines (even adding a few
"L" characters). I'll replace my magic decimal number with a magic
hexadecimal number.

Thanks,
-Stolee


* Re: [PATCH v2 11/18] maintenance: auto-size incremental-repack batch
  2020-07-30 19:02         ` Derrick Stolee
@ 2020-07-30 19:24           ` Chris Torek
  2020-08-05 12:37           ` Đoàn Trần Công Danh
  1 sibling, 0 replies; 164+ messages in thread
From: Chris Torek @ 2020-07-30 19:24 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, Git List,
	Johannes Schindelin, brian m. carlson, steadmon, jrnieder,
	Jeff King, congdanhqx, phillip.wood123, sluongng, jonathantanmy,
	Derrick Stolee, Derrick Stolee

On Thu, Jul 30, 2020 at 12:04 PM Derrick Stolee <stolee@gmail.com> wrote:
> > This is actually ((1 << 31) - 1) because "unsigned long" is 32-bits
> > on Windows. But it's better to not use magic numbers and instead use
> > operations like this.
>
> Nevermind. This breaks the build on 32-bit machines (even adding a few
> "L" characters). I'll replace my magic decimal number with a magic
> hexadecimal number.

You would need something along these lines:

    #define WHATEVER ((long)((1UL << 31) - 1))

i.e., make the constant 1 be "unsigned long" so that the shift is
well defined, then subtract 1, then convert back to signed long.

I say "something along" because there are several more ways to write
it.  I'm not really a big fan of most of them, and 0x7FFFFFFF (or in
lowercase) is my own preference here too.
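
As a small sketch of the point above (not part of the patch), the two
spellings can be compared directly. The names MAX_BATCH_SHIFTED,
MAX_BATCH_HEX and max_batch_size() are made up for illustration; the patch
itself calls its constant TWO_GIGABYTES. The sketch assumes "unsigned long"
is at least 32 bits wide, which the C standard guarantees.

```c
#include <assert.h>

/*
 * Hypothetical names, for illustration only.
 *
 * Start from an unsigned long so the shift is well defined even where
 * long is 32 bits, subtract one, then convert back to signed long.
 */
#define MAX_BATCH_SHIFTED ((long)((1UL << 31) - 1))

/* The hexadecimal spelling preferred in this thread. */
#define MAX_BATCH_HEX 0x7FFFFFFF

/* Return the batch-size cap, checking that both spellings agree. */
static long max_batch_size(void)
{
	assert(MAX_BATCH_SHIFTED == MAX_BATCH_HEX);
	return MAX_BATCH_SHIFTED;
}
```

Both macros evaluate to 2^31 - 1, the same value as the decimal constant in
the patch; the hexadecimal form simply avoids the casts.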

Chris

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v2 03/18] maintenance: replace run_auto_gc()
  2020-07-30 19:00           ` Eric Sunshine
@ 2020-07-30 20:21             ` Derrick Stolee
  0 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee @ 2020-07-30 20:21 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, Git List,
	Johannes Schindelin, brian m. carlson, Josh Steadmon,
	Jonathan Nieder, Jeff King, Doan Tran Cong Danh, Phillip Wood,
	Emily Shaffer, Son Luong Ngoc, Jonathan Tan, Derrick Stolee,
	Derrick Stolee

On 7/30/2020 3:00 PM, Eric Sunshine wrote:
> On Thu, Jul 30, 2020 at 9:31 AM Derrick Stolee <stolee@gmail.com> wrote:
>> int run_auto_maintenance(int quiet)
>> {
>>         struct child_process maint = CHILD_PROCESS_INIT;
>>         maint.git_cmd = 1;
>>
>>         argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
>>         if (quiet)
>>                 argv_array_push(&maint.args, "--quiet");
>>         else
>>                 argv_array_push(&maint.args, "--no-quiet");
> 
> It's subjective, but this might be a good fit for the ternary operator:
> 
>     argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
>     argv_array_push(&maint.args, quiet ? "--quiet" : "--no-quiet");

Good idea! Thanks.

> Or even:
> 
>     argv_array_pushl(&maint.args, "maintenance", "run", "--auto",
>         quiet ? "--quiet" : "--no-quiet", NULL);
> 
> The latter is a bit less readable to me.

I agree.
-Stolee

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v3 00/20] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-23 17:56 ` [PATCH v2 00/18] " Derrick Stolee via GitGitGadget
                     ` (18 preceding siblings ...)
  2020-07-29 22:03   ` [PATCH v2 00/18] Maintenance builtin, allowing 'gc --auto' customization Emily Shaffer
@ 2020-07-30 22:24   ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 01/20] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
                       ` (22 more replies)
  19 siblings, 23 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee

This is a second attempt at redesigning Git's repository maintenance
patterns. The first attempt [1] included a way to run jobs in the background
using a long-lived process; that idea was rejected and is not included in
this series. A future series will use the OS to handle scheduling tasks.

[1] 
https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/

As mentioned before, git gc already plays the role of maintaining Git
repositories. It has accumulated several smaller pieces in its long history,
including:

 1. Repacking all reachable objects into one pack-file (and deleting
    unreachable objects).
 2. Packing refs.
 3. Expiring reflogs.
 4. Clearing rerere logs.
 5. Updating the commit-graph file.
 6. Pruning worktrees.

While expiring reflogs, clearing rerere logs, and deleting unreachable
objects are suitable under the guise of "garbage collection", packing refs
and updating the commit-graph file are not as obviously fitting. Further,
these operations are "all or nothing" in that they rewrite almost all
repository data, which does not perform well at extremely large scales.
These operations can also be disruptive to foreground Git commands when git
gc --auto triggers during routine use.

This series does not intend to change what git gc does, but instead create
new choices for automatic maintenance activities, of which git gc remains
the only one enabled by default.

The new maintenance tasks are:

 * 'commit-graph' : write and verify a single layer of an incremental
   commit-graph.
 * 'loose-objects' : prune packed loose objects, then create a new pack from
   a batch of loose objects.
 * 'pack-files' : expire redundant packs from the multi-pack-index, then
   repack using the multi-pack-index's incremental repack strategy.
 * 'fetch' : fetch from each remote, storing the refs in
   'refs/prefetch/<remote>/'.

These tasks are all disabled by default, but can be enabled with config
options or run explicitly using "git maintenance run --task=<task>". There
are additional config options to allow customizing the conditions under
which the tasks run with the '--auto' option. ('fetch' will never run with
the '--auto' option.)
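
The selection rule described above can be sketched roughly as follows. This
is a simplified illustration, not the code from the series: the struct and
the helper task_should_run() are hypothetical, and the '--auto' conditions
are left out. The idea is that explicit --task=<task> arguments override the
'enabled' setting entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the task struct used in builtin/gc.c. */
struct task {
	const char *name;
	unsigned enabled:1,   /* on by default or via config */
		 selected:1;  /* named by a --task=<task> argument */
};

/*
 * Hypothetical helper: return nonzero if the task should run.
 * When at least one --task argument was given (tasks_selected > 0),
 * only selected tasks run and 'enabled' is ignored. Otherwise only
 * enabled tasks run.
 */
static int task_should_run(const struct task *t, int tasks_selected)
{
	assert(t != NULL);

	if (tasks_selected)
		return t->selected;
	return t->enabled;
}
```

With this rule, a plain "git maintenance run" would run only 'gc' by
default, while passing a --task argument would run exactly the named tasks.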

Because 'gc' is implemented as a maintenance task, the most dramatic change
of this series is to convert the 'git gc --auto' calls into 'git maintenance
run --auto' calls at the end of some Git commands. By default, the only
change is that 'git gc --auto' will be run below an additional 'git
maintenance' process.

The 'git maintenance' builtin has a 'run' subcommand so it can be extended
later with subcommands that manage background maintenance, such as 'start',
'stop', 'pause', or 'schedule'. These are not the subject of this series, as
it is important to focus on the maintenance activities themselves.

An expert user could set up scheduled background maintenance themselves with
the current series. I have the following crontab data set up to run
maintenance on an hourly basis:

0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

My config includes all tasks except the 'gc' task. The hourly run is
over-aggressive, but is sufficient for testing. I'll replace it with daily
when I feel satisfied.

Hopefully this direction is seen as a positive one. My goal was to add more
options for expert users, along with the flexibility to create background
maintenance via the OS in a later series.

OUTLINE
=======

Patches 1-4 remove some references to the_repository in builtin/gc.c before
we start depending on code in that builtin.

Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
simple shim over 'git gc' and replace calls to 'git gc --auto' from other
commands.

Patches 8-15 create new maintenance tasks. These are the same tasks sent in
the previous RFC.

Patches 16-21 create more customization through config and perform other
polish items.

FUTURE WORK
===========

 * Add 'start', 'stop', and 'schedule' subcommands to initialize the
   commands run in the background. You can see my progress towards this goal
   here: https://github.com/gitgitgadget/git/pull/680
   
   
 * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
   default, but might have different '--auto' conditions and more config
   options.
   
   
 * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
   with use of the 'commit-graph' task.
   
   
 * Update the builtin to use struct repository *r properly, especially when
   calling subcommands.
   
   

UPDATES in V2
=============

I'm sending this between v2.28.0-rc2 and v2.28.0 as the release things have
become a bit quiet.

 * The biggest disruption to the range-diff is that I removed the premature
   use of struct repository *r and instead continue to rely on 
   the_repository. This means several patches were dropped that did prep
   work in builtin/gc.c.
   
   
 * I dropped the task hashmap and opted for a linear scan. This task list
   will always be too small to justify the extra complication of the
   hashmap.
   
   
 * struct maintenance_opts is properly static now.
   
   
 * Some tasks are renamed: fetch -> prefetch, pack-files ->
   incremental-repack.
   
   
 * With the rename, the prefetch task uses refs/prefetch/ instead of 
   refs/hidden/.
   
   
 * A trace2 region around the task executions is added.
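
To illustrate the linear scan mentioned in the list above, here is a minimal
sketch of looking up a task by name in a small static array. It is
simplified from the series: the struct has only a name, and find_task() is a
hypothetical helper rather than a function from the patches. strcasecmp() is
assumed to be available from <strings.h> (POSIX).

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h> /* strcasecmp(), POSIX */

/* Simplified stand-in for the task struct; only the name is kept. */
struct task {
	const char *name;
};

/* Task names as used in this series after the v2 renames. */
static struct task tasks[] = {
	{ "gc" },
	{ "commit-graph" },
	{ "prefetch" },
	{ "loose-objects" },
	{ "incremental-repack" },
};

/*
 * Hypothetical helper: case-insensitive linear scan over the task
 * list. Returns a pointer to the matching task, or NULL if no task
 * has that name. With a handful of entries this is simpler than a
 * hashmap and just as fast in practice.
 */
static struct task *find_task(const char *name)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
		if (!strcasecmp(tasks[i].name, name))
			return &tasks[i];
	}
	return NULL;
}
```

A caller handling a --task=<task> argument could report an error when
find_task() returns NULL and mark the task as selected otherwise.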
   
   

UPDATES in V3
=============

 * This series is now based on jk/strvec, as there are several places where
   I was adding new callers to argv_array_push* and run_command_v_opt()
   which have been replaced with strvec_push*() and run_command(), using a
   'struct child_process'.
   
   
 * I added and updated Junio's patch from jc/no-update-fetch-head into the
   proper place before the 'prefetch' task. The 'prefetch' task now uses
   --no-write-fetch-head to avoid issues with FETCH_HEAD.
   
   
 * Since there were concerns around core.multiPackIndex, I added some extra
   care checking for that config being enabled. Since that already needed to
   adjust the config value from its existing implementation in midx.c, I
   added it to repo-settings and made it enabled by default.
   
   
 * Lots of feedback from the previous round. Thanks, all! I fully expect to
   have at least one more round of feedback, but things are improving quite
   a lot.
   
   

Thanks, -Stolee

Derrick Stolee (19):
  maintenance: create basic maintenance runner
  maintenance: add --quiet option
  maintenance: replace run_auto_gc()
  maintenance: initialize task array
  maintenance: add commit-graph task
  maintenance: add --task option
  maintenance: take a lock on the objects directory
  maintenance: add prefetch task
  maintenance: add loose-objects task
  midx: enable core.multiPackIndex by default
  maintenance: add incremental-repack task
  maintenance: auto-size incremental-repack batch
  maintenance: create maintenance.<task>.enabled config
  maintenance: use pointers to check --auto
  maintenance: add auto condition for commit-graph task
  maintenance: create auto condition for loose-objects
  maintenance: add incremental-repack auto condition
  midx: use start_delayed_progress()
  maintenance: add trace2 regions for task execution

Junio C Hamano (1):
  fetch: optionally allow disabling FETCH_HEAD update

 .gitignore                           |   1 +
 Documentation/config.txt             |   2 +
 Documentation/config/core.txt        |   4 +-
 Documentation/config/fetch.txt       |   7 +
 Documentation/config/maintenance.txt |  32 ++
 Documentation/fetch-options.txt      |  16 +-
 Documentation/git-clone.txt          |   6 +-
 Documentation/git-maintenance.txt    | 127 +++++
 builtin.h                            |   1 +
 builtin/am.c                         |   2 +-
 builtin/commit.c                     |   2 +-
 builtin/fetch.c                      |  25 +-
 builtin/gc.c                         | 713 +++++++++++++++++++++++++++
 builtin/merge.c                      |   2 +-
 builtin/pull.c                       |   3 +-
 builtin/rebase.c                     |   4 +-
 commit-graph.c                       |   8 +-
 commit-graph.h                       |   1 +
 git.c                                |   1 +
 midx.c                               |  23 +-
 midx.h                               |   1 +
 object.h                             |   1 +
 repo-settings.c                      |   6 +
 repository.h                         |   2 +
 run-command.c                        |  16 +-
 run-command.h                        |   2 +-
 t/t5319-multi-pack-index.sh          |  15 +-
 t/t5510-fetch.sh                     |  41 +-
 t/t5514-fetch-multiple.sh            |   2 +-
 t/t5521-pull-options.sh              |  16 +
 t/t7900-maintenance.sh               | 215 ++++++++
 31 files changed, 1240 insertions(+), 57 deletions(-)
 create mode 100644 Documentation/config/maintenance.txt
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh


base-commit: 90dfbf331c0ade4c15c74276c466e32e384f9ceb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/671

Range-diff vs v2:

  1:  63ec602a07 !  1:  12fe73bb72 maintenance: create basic maintenance runner
     @@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
      +
      +static int maintenance_task_gc(void)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     ++	struct child_process child = CHILD_PROCESS_INIT;
      +
     -+	argv_array_pushl(&cmd, "gc", NULL);
     ++	child.git_cmd = 1;
     ++	strvec_push(&child.args, "gc");
      +
      +	if (opts.auto_flag)
     -+		argv_array_pushl(&cmd, "--auto", NULL);
     ++		strvec_push(&child.args, "--auto");
      +
      +	close_object_store(the_repository->objects);
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	argv_array_clear(&cmd);
     -+
     -+	return result;
     ++	return run_command(&child);
      +}
      +
      +static int maintenance_run(void)
     @@ t/t7900-maintenance.sh (new)
      +
      +test_description='git maintenance builtin'
      +
     -+GIT_TEST_COMMIT_GRAPH=0
     -+GIT_TEST_MULTI_PACK_INDEX=0
     -+
      +. ./test-lib.sh
      +
      +test_expect_success 'help text' '
     -+	test_must_fail git maintenance -h 2>err &&
     ++	test_expect_code 129 git maintenance -h 2>err &&
      +	test_i18ngrep "usage: git maintenance run" err
      +'
      +
     -+test_expect_success 'gc [--auto]' '
     ++test_expect_success 'run [--auto]' '
      +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
      +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
      +	grep ",\"gc\"]" run-no-auto.txt  &&
  2:  1d37e55cb7 !  2:  6e533e43d7 maintenance: add --quiet option
     @@ builtin/gc.c: static const char * const builtin_maintenance_usage[] = {
      @@ builtin/gc.c: static int maintenance_task_gc(void)
       
       	if (opts.auto_flag)
     - 		argv_array_pushl(&cmd, "--auto", NULL);
     + 		strvec_push(&child.args, "--auto");
      +	if (opts.quiet)
     -+		argv_array_pushl(&cmd, "--quiet", NULL);
     ++		strvec_push(&child.args, "--quiet");
       
       	close_object_store(the_repository->objects);
     - 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     + 	return run_command(&child);
      @@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
       	static struct option builtin_maintenance_options[] = {
       		OPT_BOOL(0, "auto", &opts.auto_flag,
     @@ t/t7900-maintenance.sh: test_expect_success 'help text' '
       	test_i18ngrep "usage: git maintenance run" err
       '
       
     --test_expect_success 'gc [--auto]' '
     +-test_expect_success 'run [--auto]' '
      -	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
     -+test_expect_success 'gc [--auto|--quiet]' '
     ++test_expect_success 'run [--auto|--quiet]' '
      +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
       	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
      +	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
  3:  f164d1a0b4 !  3:  c4674fc211 maintenance: replace run_auto_gc()
     @@ Commit message
      
          Rename run_auto_gc() to run_auto_maintenance() to be clearer what is
          happening on this call, and to expose all callers in the current diff.
     +    Rewrite the method to use a struct child_process to simplify the calls
     +    slightly.
      
          Since 'git fetch' already allows disabling the 'git gc --auto'
          subprocess, add an equivalent option with a different name to be more
     @@ Documentation/fetch-options.txt: ifndef::git-pull[]
       	Allow several <repository> and <group> arguments to be
       	specified. No <refspec>s may be specified.
       
     -+--[no-]maintenance::
     ++--[no-]auto-maintenance::
       --[no-]auto-gc::
      -	Run `git gc --auto` at the end to perform garbage collection
      -	if needed. This is enabled by default.
     -+	Run `git maintenance run --auto` at the end to perform garbage
     -+	collection if needed. This is enabled by default.
     ++	Run `git maintenance run --auto` at the end to perform automatic
     ++	repository maintenance if needed. (`--[no-]auto-gc` is a synonym.)
     ++	This is enabled by default.
       
       --[no-]write-commit-graph::
       	Write a commit-graph after fetching. This overrides the config
     @@ Documentation/git-clone.txt: repository using this option and then delete branch
      -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
      -If these objects are removed and were referenced by the cloned repository,
      -then the cloned repository will become corrupt.
     -+which automatically call `git maintenance run --auto` and `git gc --auto`.
     -+(See linkgit:git-maintenance[1] and linkgit:git-gc[1].) If these objects
     -+are removed and were referenced by the cloned repository, then the cloned
     -+repository will become corrupt.
     ++which automatically call `git maintenance run --auto`. (See
     ++linkgit:git-maintenance[1].) If these objects are removed and were referenced
     ++by the cloned repository, then the cloned repository will become corrupt.
       +
       Note that running `git repack` without the `--local` option in a repository
       cloned with `--shared` will copy objects from the source repository into a pack
     @@ builtin/fetch.c: static struct option builtin_fetch_options[] = {
       	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
       			N_("report that we have only objects reachable from this object")),
       	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
     -+	OPT_BOOL(0, "maintenance", &enable_auto_gc,
     ++	OPT_BOOL(0, "auto-maintenance", &enable_auto_gc,
      +		 N_("run 'maintenance --auto' after fetching")),
       	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
      -		 N_("run 'gc --auto' after fetching")),
     @@ run-command.c: int run_processes_parallel_tr2(int n, get_next_task_fn get_next_t
      -int run_auto_gc(int quiet)
      +int run_auto_maintenance(int quiet)
       {
     - 	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
     - 	int status;
     +-	struct strvec argv_gc_auto = STRVEC_INIT;
     +-	int status;
     ++	struct child_process maint = CHILD_PROCESS_INIT;
       
     --	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
     -+	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
     - 	if (quiet)
     - 		argv_array_push(&argv_gc_auto, "--quiet");
     -+	else
     -+		argv_array_push(&argv_gc_auto, "--no-quiet");
     +-	strvec_pushl(&argv_gc_auto, "gc", "--auto", NULL);
     +-	if (quiet)
     +-		strvec_push(&argv_gc_auto, "--quiet");
     +-	status = run_command_v_opt(argv_gc_auto.items, RUN_GIT_CMD);
     +-	strvec_clear(&argv_gc_auto);
     +-	return status;
     ++	maint.git_cmd = 1;
     ++	strvec_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
     ++	strvec_push(&maint.args, quiet ? "--quiet" : "--no-quiet");
      +
     - 	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
     - 	argv_array_clear(&argv_gc_auto);
     - 	return status;
     ++	return run_command(&maint);
     + }
      
       ## run-command.h ##
      @@ run-command.h: int run_hook_ve(const char *const *env, const char *name, va_list args);
  4:  8e260bccf1 !  4:  b9332c1318 maintenance: initialize task array
     @@ Commit message
          future command-line argument) along with a function pointer to its
          implementation and a boolean for whether the step is enabled.
      
     -    A list of pointers to these structs are initialized with the full list
     -    of implemented tasks along with a default order. For now, this list only
     -    contains the "gc" task. This task is also the only task enabled by
     -    default.
     +    A list these structs are initialized with the full list of implemented
     +    tasks along with a default order. For now, this list only contains the
     +    "gc" task. This task is also the only task enabled by default.
      
     +    The run subcommand will return a nonzero exit code if any task fails.
     +    However, it will attempt all tasks in its loop before returning with the
     +    failure. Also each failed task will send an error message.
     +
     +    Helped-by: Taylor Blau <me@ttaylorr.com>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
     - 	return 0;
     - }
     - 
     -+#define MAX_NUM_TASKS 1
     -+
     - static const char * const builtin_maintenance_usage[] = {
     - 	N_("git maintenance run [<options>]"),
     - 	NULL
      @@ builtin/gc.c: static int maintenance_task_gc(void)
     - 	return result;
     + 	return run_command(&child);
       }
       
      +typedef int maintenance_task_fn(void);
     @@ builtin/gc.c: static int maintenance_task_gc(void)
      +	unsigned enabled:1;
      +};
      +
     -+static struct maintenance_task *tasks[MAX_NUM_TASKS];
     -+static int num_tasks;
     ++enum maintenance_task_label {
     ++	TASK_GC,
     ++
     ++	/* Leave as final value */
     ++	TASK__COUNT
     ++};
     ++
     ++static struct maintenance_task tasks[] = {
     ++	[TASK_GC] = {
     ++		"gc",
     ++		maintenance_task_gc,
     ++		1,
     ++	},
     ++};
      +
       static int maintenance_run(void)
       {
     @@ builtin/gc.c: static int maintenance_task_gc(void)
      +	int i;
      +	int result = 0;
      +
     -+	for (i = 0; !result && i < num_tasks; i++) {
     -+		if (!tasks[i]->enabled)
     ++	for (i = 0; i < TASK__COUNT; i++) {
     ++		if (!tasks[i].enabled)
      +			continue;
     -+		result = tasks[i]->fn();
     ++
     ++		if (tasks[i].fn()) {
     ++			error(_("task '%s' failed"), tasks[i].name);
     ++			result = 1;
     ++		}
      +	}
      +
      +	return result;
     -+}
     -+
     -+static void initialize_tasks(void)
     -+{
     -+	int i;
     -+	num_tasks = 0;
     -+
     -+	for (i = 0; i < MAX_NUM_TASKS; i++)
     -+		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
     -+
     -+	tasks[num_tasks]->name = "gc";
     -+	tasks[num_tasks]->fn = maintenance_task_gc;
     -+	tasks[num_tasks]->enabled = 1;
     -+	num_tasks++;
       }
       
       int cmd_maintenance(int argc, const char **argv, const char *prefix)
     -@@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
     - 				   builtin_maintenance_options);
     - 
     - 	opts.quiet = !isatty(2);
     -+	initialize_tasks();
     - 
     - 	argc = parse_options(argc, argv, prefix,
     - 			     builtin_maintenance_options,
  5:  04552b1d2e !  5:  a4d9836bed maintenance: add commit-graph task
     @@ Documentation/git-maintenance.txt: run::
       	stands for "garbage collection," but this task performs many
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
     - 	return 0;
     - }
     - 
     --#define MAX_NUM_TASKS 1
     -+#define MAX_NUM_TASKS 2
     - 
     - static const char * const builtin_maintenance_usage[] = {
     - 	N_("git maintenance run [<options>]"),
      @@ builtin/gc.c: static struct maintenance_opts {
       	int quiet;
       } opts;
       
      +static int run_write_commit_graph(void)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     ++	struct child_process child = CHILD_PROCESS_INIT;
      +
     -+	argv_array_pushl(&cmd, "commit-graph", "write",
     -+			 "--split", "--reachable", NULL);
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "commit-graph", "write",
     ++		     "--split", "--reachable", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--no-progress");
     -+
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	argv_array_clear(&cmd);
     ++		strvec_push(&child.args, "--no-progress");
      +
     -+	return result;
     ++	return !!run_command(&child);
      +}
      +
      +static int run_verify_commit_graph(void)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     ++	struct child_process child = CHILD_PROCESS_INIT;
      +
     -+	argv_array_pushl(&cmd, "commit-graph", "verify",
     -+			 "--shallow", NULL);
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "commit-graph", "verify",
     ++		     "--shallow", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--no-progress");
     ++		strvec_push(&child.args, "--no-progress");
      +
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	argv_array_clear(&cmd);
     -+
     -+	return result;
     ++	return !!run_command(&child);
      +}
      +
      +static int maintenance_task_commit_graph(void)
     @@ builtin/gc.c: static struct maintenance_opts {
      +	struct repository *r = the_repository;
      +	char *chain_path;
      +
     -+	/* Skip commit-graph when --auto is specified. */
     -+	if (opts.auto_flag)
     -+		return 0;
     -+
      +	close_object_store(r->objects);
      +	if (run_write_commit_graph()) {
      +		error(_("failed to write commit-graph"));
     @@ builtin/gc.c: static struct maintenance_opts {
      +
       static int maintenance_task_gc(void)
       {
     - 	int result;
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 	tasks[num_tasks]->fn = maintenance_task_gc;
     - 	tasks[num_tasks]->enabled = 1;
     - 	num_tasks++;
     -+
     -+	tasks[num_tasks]->name = "commit-graph";
     -+	tasks[num_tasks]->fn = maintenance_task_commit_graph;
     -+	num_tasks++;
     - }
     + 	struct child_process child = CHILD_PROCESS_INIT;
     +@@ builtin/gc.c: struct maintenance_task {
     + 
     + enum maintenance_task_label {
     + 	TASK_GC,
     ++	TASK_COMMIT_GRAPH,
       
     - int cmd_maintenance(int argc, const char **argv, const char *prefix)
     + 	/* Leave as final value */
     + 	TASK__COUNT
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 		maintenance_task_gc,
     + 		1,
     + 	},
     ++	[TASK_COMMIT_GRAPH] = {
     ++		"commit-graph",
     ++		maintenance_task_commit_graph,
     ++	},
     + };
     + 
     + static int maintenance_run(void)
      
       ## commit-graph.c ##
      @@ commit-graph.c: static char *get_split_graph_filename(struct object_directory *odb,
     @@ commit-graph.h: struct commit;
       /*
      
       ## t/t7900-maintenance.sh ##
     -@@ t/t7900-maintenance.sh: test_expect_success 'help text' '
     - 	test_i18ngrep "usage: git maintenance run" err
     - '
     +@@ t/t7900-maintenance.sh: test_description='git maintenance builtin'
       
     --test_expect_success 'gc [--auto|--quiet]' '
     -+test_expect_success 'run [--auto|--quiet]' '
     - 	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
     - 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
     - 	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
     + . ./test-lib.sh
     + 
     ++GIT_TEST_COMMIT_GRAPH=0
     ++
     + test_expect_success 'help text' '
     + 	test_expect_code 129 git maintenance -h 2>err &&
     + 	test_i18ngrep "usage: git maintenance run" err
  6:  a09b1c1687 !  6:  dafb0d9bbc maintenance: add --task option
     @@ Commit message
          references. We use the hashmap to match the --task=<task> arguments into
          the task struct data.
      
     +    Keep in mind that the 'enabled' member of the maintenance_task struct is
     +    a placeholder for a future 'maintenance.<task>.enabled' config option.
     +    Thus, we use the 'enabled' member to specify which tasks are run when
     +    the user does not specify any --task=<task> arguments. The 'enabled'
     +    member should be ignored if --task=<task> appears.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Documentation/git-maintenance.txt ##
     @@ builtin/gc.c: typedef int maintenance_task_fn(void);
       	const char *name;
       	maintenance_task_fn *fn;
      -	unsigned enabled:1;
     -+	int task_order;
      +	unsigned enabled:1,
      +		 selected:1;
     ++	int selected_order;
       };
       
     - static struct maintenance_task *tasks[MAX_NUM_TASKS];
     - static int num_tasks;
     + enum maintenance_task_label {
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 	},
     + };
       
      +static int compare_tasks_by_selection(const void *a_, const void *b_)
      +{
      +	const struct maintenance_task *a, *b;
     -+	a = (const struct maintenance_task *)a_;
     -+	b = (const struct maintenance_task *)b_;
      +
     -+	return b->task_order - a->task_order;
     ++	a = (const struct maintenance_task *)&a_;
     ++	b = (const struct maintenance_task *)&b_;
     ++
     ++	return b->selected_order - a->selected_order;
      +}
      +
       static int maintenance_run(void)
     @@ builtin/gc.c: typedef int maintenance_task_fn(void);
       	int result = 0;
       
      +	if (opts.tasks_selected)
     -+		QSORT(tasks, num_tasks, compare_tasks_by_selection);
     ++		QSORT(tasks, TASK__COUNT, compare_tasks_by_selection);
      +
     - 	for (i = 0; !result && i < num_tasks; i++) {
     --		if (!tasks[i]->enabled)
     -+		if (opts.tasks_selected && !tasks[i]->selected)
     + 	for (i = 0; i < TASK__COUNT; i++) {
     +-		if (!tasks[i].enabled)
     ++		if (opts.tasks_selected && !tasks[i].selected)
      +			continue;
      +
     -+		if (!opts.tasks_selected && !tasks[i]->enabled)
     ++		if (!opts.tasks_selected && !tasks[i].enabled)
       			continue;
     -+
     - 		result = tasks[i]->fn();
     - 	}
       
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 	num_tasks++;
     + 		if (tasks[i].fn()) {
     +@@ builtin/gc.c: static int maintenance_run(void)
     + 	return result;
       }
       
      +static int task_option_parse(const struct option *opt,
     @@ builtin/gc.c: static void initialize_tasks(void)
      +
      +	BUG_ON_OPT_NEG(unset);
      +
     -+	if (!arg || !strlen(arg)) {
     -+		error(_("--task requires a value"));
     -+		return 1;
     -+	}
     -+
      +	opts.tasks_selected++;
      +
     -+	for (i = 0; i < MAX_NUM_TASKS; i++) {
     -+		if (tasks[i] && !strcasecmp(tasks[i]->name, arg)) {
     -+			task = tasks[i];
     ++	for (i = 0; i < TASK__COUNT; i++) {
     ++		if (!strcasecmp(tasks[i].name, arg)) {
     ++			task = &tasks[i];
      +			break;
      +		}
      +	}
     @@ builtin/gc.c: static void initialize_tasks(void)
      +	}
      +
      +	task->selected = 1;
     -+	task->task_order = opts.tasks_selected;
     ++	task->selected_order = opts.tasks_selected;
      +
      +	return 0;
      +}
  7:  e9260a9c3f !  7:  1b00524da3 maintenance: take a lock on the objects directory
     @@ builtin/gc.c: static int maintenance_run(void)
      +	free(lock_path);
       
       	if (opts.tasks_selected)
     - 		QSORT(tasks, num_tasks, compare_tasks_by_selection);
     + 		QSORT(tasks, TASK__COUNT, compare_tasks_by_selection);
      @@ builtin/gc.c: static int maintenance_run(void)
     - 		result = tasks[i]->fn();
     + 		}
       	}
       
      +	rollback_lock_file(&lk);
  -:  ---------- >  8:  0e94e04dcd fetch: optionally allow disabling FETCH_HEAD update
  8:  3165b8916d !  9:  9e38ade15c maintenance: add prefetch task
     @@ Commit message
      
          2. --refmap= removes the configured refspec which usually updates
             refs/remotes/<remote>/* with the refs advertised by the remote.
     +       While this looks confusing, this was documented and tested by
     +       b40a50264ac (fetch: document and test --refmap="", 2020-01-21),
     +       including this sentence in the documentation:
     +
     +            Providing an empty `<refspec>` to the `--refmap` option
     +            causes Git to ignore the configured refspecs and rely
     +            entirely on the refspecs supplied as command-line arguments.
      
          3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
             we can ensure that we actually load the new values somewhere in
     @@ Commit message
          4. --prune will delete the refs/prefetch/<remote> refs that no
             longer appear on the remote.
      
     +    5. --no-write-fetch-head prevents updating FETCH_HEAD.
     +
          We've been using this step as a critical background job in Scalar
          [1] (and VFS for Git). This solved a pain point that was showing up
          in user reports: fetching was a pain! Users do not like waiting to
     @@ Documentation/git-maintenance.txt: since it will not expire `.graph` files that
       the expiration delay.
       
      +prefetch::
     -+	The `fetch` task updates the object directory with the latest objects
     -+	from all registered remotes. For each remote, a `git fetch` command
     -+	is run. The refmap is custom to avoid updating local or remote
     ++	The `prefetch` task updates the object directory with the latest
     ++	objects from all registered remotes. For each remote, a `git fetch`
     ++	command is run. The refmap is custom to avoid updating local or remote
      +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
      +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
      +	not updated.
      ++
     -+This means that foreground fetches are still required to update the
     -+remote refs, but the users is notified when the branches and tags are
     -+updated on the remote.
     ++This is done to avoid disrupting the remote-tracking branches. The end users
     ++expect these refs to stay unmoved unless they initiate a fetch.  With prefetch
     ++task, however, the objects necessary to complete a later real fetch would
     ++already be obtained, so the real fetch would go faster.  In the ideal case,
     ++it will just become an update to bunch of remote-tracking branches without
     ++any object transfer.
      +
       gc::
       	Cleanup unnecessary files and optimize the local repository. "GC"
     @@ builtin/gc.c
       
       #define FAILED_RUN "failed to run %s"
       
     -@@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
     - 	return 0;
     - }
     - 
     --#define MAX_NUM_TASKS 2
     -+#define MAX_NUM_TASKS 3
     - 
     - static const char * const builtin_maintenance_usage[] = {
     - 	N_("git maintenance run [<options>]"),
      @@ builtin/gc.c: static int maintenance_task_commit_graph(void)
       	return 1;
       }
       
      +static int fetch_remote(const char *remote)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	struct strbuf refmap = STRBUF_INIT;
     ++	struct child_process child = CHILD_PROCESS_INIT;
      +
     -+	argv_array_pushl(&cmd, "fetch", remote, "--prune",
     -+			 "--no-tags", "--refmap=", NULL);
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
     ++		     "--no-write-fetch-head", "--refmap=", NULL);
      +
     -+	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
     -+	argv_array_push(&cmd, refmap.buf);
     ++	strvec_pushf(&child.args, "+refs/heads/*:refs/prefetch/%s/*", remote);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--quiet");
     ++		strvec_push(&child.args, "--quiet");
      +
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+
     -+	strbuf_release(&refmap);
     -+	return result;
     ++	return !!run_command(&child);
      +}
      +
      +static int fill_each_remote(struct remote *remote, void *cbdata)
     @@ builtin/gc.c: static int maintenance_task_commit_graph(void)
      +		goto cleanup;
      +	}
      +
     -+	/*
     -+	 * Do not modify the result based on the success of the 'fetch'
     -+	 * operation, as a loss of network could cause 'fetch' to fail
     -+	 * quickly. We do not want that to stop the rest of our
     -+	 * background operations.
     -+	 */
      +	for (item = remotes.items;
      +	     item && item < remotes.items + remotes.nr;
      +	     item++)
     -+		fetch_remote(item->string);
     ++		result |= fetch_remote(item->string);
      +
      +cleanup:
      +	string_list_clear(&remotes, 0);
     @@ builtin/gc.c: static int maintenance_task_commit_graph(void)
      +
       static int maintenance_task_gc(void)
       {
     - 	int result;
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 	for (i = 0; i < MAX_NUM_TASKS; i++)
     - 		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
     + 	struct child_process child = CHILD_PROCESS_INIT;
     +@@ builtin/gc.c: struct maintenance_task {
     + };
       
     -+	tasks[num_tasks]->name = "prefetch";
     -+	tasks[num_tasks]->fn = maintenance_task_prefetch;
     -+	num_tasks++;
     -+
     - 	tasks[num_tasks]->name = "gc";
     - 	tasks[num_tasks]->fn = maintenance_task_gc;
     - 	tasks[num_tasks]->enabled = 1;
     + enum maintenance_task_label {
     ++	TASK_PREFETCH,
     + 	TASK_GC,
     + 	TASK_COMMIT_GRAPH,
     + 
     +@@ builtin/gc.c: enum maintenance_task_label {
     + };
     + 
     + static struct maintenance_task tasks[] = {
     ++	[TASK_PREFETCH] = {
     ++		"prefetch",
     ++		maintenance_task_prefetch,
     ++	},
     + 	[TASK_GC] = {
     + 		"gc",
     + 		maintenance_task_gc,
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'run --task duplicate' '
  9:  83648f4865 ! 10:  0128fdfd1a maintenance: add loose-objects task
     @@ Documentation/git-maintenance.txt: gc::
       --auto::
      
       ## builtin/gc.c ##
     -@@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
     - 	return 0;
     - }
     - 
     --#define MAX_NUM_TASKS 3
     -+#define MAX_NUM_TASKS 4
     - 
     - static const char * const builtin_maintenance_usage[] = {
     - 	N_("git maintenance run [<options>]"),
      @@ builtin/gc.c: static int maintenance_task_gc(void)
     - 	return result;
     + 	return run_command(&child);
       }
       
      +static int prune_packed(void)
      +{
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "prune-packed", NULL);
     ++	struct child_process child = CHILD_PROCESS_INIT;
     ++
     ++	child.git_cmd = 1;
     ++	strvec_push(&child.args, "prune-packed");
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--quiet");
     ++		strvec_push(&child.args, "--quiet");
      +
     -+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     ++	return !!run_command(&child);
      +}
      +
      +struct write_loose_object_data {
     @@ builtin/gc.c: static int maintenance_task_gc(void)
      +	int batch_size;
      +};
      +
     -+static int loose_object_exists(const struct object_id *oid,
     -+			       const char *path,
     -+			       void *data)
     ++static int bail_on_loose(const struct object_id *oid,
     ++			 const char *path,
     ++			 void *data)
      +{
      +	return 1;
      +}
     @@ builtin/gc.c: static int maintenance_task_gc(void)
      +	struct repository *r = the_repository;
      +	int result = 0;
      +	struct write_loose_object_data data;
     -+	struct strbuf prefix = STRBUF_INIT;
     -+	struct child_process *pack_proc;
     ++	struct child_process pack_proc = CHILD_PROCESS_INIT;
      +
      +	/*
      +	 * Do not start pack-objects process
      +	 * if there are no loose objects.
      +	 */
      +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
     -+					   loose_object_exists,
     ++					   bail_on_loose,
      +					   NULL, NULL, NULL))
      +		return 0;
      +
     -+	pack_proc = xmalloc(sizeof(*pack_proc));
     -+
     -+	child_process_init(pack_proc);
     ++	pack_proc.git_cmd = 1;
      +
     -+	strbuf_addstr(&prefix, r->objects->odb->path);
     -+	strbuf_addstr(&prefix, "/pack/loose");
     -+
     -+	argv_array_pushl(&pack_proc->args, "git", "pack-objects", NULL);
     ++	strvec_push(&pack_proc.args, "pack-objects");
      +	if (opts.quiet)
     -+		argv_array_push(&pack_proc->args, "--quiet");
     -+	argv_array_push(&pack_proc->args, prefix.buf);
     ++		strvec_push(&pack_proc.args, "--quiet");
     ++	strvec_pushf(&pack_proc.args, "%s/pack/loose", r->objects->odb->path);
      +
     -+	pack_proc->in = -1;
     ++	pack_proc.in = -1;
      +
     -+	if (start_command(pack_proc)) {
     ++	if (start_command(&pack_proc)) {
      +		error(_("failed to start 'git pack-objects' process"));
     -+		result = 1;
     -+		goto cleanup;
     ++		return 1;
      +	}
      +
     -+	data.in = xfdopen(pack_proc->in, "w");
     ++	data.in = xfdopen(pack_proc.in, "w");
      +	data.count = 0;
      +	data.batch_size = 50000;
      +
     @@ builtin/gc.c: static int maintenance_task_gc(void)
      +
      +	fclose(data.in);
      +
     -+	if (finish_command(pack_proc)) {
     ++	if (finish_command(&pack_proc)) {
      +		error(_("failed to finish 'git pack-objects' process"));
      +		result = 1;
      +	}
      +
     -+cleanup:
     -+	strbuf_release(&prefix);
     -+	free(pack_proc);
      +	return result;
      +}
      +
     @@ builtin/gc.c: static int maintenance_task_gc(void)
       typedef int maintenance_task_fn(void);
       
       struct maintenance_task {
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 	tasks[num_tasks]->fn = maintenance_task_prefetch;
     - 	num_tasks++;
     +@@ builtin/gc.c: struct maintenance_task {
       
     -+	tasks[num_tasks]->name = "loose-objects";
     -+	tasks[num_tasks]->fn = maintenance_task_loose_objects;
     -+	num_tasks++;
     -+
     - 	tasks[num_tasks]->name = "gc";
     - 	tasks[num_tasks]->fn = maintenance_task_gc;
     - 	tasks[num_tasks]->enabled = 1;
     + enum maintenance_task_label {
     + 	TASK_PREFETCH,
     ++	TASK_LOOSE_OBJECTS,
     + 	TASK_GC,
     + 	TASK_COMMIT_GRAPH,
     + 
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 		"prefetch",
     + 		maintenance_task_prefetch,
     + 	},
     ++	[TASK_LOOSE_OBJECTS] = {
     ++		"loose-objects",
     ++		maintenance_task_loose_objects,
     ++	},
     + 	[TASK_GC] = {
     + 		"gc",
     + 		maintenance_task_gc,
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'prefetch multiple remotes' '
     @@ t/t7900-maintenance.sh: test_expect_success 'prefetch multiple remotes' '
      +
      +	# Hack to stop maintenance from running during "git commit"
      +	echo in use >.git/objects/maintenance.lock &&
     ++
     ++	# Assuming that "git commit" creates at least one loose object
      +	test_commit create-loose-object &&
      +	rm .git/objects/maintenance.lock &&
      +
     @@ t/t7900-maintenance.sh: test_expect_success 'prefetch multiple remotes' '
      +	test_cmp obj-dir-before obj-dir-between &&
      +	ls .git/objects/pack/*.pack >packs-between &&
      +	test_line_count = 2 packs-between &&
     ++	ls .git/objects/pack/loose-*.pack >loose-packs &&
     ++	test_line_count = 1 loose-packs &&
      +
      +	# The second run deletes loose objects
      +	# but does not create a pack-file.
  -:  ---------- > 11:  c2baf6e119 midx: enable core.multiPackIndex by default
 10:  b6328c2106 ! 12:  00f47c4848 maintenance: add incremental-repack task
     @@ Commit message
             size" is calculated by taking the size of the pack-file divided
             by the number of objects in the pack-file and multiplied by the
             number of objects from the multi-pack-index with offset in that
     -       pack-file. The expected size approximats how much data from that
     +       pack-file. The expected size approximates how much data from that
             pack-file will contribute to the resulting pack-file size. The
             intention is that the resulting pack-file will be close in size
             to the provided batch size.
     @@ builtin/gc.c
       
       #define FAILED_RUN "failed to run %s"
       
     -@@ builtin/gc.c: int cmd_gc(int argc, const char **argv, const char *prefix)
     - 	return 0;
     - }
     - 
     --#define MAX_NUM_TASKS 4
     -+#define MAX_NUM_TASKS 5
     - 
     - static const char * const builtin_maintenance_usage[] = {
     - 	N_("git maintenance run [<options>]"),
      @@ builtin/gc.c: static int maintenance_task_loose_objects(void)
       	return prune_packed() || pack_loose();
       }
       
      +static int multi_pack_index_write(void)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "multi-pack-index", "write", NULL);
     ++	struct child_process child = CHILD_PROCESS_INIT;
     ++
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "multi-pack-index", "write", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--no-progress");
     ++		strvec_push(&child.args, "--no-progress");
      +
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	argv_array_clear(&cmd);
     ++	if (run_command(&child))
     ++		return error(_("failed to write multi-pack-index"));
      +
     -+	return result;
     ++	return 0;
      +}
      +
      +static int rewrite_multi_pack_index(void)
     @@ builtin/gc.c: static int maintenance_task_loose_objects(void)
      +	unlink(midx_name);
      +	free(midx_name);
      +
     -+	if (multi_pack_index_write()) {
     -+		error(_("failed to rewrite multi-pack-index"));
     -+		return 1;
     -+	}
     -+
     -+	return 0;
     ++	return multi_pack_index_write();
      +}
      +
     -+static int multi_pack_index_verify(void)
     ++static int multi_pack_index_verify(const char *message)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "multi-pack-index", "verify", NULL);
     ++	struct child_process child = CHILD_PROCESS_INIT;
     ++
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "multi-pack-index", "verify", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--no-progress");
     ++		strvec_push(&child.args, "--no-progress");
      +
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	argv_array_clear(&cmd);
     ++	if (run_command(&child)) {
     ++		warning(_("'git multi-pack-index verify' failed %s"), message);
     ++		return 1;
     ++	}
      +
     -+	return result;
     ++	return 0;
      +}
      +
      +static int multi_pack_index_expire(void)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "multi-pack-index", "expire", NULL);
     ++	struct child_process child = CHILD_PROCESS_INIT;
     ++
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "multi-pack-index", "expire", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--no-progress");
     ++		strvec_push(&child.args, "--no-progress");
      +
      +	close_object_store(the_repository->objects);
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	argv_array_clear(&cmd);
      +
     -+	return result;
     ++	if (run_command(&child))
     ++		return error(_("'git multi-pack-index expire' failed"));
     ++
     ++	return 0;
      +}
      +
      +static int multi_pack_index_repack(void)
      +{
     -+	int result;
     -+	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
     ++	struct child_process child = CHILD_PROCESS_INIT;
     ++
     ++	child.git_cmd = 1;
     ++	strvec_pushl(&child.args, "multi-pack-index", "repack", NULL);
      +
      +	if (opts.quiet)
     -+		argv_array_push(&cmd, "--no-progress");
     ++		strvec_push(&child.args, "--no-progress");
      +
     -+	argv_array_push(&cmd, "--batch-size=0");
     ++	strvec_push(&child.args, "--batch-size=0");
      +
      +	close_object_store(the_repository->objects);
     -+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
      +
     -+	if (result && multi_pack_index_verify()) {
     -+		warning(_("multi-pack-index verify failed after repack"));
     -+		result = rewrite_multi_pack_index();
     -+	}
     ++	if (run_command(&child))
     ++		return error(_("'git multi-pack-index repack' failed"));
      +
     -+	return result;
     ++	return 0;
      +}
      +
      +static int maintenance_task_incremental_repack(void)
      +{
     -+	if (multi_pack_index_write()) {
     -+		error(_("failed to write multi-pack-index"));
     -+		return 1;
     ++	prepare_repo_settings(the_repository);
     ++	if (!the_repository->settings.core_multi_pack_index) {
     ++		warning(_("skipping incremental-repack task because core.multiPackIndex is disabled"));
     ++		return 0;
      +	}
      +
     -+	if (multi_pack_index_verify()) {
     -+		warning(_("multi-pack-index verify failed after initial write"));
     -+		return rewrite_multi_pack_index();
     -+	}
     -+
     -+	if (multi_pack_index_expire()) {
     -+		error(_("multi-pack-index expire failed"));
     ++	if (multi_pack_index_write())
      +		return 1;
     -+	}
     -+
     -+	if (multi_pack_index_verify()) {
     -+		warning(_("multi-pack-index verify failed after expire"));
     ++	if (multi_pack_index_verify("after initial write"))
      +		return rewrite_multi_pack_index();
     -+	}
     -+
     -+	if (multi_pack_index_repack()) {
     -+		error(_("multi-pack-index repack failed"));
     ++	if (multi_pack_index_expire())
      +		return 1;
     -+	}
     -+
     ++	if (multi_pack_index_verify("after expire step"))
     ++		return !!rewrite_multi_pack_index();
     ++	if (multi_pack_index_repack())
     ++		return 1;
     ++	if (multi_pack_index_verify("after repack step"))
     ++		return !!rewrite_multi_pack_index();
      +	return 0;
      +}
      +
       typedef int maintenance_task_fn(void);
       
       struct maintenance_task {
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 	tasks[num_tasks]->fn = maintenance_task_loose_objects;
     - 	num_tasks++;
     +@@ builtin/gc.c: struct maintenance_task {
     + enum maintenance_task_label {
     + 	TASK_PREFETCH,
     + 	TASK_LOOSE_OBJECTS,
     ++	TASK_INCREMENTAL_REPACK,
     + 	TASK_GC,
     + 	TASK_COMMIT_GRAPH,
       
     -+	tasks[num_tasks]->name = "incremental-repack";
     -+	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
     -+	num_tasks++;
     -+
     - 	tasks[num_tasks]->name = "gc";
     - 	tasks[num_tasks]->fn = maintenance_task_gc;
     - 	tasks[num_tasks]->enabled = 1;
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 		"loose-objects",
     + 		maintenance_task_loose_objects,
     + 	},
     ++	[TASK_INCREMENTAL_REPACK] = {
     ++		"incremental-repack",
     ++		maintenance_task_incremental_repack,
     ++	},
     + 	[TASK_GC] = {
     + 		"gc",
     + 		maintenance_task_gc,
      
       ## midx.c ##
      @@
     @@ midx.h: struct multi_pack_index {
       int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
       int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
      
     + ## t/t5319-multi-pack-index.sh ##
     +@@
     + test_description='multi-pack-indexes'
     + . ./test-lib.sh
     + 
     ++GIT_TEST_MULTI_PACK_INDEX=0
     + objdir=.git/objects
     + 
     + midx_read_expect () {
     +
       ## t/t7900-maintenance.sh ##
     +@@ t/t7900-maintenance.sh: test_description='git maintenance builtin'
     + . ./test-lib.sh
     + 
     + GIT_TEST_COMMIT_GRAPH=0
     ++GIT_TEST_MULTI_PACK_INDEX=0
     + 
     + test_expect_success 'help text' '
     + 	test_expect_code 129 git maintenance -h 2>err &&
      @@ t/t7900-maintenance.sh: test_expect_success 'loose-objects task' '
       	test_cmp packs-between packs-after
       '
 11:  478c7f1d0b ! 13:  ef2a231956 maintenance: auto-size incremental-repack batch
     @@ Commit message
          When repacking during the 'incremental-repack' task, we use the
          --batch-size option in 'git multi-pack-index repack'. The initial setting
          used --batch-size=0 to repack everything into a single pack-file. This is
     -    not sustaintable for a large repository. The amount of work required is
     +    not sustainable for a large repository. The amount of work required is
          also likely to use too many system resources for a background job.
      
          Update the 'incremental-repack' task by dynamically computing a
     @@ Commit message
      
       ## builtin/gc.c ##
      @@ builtin/gc.c: static int multi_pack_index_expire(void)
     - 	return result;
     + 	return 0;
       }
       
     -+#define TWO_GIGABYTES (2147483647)
     -+#define UNSET_BATCH_SIZE ((unsigned long)-1)
     ++#define TWO_GIGABYTES (0x7FFF)
      +
      +static off_t get_auto_pack_size(void)
      +{
     @@ builtin/gc.c: static int multi_pack_index_expire(void)
      +
       static int multi_pack_index_repack(void)
       {
     - 	int result;
     - 	struct argv_array cmd = ARGV_ARRAY_INIT;
     -+	struct strbuf batch_arg = STRBUF_INIT;
     -+
     - 	argv_array_pushl(&cmd, "multi-pack-index", "repack", NULL);
     - 
     + 	struct child_process child = CHILD_PROCESS_INIT;
     +@@ builtin/gc.c: static int multi_pack_index_repack(void)
       	if (opts.quiet)
     - 		argv_array_push(&cmd, "--no-progress");
     + 		strvec_push(&child.args, "--no-progress");
       
     --	argv_array_push(&cmd, "--batch-size=0");
     -+	strbuf_addf(&batch_arg, "--batch-size=%"PRIuMAX,
     -+		    (uintmax_t)get_auto_pack_size());
     -+	argv_array_push(&cmd, batch_arg.buf);
     +-	strvec_push(&child.args, "--batch-size=0");
     ++	strvec_pushf(&child.args, "--batch-size=%"PRIuMAX,
     ++				  (uintmax_t)get_auto_pack_size());
       
       	close_object_store(the_repository->objects);
     - 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
     -+	strbuf_release(&batch_arg);
       
     - 	if (result && multi_pack_index_verify()) {
     - 		warning(_("multi-pack-index verify failed after repack"));
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'incremental-repack task' '
 12:  a3c64930a0 ! 14:  99840c4b8f maintenance: create maintenance.<task>.enabled config
     @@ Documentation/git-maintenance.txt: SUBCOMMANDS
      
       ## builtin/gc.c ##
      @@ builtin/gc.c: static int maintenance_run(void)
     - static void initialize_tasks(void)
     - {
     - 	int i;
     -+	struct strbuf config_name = STRBUF_INIT;
     - 	num_tasks = 0;
     + 	return result;
     + }
       
     - 	for (i = 0; i < MAX_NUM_TASKS; i++)
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 	tasks[num_tasks]->name = "commit-graph";
     - 	tasks[num_tasks]->fn = maintenance_task_commit_graph;
     - 	num_tasks++;
     -+
     -+	for (i = 0; i < num_tasks; i++) {
     ++static void initialize_task_config(void)
     ++{
     ++	int i;
     ++	struct strbuf config_name = STRBUF_INIT;
     ++	for (i = 0; i < TASK__COUNT; i++) {
      +		int config_value;
      +
      +		strbuf_setlen(&config_name, 0);
     -+		strbuf_addf(&config_name, "maintenance.%s.enabled", tasks[i]->name);
     ++		strbuf_addf(&config_name, "maintenance.%s.enabled",
     ++			    tasks[i].name);
      +
      +		if (!git_config_get_bool(config_name.buf, &config_value))
     -+			tasks[i]->enabled = config_value;
     ++			tasks[i].enabled = config_value;
      +	}
      +
      +	strbuf_release(&config_name);
     - }
     - 
     ++}
     ++
       static int task_option_parse(const struct option *opt,
     + 			     const char *arg, int unset)
     + {
     +@@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
     + 				   builtin_maintenance_options);
     + 
     + 	opts.quiet = !isatty(2);
     ++	initialize_task_config();
     + 
     + 	argc = parse_options(argc, argv, prefix,
     + 			     builtin_maintenance_options,
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'run [--auto|--quiet]' '
 13:  dbacc2b76c ! 15:  a087c63572 maintenance: use pointers to check --auto
     @@ builtin/gc.c: static int maintenance_task_incremental_repack(void)
       	const char *name;
       	maintenance_task_fn *fn;
      +	maintenance_auto_fn *auto_condition;
     - 	int task_order;
       	unsigned enabled:1,
       		 selected:1;
     + 	int selected_order;
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 	[TASK_GC] = {
     + 		"gc",
     + 		maintenance_task_gc,
     ++		need_to_gc,
     + 		1,
     + 	},
     + 	[TASK_COMMIT_GRAPH] = {
      @@ builtin/gc.c: static int maintenance_run(void)
     - 		if (!opts.tasks_selected && !tasks[i]->enabled)
     + 		if (!opts.tasks_selected && !tasks[i].enabled)
       			continue;
       
      +		if (opts.auto_flag &&
     -+		    (!tasks[i]->auto_condition ||
     -+		     !tasks[i]->auto_condition()))
     ++		    (!tasks[i].auto_condition ||
     ++		     !tasks[i].auto_condition()))
      +			continue;
      +
     - 		result = tasks[i]->fn();
     - 	}
     - 
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 
     - 	tasks[num_tasks]->name = "gc";
     - 	tasks[num_tasks]->fn = maintenance_task_gc;
     -+	tasks[num_tasks]->auto_condition = need_to_gc;
     - 	tasks[num_tasks]->enabled = 1;
     - 	num_tasks++;
     - 
     -@@ builtin/gc.c: int cmd_maintenance(int argc, const char **argv, const char *prefix)
     - 				   builtin_maintenance_options);
     - 
     - 	opts.quiet = !isatty(2);
     + 		if (tasks[i].fn()) {
     + 			error(_("task '%s' failed"), tasks[i].name);
     + 			result = 1;
     +@@ builtin/gc.c: static void initialize_task_config(void)
     + {
     + 	int i;
     + 	struct strbuf config_name = STRBUF_INIT;
      +	gc_config();
     - 	initialize_tasks();
     ++
     + 	for (i = 0; i < TASK__COUNT; i++) {
     + 		int config_value;
       
     - 	argc = parse_options(argc, argv, prefix,
      
       ## t/t5514-fetch-multiple.sh ##
      @@ t/t5514-fetch-multiple.sh: test_expect_success 'git fetch --multiple (two remotes)' '
 14:  9af2309f08 ! 16:  ef3a854508 maintenance: add auto condition for commit-graph task
     @@ builtin/gc.c: static struct maintenance_opts {
      +
       static int run_write_commit_graph(void)
       {
     - 	int result;
     -@@ builtin/gc.c: static void initialize_tasks(void)
     + 	struct child_process child = CHILD_PROCESS_INIT;
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 	[TASK_COMMIT_GRAPH] = {
     + 		"commit-graph",
     + 		maintenance_task_commit_graph,
     ++		should_write_commit_graph,
     + 	},
     + };
       
     - 	tasks[num_tasks]->name = "commit-graph";
     - 	tasks[num_tasks]->fn = maintenance_task_commit_graph;
     -+	tasks[num_tasks]->auto_condition = should_write_commit_graph;
     - 	num_tasks++;
     - 
     - 	for (i = 0; i < num_tasks; i++) {
      
       ## object.h ##
      @@ object.h: struct object_array {
 15:  42e316ca58 ! 17:  6ac3a58f2f maintenance: create auto condition for loose-objects
     @@ builtin/gc.c: struct write_loose_object_data {
      +					     NULL, NULL, &count);
      +}
      +
     - static int loose_object_exists(const struct object_id *oid,
     - 			       const char *path,
     - 			       void *data)
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 
     - 	tasks[num_tasks]->name = "loose-objects";
     - 	tasks[num_tasks]->fn = maintenance_task_loose_objects;
     -+	tasks[num_tasks]->auto_condition = loose_object_auto_condition;
     - 	num_tasks++;
     - 
     - 	tasks[num_tasks]->name = "incremental-repack";
     + static int bail_on_loose(const struct object_id *oid,
     + 			 const char *path,
     + 			 void *data)
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 	[TASK_LOOSE_OBJECTS] = {
     + 		"loose-objects",
     + 		maintenance_task_loose_objects,
     ++		loose_object_auto_condition,
     + 	},
     + 	[TASK_INCREMENTAL_REPACK] = {
     + 		"incremental-repack",
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'loose-objects task' '
 16:  3d527cb0dd ! 18:  801b262d1c maintenance: add incremental-repack auto condition
     @@ builtin/gc.c: static int maintenance_task_loose_objects(void)
      +
       static int multi_pack_index_write(void)
       {
     - 	int result;
     -@@ builtin/gc.c: static void initialize_tasks(void)
     - 
     - 	tasks[num_tasks]->name = "incremental-repack";
     - 	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
     -+	tasks[num_tasks]->auto_condition = incremental_repack_auto_condition;
     - 	num_tasks++;
     - 
     - 	tasks[num_tasks]->name = "gc";
     + 	struct child_process child = CHILD_PROCESS_INIT;
     +@@ builtin/gc.c: static struct maintenance_task tasks[] = {
     + 	[TASK_INCREMENTAL_REPACK] = {
     + 		"incremental-repack",
     + 		maintenance_task_incremental_repack,
     ++		incremental_repack_auto_condition,
     + 	},
     + 	[TASK_GC] = {
     + 		"gc",
      
       ## t/t7900-maintenance.sh ##
      @@ t/t7900-maintenance.sh: test_expect_success 'incremental-repack task' '
 17:  a0f00f8ab8 = 19:  9b4cef7635 midx: use start_delayed_progress()
 18:  f24db7739f ! 20:  39eb83ad1e maintenance: add trace2 regions for task execution
     @@ Commit message
      
       ## builtin/gc.c ##
      @@ builtin/gc.c: static int maintenance_run(void)
     - 		     !tasks[i]->auto_condition()))
     + 		     !tasks[i].auto_condition()))
       			continue;
       
     -+		trace2_region_enter("maintenance", tasks[i]->name, r);
     - 		result = tasks[i]->fn();
     -+		trace2_region_leave("maintenance", tasks[i]->name, r);
     ++		trace2_region_enter("maintenance", tasks[i].name, r);
     + 		if (tasks[i].fn()) {
     + 			error(_("task '%s' failed"), tasks[i].name);
     + 			result = 1;
     + 		}
     ++		trace2_region_leave("maintenance", tasks[i].name, r);
       	}
       
       	rollback_lock_file(&lk);

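For readers skimming the range-diff above, here is a minimal, self-contained
sketch of the task-table pattern that v3 moves to: a static array indexed by
an enum, a --task style selection that records ordering, and a run loop that
keeps going after a failing task. It is not the code from builtin/gc.c. The
helper names (select_task, run_tasks, ok_task, failing_task, always) and the
duplicate-selection check are assumptions made for illustration; only the
field names and the three skip conditions mirror the hunks quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

typedef int task_fn(void);
typedef int auto_fn(void);

struct task {
	const char *name;
	task_fn *fn;
	auto_fn *auto_condition;	/* NULL: never runs under --auto */
	unsigned enabled:1, selected:1;
	int selected_order;
};

enum task_label { TASK_PREFETCH, TASK_GC, TASK_COMMIT_GRAPH, TASK__COUNT };

static int runs;	/* hypothetical counter: how many task bodies ran */
static int ok_task(void) { runs++; return 0; }
static int failing_task(void) { runs++; return 1; }
static int always(void) { return 1; }

/* Designated initializers keep each entry tied to its enum label. */
static struct task tasks[] = {
	[TASK_PREFETCH] = { "prefetch", failing_task },
	[TASK_GC] = { "gc", ok_task, always, 1 },
	[TASK_COMMIT_GRAPH] = { "commit-graph", ok_task },
};

static int tasks_selected;

/* Returns 0 on success, 1 for an unknown or repeated task name. */
static int select_task(const char *arg)
{
	int i;

	for (i = 0; i < TASK__COUNT; i++) {
		if (strcasecmp(tasks[i].name, arg))
			continue;
		if (tasks[i].selected)
			return 1;
		tasks[i].selected = 1;
		tasks[i].selected_order = ++tasks_selected;
		return 0;
	}
	return 1;
}

/* Selected tasks first, in the order they were named. */
static int compare_by_selection(const void *a, const void *b)
{
	const struct task *x = a, *y = b;

	if (x->selected != y->selected)
		return (int)y->selected - (int)x->selected;
	return x->selected_order - y->selected_order;
}

static int run_tasks(int auto_flag)
{
	int i, result = 0;

	if (tasks_selected)
		qsort(tasks, TASK__COUNT, sizeof(tasks[0]),
		      compare_by_selection);

	for (i = 0; i < TASK__COUNT; i++) {
		if (tasks_selected && !tasks[i].selected)
			continue;
		if (!tasks_selected && !tasks[i].enabled)
			continue;
		if (auto_flag &&
		    (!tasks[i].auto_condition || !tasks[i].auto_condition()))
			continue;
		if (tasks[i].fn())
			result = 1;	/* remember the failure, keep going */
	}
	return result;
}
```

Note that sorting the array in place means the enum labels no longer index
the matching entries afterwards, so lookups after the sort have to go by
name, as the selection helper in this sketch does.
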
-- 
gitgitgadget

* [PATCH v3 01/20] maintenance: create basic maintenance runner
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'gc' builtin is our current entrypoint for automatically maintaining
a repository. This one tool does many operations, such as repacking the
repository, packing refs, and rewriting the commit-graph file. The name
implies it performs "garbage collection" which means several different
things, and some users may not want to use this operation that rewrites
the entire object database.

Create a new 'maintenance' builtin that will become a more general-
purpose command. To start, it will only support the 'run' subcommand,
but will later expand to add subcommands for scheduling maintenance in
the background.

For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin.
In fact, the only option is the '--auto' toggle, which is handed
directly to the 'gc' builtin. The current change is isolated to this
simple operation to prevent more interesting logic from being lost in
all of the boilerplate of adding a new builtin.

Use the existing builtin/gc.c file because we want to share code
between the two builtins. It is possible that we will have
'maintenance' replace the 'gc' builtin entirely at some point, leaving
'git gc' as an alias for some specific arguments to 'git maintenance
run'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                        |  1 +
 Documentation/git-maintenance.txt | 57 +++++++++++++++++++++++++++++++
 builtin.h                         |  1 +
 builtin/gc.c                      | 56 ++++++++++++++++++++++++++++++
 git.c                             |  1 +
 t/t7900-maintenance.sh            | 19 +++++++++++
 6 files changed, 135 insertions(+)
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh

diff --git a/.gitignore b/.gitignore
index ee509a2ad2..a5808fa30d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -90,6 +90,7 @@
 /git-ls-tree
 /git-mailinfo
 /git-mailsplit
+/git-maintenance
 /git-merge
 /git-merge-base
 /git-merge-index
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
new file mode 100644
index 0000000000..34cd2b4417
--- /dev/null
+++ b/Documentation/git-maintenance.txt
@@ -0,0 +1,57 @@
+git-maintenance(1)
+==================
+
+NAME
+----
+git-maintenance - Run tasks to optimize Git repository data
+
+
+SYNOPSIS
+--------
+[verse]
+'git maintenance' run [<options>]
+
+
+DESCRIPTION
+-----------
+Run tasks to optimize Git repository data, speeding up other Git commands
+and reducing storage requirements for the repository.
++
+Git commands that add repository data, such as `git add` or `git fetch`,
+are optimized for a responsive user experience. These commands do not take
+time to optimize the Git data, since such optimizations scale with the full
+size of the repository while these user commands each perform a relatively
+small action.
++
+The `git maintenance` command provides flexibility for how to optimize the
+Git repository.
+
+SUBCOMMANDS
+-----------
+
+run::
+	Run one or more maintenance tasks.
+
+TASKS
+-----
+
+gc::
+	Cleanup unnecessary files and optimize the local repository. "GC"
+	stands for "garbage collection," but this task performs many
+	smaller tasks. This task can be rather expensive for large
+	repositories, as it repacks all Git objects into a single pack-file.
+	It can also be disruptive in some situations, as it deletes stale
+	data.
+
+OPTIONS
+-------
+--auto::
+	When combined with the `run` subcommand, run maintenance tasks
+	only if certain thresholds are met. For example, the `gc` task
+	runs when the number of loose objects exceeds the number stored
+	in the `gc.auto` config setting, or when the number of pack-files
+	exceeds the `gc.autoPackLimit` config setting.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/builtin.h b/builtin.h
index a5ae15bfe5..17c1c0ce49 100644
--- a/builtin.h
+++ b/builtin.h
@@ -167,6 +167,7 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix);
 int cmd_ls_remote(int argc, const char **argv, const char *prefix);
 int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 int cmd_mailsplit(int argc, const char **argv, const char *prefix);
+int cmd_maintenance(int argc, const char **argv, const char *prefix);
 int cmd_merge(int argc, const char **argv, const char *prefix);
 int cmd_merge_base(int argc, const char **argv, const char *prefix);
 int cmd_merge_index(int argc, const char **argv, const char *prefix);
diff --git a/builtin/gc.c b/builtin/gc.c
index 10346e0465..be9557452e 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -699,3 +699,59 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	return 0;
 }
+
+static const char * const builtin_maintenance_usage[] = {
+	N_("git maintenance run [<options>]"),
+	NULL
+};
+
+static struct maintenance_opts {
+	int auto_flag;
+} opts;
+
+static int maintenance_task_gc(void)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_push(&child.args, "gc");
+
+	if (opts.auto_flag)
+		strvec_push(&child.args, "--auto");
+
+	close_object_store(the_repository->objects);
+	return run_command(&child);
+}
+
+static int maintenance_run(void)
+{
+	return maintenance_task_gc();
+}
+
+int cmd_maintenance(int argc, const char **argv, const char *prefix)
+{
+	static struct option builtin_maintenance_options[] = {
+		OPT_BOOL(0, "auto", &opts.auto_flag,
+			 N_("run tasks based on the state of the repository")),
+		OPT_END()
+	};
+
+	memset(&opts, 0, sizeof(opts));
+
+	if (argc == 2 && !strcmp(argv[1], "-h"))
+		usage_with_options(builtin_maintenance_usage,
+				   builtin_maintenance_options);
+
+	argc = parse_options(argc, argv, prefix,
+			     builtin_maintenance_options,
+			     builtin_maintenance_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+
+	if (argc == 1) {
+		if (!strcmp(argv[0], "run"))
+			return maintenance_run();
+	}
+
+	usage_with_options(builtin_maintenance_usage,
+			   builtin_maintenance_options);
+}
diff --git a/git.c b/git.c
index 832688ca23..5bb9645403 100644
--- a/git.c
+++ b/git.c
@@ -529,6 +529,7 @@ static struct cmd_struct commands[] = {
 	{ "ls-tree", cmd_ls_tree, RUN_SETUP },
 	{ "mailinfo", cmd_mailinfo, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "mailsplit", cmd_mailsplit, NO_PARSEOPT },
+	{ "maintenance", cmd_maintenance, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "merge", cmd_merge, RUN_SETUP | NEED_WORK_TREE },
 	{ "merge-base", cmd_merge_base, RUN_SETUP },
 	{ "merge-file", cmd_merge_file, RUN_SETUP_GENTLY },
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
new file mode 100755
index 0000000000..0e864eaaed
--- /dev/null
+++ b/t/t7900-maintenance.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='git maintenance builtin'
+
+. ./test-lib.sh
+
+test_expect_success 'help text' '
+	test_expect_code 129 git maintenance -h 2>err &&
+	test_i18ngrep "usage: git maintenance run" err
+'
+
+test_expect_success 'run [--auto]' '
+	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
+	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
+	grep ",\"gc\"]" run-no-auto.txt  &&
+	grep ",\"gc\",\"--auto\"]" run-auto.txt
+'
+
+test_done
-- 
gitgitgadget


* [PATCH v3 02/20] maintenance: add --quiet option
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 01/20] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 03/20] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
                       ` (20 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Maintenance activities are commonly used as steps in larger scripts.
Providing a '--quiet' option allows those scripts to be less noisy when
run on a terminal window. Turn this mode on by default when stderr is
not a terminal.

Pass the option through to the 'git gc' child process.
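
As a rough illustration of the default described above, here is a
minimal sketch of the decision. The helper names resolve_quiet() and
default_quiet() and the tri-state 'explicit_quiet' argument are
hypothetical and do not appear in this patch; the patch itself simply
assigns '!isatty(2)' before option parsing so that an explicit
'--quiet' or '--no-quiet' overrides it.

```c
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * explicit_quiet is -1 when neither --quiet nor --no-quiet was given,
 * otherwise 0 (--no-quiet) or 1 (--quiet). stderr_is_tty is nonzero
 * when stderr is attached to a terminal.
 */
int resolve_quiet(int explicit_quiet, int stderr_is_tty)
{
	/* An explicit command-line choice always wins. */
	if (explicit_quiet >= 0)
		return explicit_quiet;

	/* Otherwise be quiet exactly when stderr is not a terminal. */
	return !stderr_is_tty;
}

/* Same decision, using the real state of stderr. */
int default_quiet(int explicit_quiet)
{
	return resolve_quiet(explicit_quiet, isatty(STDERR_FILENO));
}
```

Splitting out the terminal check as an argument keeps the rule easy to
exercise without depending on how the process was launched.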

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 3 +++
 builtin/gc.c                      | 7 +++++++
 t/t7900-maintenance.sh            | 8 +++++---
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 34cd2b4417..089fa4cedc 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -52,6 +52,9 @@ OPTIONS
 	in the `gc.auto` config setting, or when the number of pack-files
 	exceeds the `gc.autoPackLimit` config setting.
 
+--quiet::
+	Do not report progress or other information over `stderr`.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/gc.c b/builtin/gc.c
index be9557452e..3c277f9f9c 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -707,6 +707,7 @@ static const char * const builtin_maintenance_usage[] = {
 
 static struct maintenance_opts {
 	int auto_flag;
+	int quiet;
 } opts;
 
 static int maintenance_task_gc(void)
@@ -718,6 +719,8 @@ static int maintenance_task_gc(void)
 
 	if (opts.auto_flag)
 		strvec_push(&child.args, "--auto");
+	if (opts.quiet)
+		strvec_push(&child.args, "--quiet");
 
 	close_object_store(the_repository->objects);
 	return run_command(&child);
@@ -733,6 +736,8 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 	static struct option builtin_maintenance_options[] = {
 		OPT_BOOL(0, "auto", &opts.auto_flag,
 			 N_("run tasks based on the state of the repository")),
+		OPT_BOOL(0, "quiet", &opts.quiet,
+			 N_("do not report progress or other information over stderr")),
 		OPT_END()
 	};
 
@@ -742,6 +747,8 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_maintenance_usage,
 				   builtin_maintenance_options);
 
+	opts.quiet = !isatty(2);
+
 	argc = parse_options(argc, argv, prefix,
 			     builtin_maintenance_options,
 			     builtin_maintenance_usage,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 0e864eaaed..f08eee0977 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -9,11 +9,13 @@ test_expect_success 'help text' '
 	test_i18ngrep "usage: git maintenance run" err
 '
 
-test_expect_success 'run [--auto]' '
-	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
+test_expect_success 'run [--auto|--quiet]' '
+	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
 	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
+	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&
 	grep ",\"gc\"]" run-no-auto.txt  &&
-	grep ",\"gc\",\"--auto\"]" run-auto.txt
+	grep ",\"gc\",\"--auto\"" run-auto.txt &&
+	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
 test_done
-- 
gitgitgadget


* [PATCH v3 03/20] maintenance: replace run_auto_gc()
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 01/20] maintenance: create basic maintenance runner Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 02/20] maintenance: add --quiet option Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 04/20] maintenance: initialize task array Derrick Stolee via GitGitGadget
                       ` (19 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The run_auto_gc() method is used in several places to trigger a check
for repo maintenance after some Git commands, such as 'git commit' or
'git fetch'.

To allow for extra customization of this maintenance activity, replace
the 'git gc --auto [--quiet]' call with one to 'git maintenance run
--auto [--quiet]'. As we extend the maintenance builtin with other
steps, users will be able to select different maintenance activities.

Rename run_auto_gc() to run_auto_maintenance() to be clearer what is
happening on this call, and to expose all callers in the current diff.
Rewrite the method to use a struct child_process to simplify the calls
slightly.

Since 'git fetch' already allows disabling the 'git gc --auto'
subprocess, add an equivalent option with a different name to be more
descriptive of the new behavior: '--[no-]auto-maintenance'. Update the
documentation to include these options at the same time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/fetch-options.txt |  6 ++++--
 Documentation/git-clone.txt     |  6 +++---
 builtin/am.c                    |  2 +-
 builtin/commit.c                |  2 +-
 builtin/fetch.c                 |  6 ++++--
 builtin/merge.c                 |  2 +-
 builtin/rebase.c                |  4 ++--
 run-command.c                   | 16 +++++++---------
 run-command.h                   |  2 +-
 t/t5510-fetch.sh                |  2 +-
 10 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 6e2a160a47..495bc8ab5a 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -86,9 +86,11 @@ ifndef::git-pull[]
 	Allow several <repository> and <group> arguments to be
 	specified. No <refspec>s may be specified.
 
+--[no-]auto-maintenance::
 --[no-]auto-gc::
-	Run `git gc --auto` at the end to perform garbage collection
-	if needed. This is enabled by default.
+	Run `git maintenance run --auto` at the end to perform automatic
+	repository maintenance if needed. (`--[no-]auto-gc` is a synonym.)
+	This is enabled by default.
 
 --[no-]write-commit-graph::
 	Write a commit-graph after fetching. This overrides the config
diff --git a/Documentation/git-clone.txt b/Documentation/git-clone.txt
index c898310099..097e6a86c5 100644
--- a/Documentation/git-clone.txt
+++ b/Documentation/git-clone.txt
@@ -78,9 +78,9 @@ repository using this option and then delete branches (or use any
 other Git command that makes any existing commit unreferenced) in the
 source repository, some objects may become unreferenced (or dangling).
 These objects may be removed by normal Git operations (such as `git commit`)
-which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
-If these objects are removed and were referenced by the cloned repository,
-then the cloned repository will become corrupt.
+which automatically call `git maintenance run --auto`. (See
+linkgit:git-maintenance[1].) If these objects are removed and were referenced
+by the cloned repository, then the cloned repository will become corrupt.
 +
 Note that running `git repack` without the `--local` option in a repository
 cloned with `--shared` will copy objects from the source repository into a pack
diff --git a/builtin/am.c b/builtin/am.c
index 3f2adb3822..2ca363f72e 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -1795,7 +1795,7 @@ static void am_run(struct am_state *state, int resume)
 	if (!state->rebasing) {
 		am_destroy(state);
 		close_object_store(the_repository->objects);
-		run_auto_gc(state->quiet);
+		run_auto_maintenance(state->quiet);
 	}
 }
 
diff --git a/builtin/commit.c b/builtin/commit.c
index 01105ce8b0..9705bfb0cf 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1702,7 +1702,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
 	git_test_write_commit_graph_or_die();
 
 	repo_rerere(the_repository, 0);
-	run_auto_gc(quiet);
+	run_auto_maintenance(quiet);
 	run_commit_hook(use_editor, get_index_file(), "post-commit", NULL);
 	if (amend && !no_post_rewrite) {
 		commit_post_rewrite(the_repository, current_head, &oid);
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 7953a1a25b..c7c8ac0861 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
 	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
 			N_("report that we have only objects reachable from this object")),
 	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+	OPT_BOOL(0, "auto-maintenance", &enable_auto_gc,
+		 N_("run 'maintenance --auto' after fetching")),
 	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
-		 N_("run 'gc --auto' after fetching")),
+		 N_("run 'maintenance --auto' after fetching")),
 	OPT_BOOL(0, "show-forced-updates", &fetch_show_forced_updates,
 		 N_("check for forced-updates on all updated branches")),
 	OPT_BOOL(0, "write-commit-graph", &fetch_write_commit_graph,
@@ -1882,7 +1884,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	close_object_store(the_repository->objects);
 
 	if (enable_auto_gc)
-		run_auto_gc(verbosity < 0);
+		run_auto_maintenance(verbosity < 0);
 
 	return result;
 }
diff --git a/builtin/merge.c b/builtin/merge.c
index 7da707bf55..c068e73037 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -457,7 +457,7 @@ static void finish(struct commit *head_commit,
 			 * user should see them.
 			 */
 			close_object_store(the_repository->objects);
-			run_auto_gc(verbosity < 0);
+			run_auto_maintenance(verbosity < 0);
 		}
 	}
 	if (new_head && show_diffstat) {
diff --git a/builtin/rebase.c b/builtin/rebase.c
index 494107a648..d14d18191b 100644
--- a/builtin/rebase.c
+++ b/builtin/rebase.c
@@ -728,10 +728,10 @@ static int finish_rebase(struct rebase_options *opts)
 	apply_autostash(state_dir_path("autostash", opts));
 	close_object_store(the_repository->objects);
 	/*
-	 * We ignore errors in 'gc --auto', since the
+	 * We ignore errors in 'git maintenance run --auto', since the
 	 * user should see them.
 	 */
-	run_auto_gc(!(opts->flags & (REBASE_NO_QUIET|REBASE_VERBOSE)));
+	run_auto_maintenance(!(opts->flags & (REBASE_NO_QUIET|REBASE_VERBOSE)));
 	if (opts->type == REBASE_MERGE) {
 		struct replay_opts replay = REPLAY_OPTS_INIT;
 
diff --git a/run-command.c b/run-command.c
index 30104a4ee1..b7e1f1dd5a 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1866,15 +1866,13 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
 	return result;
 }
 
-int run_auto_gc(int quiet)
+int run_auto_maintenance(int quiet)
 {
-	struct strvec argv_gc_auto = STRVEC_INIT;
-	int status;
+	struct child_process maint = CHILD_PROCESS_INIT;
 
-	strvec_pushl(&argv_gc_auto, "gc", "--auto", NULL);
-	if (quiet)
-		strvec_push(&argv_gc_auto, "--quiet");
-	status = run_command_v_opt(argv_gc_auto.items, RUN_GIT_CMD);
-	strvec_clear(&argv_gc_auto);
-	return status;
+	maint.git_cmd = 1;
+	strvec_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
+	strvec_push(&maint.args, quiet ? "--quiet" : "--no-quiet");
+
+	return run_command(&maint);
 }
diff --git a/run-command.h b/run-command.h
index 8b9bfaef16..6472b38bde 100644
--- a/run-command.h
+++ b/run-command.h
@@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
 /*
  * Trigger an auto-gc
  */
-int run_auto_gc(int quiet);
+int run_auto_maintenance(int quiet);
 
 #define RUN_COMMAND_NO_STDIN 1
 #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index a66dbe0bde..9850ecde5d 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
 		git config fetch.unpackLimit 1 &&
 		git config gc.autoPackLimit 1 &&
 		git config gc.autoDetach false &&
-		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
+		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
 		test_i18ngrep "Auto packing the repository" fetch.out &&
 		! grep "Should I try again" fetch.out
 	)
-- 
gitgitgadget


* [PATCH v3 04/20] maintenance: initialize task array
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2020-07-30 22:24     ` [PATCH v3 03/20] maintenance: replace run_auto_gc() Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 05/20] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
                       ` (18 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of implementing multiple maintenance tasks inside the
'maintenance' builtin, use a list of structs to describe the work to be
done.

The struct maintenance_task stores the name of the task (as given by a
future command-line argument) along with a function pointer to its
implementation and a boolean for whether the step is enabled.

A list of these structs is initialized with the full list of implemented
tasks along with a default order. For now, this list only contains the
"gc" task. This task is also the only task enabled by default.

The run subcommand will return a nonzero exit code if any task fails.
However, it will attempt all tasks in its loop before returning with the
failure. Each failed task also emits an error message.
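
The failure semantics can be sketched in isolation as follows. This is
a standalone approximation, not the code in builtin/gc.c: task_ok(),
task_fail(), run_tasks() and the 'tasks_attempted' counter are
hypothetical names used only to show that a failing task does not stop
the loop and that disabled tasks are skipped.

```c
#include <stdio.h>

typedef int task_fn(void);

struct task {
	const char *name;
	task_fn *fn;
	unsigned enabled:1;
};

/* Counts how many task functions actually ran (illustration only). */
int tasks_attempted;

/* Hypothetical stand-in tasks; a nonzero return means failure. */
int task_ok(void) { tasks_attempted++; return 0; }
int task_fail(void) { tasks_attempted++; return 1; }

/*
 * Run every enabled task in order. A failure is reported and
 * remembered, but later tasks are still attempted.
 */
int run_tasks(const struct task *tasks, size_t nr)
{
	size_t i;
	int result = 0;

	for (i = 0; i < nr; i++) {
		if (!tasks[i].enabled)
			continue;

		if (tasks[i].fn()) {
			fprintf(stderr, "error: task '%s' failed\n",
				tasks[i].name);
			result = 1;
		}
	}

	return result;
}
```

With a failing task first, a passing task second and a disabled task
third, run_tasks() returns 1 and exactly two tasks are attempted.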

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 3c277f9f9c..0f15162825 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -726,9 +726,45 @@ static int maintenance_task_gc(void)
 	return run_command(&child);
 }
 
+typedef int maintenance_task_fn(void);
+
+struct maintenance_task {
+	const char *name;
+	maintenance_task_fn *fn;
+	unsigned enabled:1;
+};
+
+enum maintenance_task_label {
+	TASK_GC,
+
+	/* Leave as final value */
+	TASK__COUNT
+};
+
+static struct maintenance_task tasks[] = {
+	[TASK_GC] = {
+		"gc",
+		maintenance_task_gc,
+		1,
+	},
+};
+
 static int maintenance_run(void)
 {
-	return maintenance_task_gc();
+	int i;
+	int result = 0;
+
+	for (i = 0; i < TASK__COUNT; i++) {
+		if (!tasks[i].enabled)
+			continue;
+
+		if (tasks[i].fn()) {
+			error(_("task '%s' failed"), tasks[i].name);
+			result = 1;
+		}
+	}
+
+	return result;
 }
 
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
-- 
gitgitgadget


* [PATCH v3 05/20] maintenance: add commit-graph task
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2020-07-30 22:24     ` [PATCH v3 04/20] maintenance: initialize task array Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 06/20] maintenance: add --task option Derrick Stolee via GitGitGadget
                       ` (17 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The first new task in the 'git maintenance' builtin is the
'commit-graph' job. It is based on the sequence of events in the
'commit-graph' job in Scalar [1]. This sequence is as follows:

1. git commit-graph write --reachable --split
2. git commit-graph verify --shallow
3. If the verify succeeds, stop.
4. Delete the commit-graph-chain file.
5. git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes most
of the time, with the longer writes following a power law
distribution.
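
To make the merge rule concrete, here is a small sketch that tracks
only the number of commits per layer. It assumes the rule exactly as
stated above (keep merging while the layer below is at most twice the
size of the new top layer) and ignores any other limits that the real
commit-graph code may apply. The function name add_layer() is
hypothetical.

```c
#include <stddef.h>

/*
 * layers[0..nr-1] hold commit counts from the bottom layer to the top.
 * Add new_commits as a new top layer, merging downward until the top
 * layer is less than half the size of the layer below it.
 *
 * The caller must leave room for one more entry in layers[].
 * Returns the new number of layers.
 */
size_t add_layer(unsigned long *layers, size_t nr,
		 unsigned long new_commits)
{
	unsigned long top = new_commits;

	/* Merge while the layer below is not more than twice the top. */
	while (nr > 0 && layers[nr - 1] <= 2 * top) {
		top += layers[nr - 1];
		nr--;
	}

	layers[nr] = top;
	return nr + 1;
}
```

Starting from layers of 1000 and 100 commits, adding 10 commits leaves
three layers (1000, 100, 10), adding 60 instead collapses the top two
into (1000, 160), and adding 600 collapses everything into a single
layer of 1700. Small writes are therefore the common case, and full
rewrites become rarer as the bottom layer grows.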

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will very likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

By using 'git commit-graph verify --shallow' we can ensure that
the file we just wrote is valid. This is an extra safety precaution
that is faster than our 'write' subcommand. In the rare situation
that the newest layer of the commit-graph is corrupt, we can "fix"
the corruption by deleting the commit-graph-chain file and rewriting
the full commit-graph as a new one-layer commit graph. This does
not completely prevent _that_ file from being corrupt, but it does
recompute the commit-graph by parsing commits from the object
database. In our use of this step in Scalar and VFS for Git, we
have only seen this issue arise because our microsoft/git fork
reverted 43d3561 ("commit-graph write: don't die if the existing
graph is corrupt" 2019-03-25) for a while to keep commit-graph
writes very fast. We dropped the revert when updating to v2.23.0.
The verify still has potential for catching corrupt data across
the layer boundary: if the new file has commit X with parent Y
in an old file but the commit ID for Y in the old file had a
bitswap, then we will notice that in the 'verify' command.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 18 +++++++++
 builtin/gc.c                      | 63 +++++++++++++++++++++++++++++++
 commit-graph.c                    |  8 ++--
 commit-graph.h                    |  1 +
 t/t7900-maintenance.sh            |  2 +
 5 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 089fa4cedc..35b0be7d40 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -35,6 +35,24 @@ run::
 TASKS
 -----
 
+commit-graph::
+	The `commit-graph` job updates the `commit-graph` files incrementally,
+	then verifies that the written data is correct. If the new layer has an
+	issue, then the chain file is removed and the `commit-graph` is
+	rewritten from scratch.
++
+The verification only checks the top layer of the `commit-graph` chain.
+If the incremental write merged the new commits with at least one
+existing layer, then there is potential for on-disk corruption being
+carried forward into the new file. This will be noticed and the new
+commit-graph file will be clean as Git reparses the commit data from
+the object database.
++
+The incremental write is safe to run alongside concurrent Git processes
+since it will not expire `.graph` files that were in the previous
+`commit-graph-chain` file. They will be deleted by a later run based on
+the expiration delay.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index 0f15162825..ec1bbc3f9e 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -710,6 +710,64 @@ static struct maintenance_opts {
 	int quiet;
 } opts;
 
+static int run_write_commit_graph(void)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "commit-graph", "write",
+		     "--split", "--reachable", NULL);
+
+	if (opts.quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	return !!run_command(&child);
+}
+
+static int run_verify_commit_graph(void)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "commit-graph", "verify",
+		     "--shallow", NULL);
+
+	if (opts.quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	return !!run_command(&child);
+}
+
+static int maintenance_task_commit_graph(void)
+{
+	struct repository *r = the_repository;
+	char *chain_path;
+
+	close_object_store(r->objects);
+	if (run_write_commit_graph()) {
+		error(_("failed to write commit-graph"));
+		return 1;
+	}
+
+	if (!run_verify_commit_graph())
+		return 0;
+
+	warning(_("commit-graph verify caught error, rewriting"));
+
+	chain_path = get_commit_graph_chain_filename(r->objects->odb);
+	if (unlink(chain_path)) {
+		UNLEAK(chain_path);
+		die(_("failed to remove commit-graph at %s"), chain_path);
+	}
+	free(chain_path);
+
+	if (!run_write_commit_graph())
+		return 0;
+
+	error(_("failed to rewrite commit-graph"));
+	return 1;
+}
+
 static int maintenance_task_gc(void)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
@@ -736,6 +794,7 @@ struct maintenance_task {
 
 enum maintenance_task_label {
 	TASK_GC,
+	TASK_COMMIT_GRAPH,
 
 	/* Leave as final value */
 	TASK__COUNT
@@ -747,6 +806,10 @@ static struct maintenance_task tasks[] = {
 		maintenance_task_gc,
 		1,
 	},
+	[TASK_COMMIT_GRAPH] = {
+		"commit-graph",
+		maintenance_task_commit_graph,
+	},
 };
 
 static int maintenance_run(void)
diff --git a/commit-graph.c b/commit-graph.c
index 1af68c297d..9705d237e4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -172,7 +172,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
 		       oid_hex);
 }
 
-static char *get_chain_filename(struct object_directory *odb)
+char *get_commit_graph_chain_filename(struct object_directory *odb)
 {
 	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
 }
@@ -521,7 +521,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	struct stat st;
 	struct object_id *oids;
 	int i = 0, valid = 1, count;
-	char *chain_name = get_chain_filename(odb);
+	char *chain_name = get_commit_graph_chain_filename(odb);
 	FILE *fp;
 	int stat_res;
 
@@ -1619,7 +1619,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	}
 
 	if (ctx->split) {
-		char *lock_name = get_chain_filename(ctx->odb);
+		char *lock_name = get_commit_graph_chain_filename(ctx->odb);
 
 		hold_lock_file_for_update_mode(&lk, lock_name,
 					       LOCK_DIE_ON_ERROR, 0444);
@@ -1996,7 +1996,7 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	if (ctx->split_opts && ctx->split_opts->expire_time)
 		expire_time = ctx->split_opts->expire_time;
 	if (!ctx->split) {
-		char *chain_file_name = get_chain_filename(ctx->odb);
+		char *chain_file_name = get_commit_graph_chain_filename(ctx->odb);
 		unlink(chain_file_name);
 		free(chain_file_name);
 		ctx->num_commit_graphs_after = 0;
diff --git a/commit-graph.h b/commit-graph.h
index 28f89cdf3e..3c202748c3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -25,6 +25,7 @@ struct commit;
 struct bloom_filter_settings;
 
 char *get_commit_graph_filename(struct object_directory *odb);
+char *get_commit_graph_chain_filename(struct object_directory *odb);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
 
 /*
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index f08eee0977..ff646abf7c 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -4,6 +4,8 @@ test_description='git maintenance builtin'
 
 . ./test-lib.sh
 
+GIT_TEST_COMMIT_GRAPH=0
+
 test_expect_success 'help text' '
 	test_expect_code 129 git maintenance -h 2>err &&
 	test_i18ngrep "usage: git maintenance run" err
-- 
gitgitgadget



* [PATCH v3 06/20] maintenance: add --task option
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2020-07-30 22:24     ` [PATCH v3 05/20] maintenance: add commit-graph task Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 07/20] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
                       ` (16 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A user may want to only run certain maintenance tasks in a certain
order. Add the --task=<task> option, which allows a user to specify an
ordered list of tasks to run. These cannot be run multiple times,
however.

Here is where the order of the tasks array becomes important. We sort
the array in place based on the order in which the tasks were selected,
and match each --task=<task> argument to its task struct with a linear
scan over the task names.

Keep in mind that the 'enabled' member of the maintenance_task struct is
a placeholder for a future 'maintenance.<task>.enabled' config option.
Thus, we use the 'enabled' member to specify which tasks are run when
the user does not specify any --task=<task> arguments. The 'enabled'
member should be ignored if --task=<task> appears.
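
As a rough illustration of the selection logic described above, here is
a minimal, self-contained C sketch. It is not the code in this patch:
'struct task', select_task() and plan_run() are hypothetical names, and
the comparator assumes that tasks should run in ascending command-line
order with unselected tasks sorted last.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct maintenance_task. */
struct task {
	const char *name;
	int enabled;        /* default set, used when no --task is given */
	int selected;       /* set when named by --task=<task> */
	int selected_order; /* 1-based position on the command line */
};

/* qsort() hands the comparator pointers to the array elements. */
static int compare_by_selection(const void *a_, const void *b_)
{
	const struct task *a = a_;
	const struct task *b = b_;

	/* Assumption: unselected tasks sort last; they are skipped anyway. */
	if (a->selected != b->selected)
		return b->selected - a->selected;
	return a->selected_order - b->selected_order;
}

/* Returns 0 on success, -1 for an unknown or repeated task name. */
static int select_task(struct task *tasks, size_t nr, const char *name,
		       int *nr_selected)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (strcmp(tasks[i].name, name))
			continue;
		if (tasks[i].selected)
			return -1;
		tasks[i].selected = 1;
		tasks[i].selected_order = ++*nr_selected;
		return 0;
	}
	return -1;
}

/* Fills 'out' with the task names to run, in order; returns the count. */
static size_t plan_run(struct task *tasks, size_t nr, int nr_selected,
		       const char **out)
{
	size_t i, n = 0;

	assert(tasks && out);
	if (nr_selected)
		qsort(tasks, nr, sizeof(*tasks), compare_by_selection);
	for (i = 0; i < nr; i++) {
		/* 'enabled' only matters when nothing was selected. */
		if (nr_selected ? !tasks[i].selected : !tasks[i].enabled)
			continue;
		out[n++] = tasks[i].name;
	}
	return n;
}
```

Selecting "commit-graph" and then "gc" with select_task() and passing
the resulting count to plan_run() yields the names in that order; with a
count of zero, only tasks whose 'enabled' member is set are returned.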

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt |  4 +++
 builtin/gc.c                      | 59 +++++++++++++++++++++++++++++--
 t/t7900-maintenance.sh            | 23 ++++++++++++
 3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 35b0be7d40..9204762e21 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -73,6 +73,10 @@ OPTIONS
 --quiet::
 	Do not report progress or other information over `stderr`.
 
+--task=<task>::
+	If this option is specified one or more times, then only run the
+	specified tasks in the specified order.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/gc.c b/builtin/gc.c
index ec1bbc3f9e..b7f64891cd 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -708,6 +708,7 @@ static const char * const builtin_maintenance_usage[] = {
 static struct maintenance_opts {
 	int auto_flag;
 	int quiet;
+	int tasks_selected;
 } opts;
 
 static int run_write_commit_graph(void)
@@ -789,7 +790,9 @@ typedef int maintenance_task_fn(void);
 struct maintenance_task {
 	const char *name;
 	maintenance_task_fn *fn;
-	unsigned enabled:1;
+	unsigned enabled:1,
+		 selected:1;
+	int selected_order;
 };
 
 enum maintenance_task_label {
@@ -812,13 +815,29 @@ static struct maintenance_task tasks[] = {
 	},
 };
 
+static int compare_tasks_by_selection(const void *a_, const void *b_)
+{
+	const struct maintenance_task *a, *b;
+
+	a = (const struct maintenance_task *)a_;
+	b = (const struct maintenance_task *)b_;
+
+	return b->selected_order - a->selected_order;
+}
+
 static int maintenance_run(void)
 {
 	int i;
 	int result = 0;
 
+	if (opts.tasks_selected)
+		QSORT(tasks, TASK__COUNT, compare_tasks_by_selection);
+
 	for (i = 0; i < TASK__COUNT; i++) {
-		if (!tasks[i].enabled)
+		if (opts.tasks_selected && !tasks[i].selected)
+			continue;
+
+		if (!opts.tasks_selected && !tasks[i].enabled)
 			continue;
 
 		if (tasks[i].fn()) {
@@ -830,6 +849,39 @@ static int maintenance_run(void)
 	return result;
 }
 
+static int task_option_parse(const struct option *opt,
+			     const char *arg, int unset)
+{
+	int i;
+	struct maintenance_task *task = NULL;
+
+	BUG_ON_OPT_NEG(unset);
+
+	opts.tasks_selected++;
+
+	for (i = 0; i < TASK__COUNT; i++) {
+		if (!strcasecmp(tasks[i].name, arg)) {
+			task = &tasks[i];
+			break;
+		}
+	}
+
+	if (!task) {
+		error(_("'%s' is not a valid task"), arg);
+		return 1;
+	}
+
+	if (task->selected) {
+		error(_("task '%s' cannot be selected multiple times"), arg);
+		return 1;
+	}
+
+	task->selected = 1;
+	task->selected_order = opts.tasks_selected;
+
+	return 0;
+}
+
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_maintenance_options[] = {
@@ -837,6 +889,9 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 			 N_("run tasks based on the state of the repository")),
 		OPT_BOOL(0, "quiet", &opts.quiet,
 			 N_("do not report progress or other information over stderr")),
+		OPT_CALLBACK_F(0, "task", NULL, N_("task"),
+			N_("run a specific task"),
+			PARSE_OPT_NONEG, task_option_parse),
 		OPT_END()
 	};
 
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index ff646abf7c..3cdccb24df 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -20,4 +20,27 @@ test_expect_success 'run [--auto|--quiet]' '
 	grep ",\"gc\",\"--quiet\"" run-quiet.txt
 '
 
+test_expect_success 'run --task=<task>' '
+	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
+	GIT_TRACE2_EVENT="$(pwd)/run-gc.txt" git maintenance run --task=gc &&
+	GIT_TRACE2_EVENT="$(pwd)/run-commit-graph.txt" git maintenance run --task=commit-graph &&
+	GIT_TRACE2_EVENT="$(pwd)/run-both.txt" git maintenance run --task=commit-graph --task=gc &&
+	! grep ",\"gc\"" run-commit-graph.txt  &&
+	grep ",\"gc\"" run-gc.txt  &&
+	grep ",\"gc\"" run-both.txt  &&
+	grep ",\"commit-graph\",\"write\"" run-commit-graph.txt  &&
+	! grep ",\"commit-graph\",\"write\"" run-gc.txt  &&
+	grep ",\"commit-graph\",\"write\"" run-both.txt
+'
+
+test_expect_success 'run --task=bogus' '
+	test_must_fail git maintenance run --task=bogus 2>err &&
+	test_i18ngrep "is not a valid task" err
+'
+
+test_expect_success 'run --task duplicate' '
+	test_must_fail git maintenance run --task=gc --task=gc 2>err &&
+	test_i18ngrep "cannot be selected multiple times" err
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v3 07/20] maintenance: take a lock on the objects directory
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2020-07-30 22:24     ` [PATCH v3 06/20] maintenance: add --task option Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 08/20] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
                       ` (15 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Performing maintenance on a Git repository involves writing data to the
.git directory, which is not safe to do with multiple writers attempting
the same operation. Ensure that only one 'git maintenance' process is
running at a time by holding a file-based lock. The mere presence of
the .git/objects/maintenance.lock file will prevent future maintenance.
This lock is never committed, since it does not represent meaningful
data. Instead, it is only a placeholder.

If the lock file already exists, then skip maintenance and return
success, reporting the existing lock only when neither --auto nor
--quiet was given. This will become very important later when we
implement the 'fetch' task, as this is our stop-gap against creating a
recursive process loop between 'git fetch' and 'git maintenance run'.
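
The idea of a lock whose mere presence blocks other runs can be sketched
in plain POSIX C. This is only an illustration under stated assumptions:
try_lock() and release_lock() are hypothetical helpers, not Git's
lockfile API, and the sketch ignores signal handling and cleanup on
abnormal exit.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create "<path>.lock" exclusively.
 * Returns 0 on success and stores the lock file name in 'lock';
 * returns -1 if the file cannot be created, for example because
 * another process already holds the lock.
 */
static int try_lock(const char *path, char *lock, size_t len)
{
	int fd, n;

	assert(path && lock);
	n = snprintf(lock, len, "%s.lock", path);
	if (n < 0 || (size_t)n >= len)
		return -1;

	/* O_EXCL makes creation fail if the file already exists. */
	fd = open(lock, O_WRONLY | O_CREAT | O_EXCL, 0666);
	if (fd < 0)
		return -1;
	close(fd);
	return 0;
}

/* Release without committing: the lock file carries no data. */
static void release_lock(const char *lock)
{
	unlink(lock);
}
```

A caller would attempt try_lock() before running any task, skip the run
if it returns -1, and call release_lock() once all tasks have finished.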

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/gc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index b7f64891cd..b57bc7b0ff 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -829,6 +829,25 @@ static int maintenance_run(void)
 {
 	int i;
 	int result = 0;
+	struct lock_file lk;
+	struct repository *r = the_repository;
+	char *lock_path = xstrfmt("%s/maintenance", r->objects->odb->path);
+
+	if (hold_lock_file_for_update(&lk, lock_path, LOCK_NO_DEREF) < 0) {
+		/*
+		 * Another maintenance command is running.
+		 *
+		 * If --auto was provided, then it is likely due to a
+		 * recursive process stack. Do not report an error in
+		 * that case.
+		 */
+		if (!opts.auto_flag && !opts.quiet)
+			error(_("lock file '%s' exists, skipping maintenance"),
+			      lock_path);
+		free(lock_path);
+		return 0;
+	}
+	free(lock_path);
 
 	if (opts.tasks_selected)
 		QSORT(tasks, TASK__COUNT, compare_tasks_by_selection);
@@ -846,6 +865,7 @@ static int maintenance_run(void)
 		}
 	}
 
+	rollback_lock_file(&lk);
 	return result;
 }
 
-- 
gitgitgadget



* [PATCH v3 08/20] fetch: optionally allow disabling FETCH_HEAD update
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2020-07-30 22:24     ` [PATCH v3 07/20] maintenance: take a lock on the objects directory Derrick Stolee via GitGitGadget
@ 2020-07-30 22:24     ` Junio C Hamano via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 09/20] maintenance: add prefetch task Derrick Stolee via GitGitGadget
                       ` (14 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Junio C Hamano via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Junio C Hamano

From: Junio C Hamano <gitster@pobox.com>

If you run fetch but record the result in remote-tracking branches,
and either if you do nothing with the fetched refs (e.g. you are
merely mirroring) or if you always work from the remote-tracking
refs (e.g. you fetch and then merge origin/branchname separately),
you can get away with having no FETCH_HEAD at all.

Teach "git fetch" a command line option "--[no-]write-fetch-head"
and a "fetch.writeFetchHEAD" configuration variable.  Without either,
the default is to write FETCH_HEAD, and the usual rule that the
command line option defeats the configured default applies.

Note that under "--dry-run" mode, FETCH_HEAD is never written;
otherwise you'd see a list of objects in the file that you do not
actually have.  Passing `--write-fetch-head` does not force `git
fetch` to write the file.
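
The precedence described above (built-in default, then configuration,
then command line, with --dry-run overriding everything) can be
summarized by a small C sketch. The function name and the use of -1 for
"not given" are assumptions made for illustration; the patch itself
tracks this state in a single write_fetch_head variable.

```c
#include <assert.h>

/*
 * Hypothetical helper. 'config' and 'cli' are -1 when not given,
 * otherwise 0 (false) or 1 (true). Returns 1 if FETCH_HEAD would
 * be written, 0 otherwise.
 */
static int effective_write_fetch_head(int config, int cli, int dry_run)
{
	int value = 1; /* built-in default: write FETCH_HEAD */

	assert(config >= -1 && config <= 1);
	assert(cli >= -1 && cli <= 1);

	if (config >= 0)
		value = config; /* fetch.writeFetchHEAD */
	if (cli >= 0)
		value = cli;    /* --[no-]write-fetch-head wins over config */
	if (dry_run)
		value = 0;      /* --dry-run never writes the file */
	return value;
}
```

For instance, a false configuration value combined with an explicit
--write-fetch-head results in the file being written, while adding
--dry-run to any combination results in it not being written.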

Also note that this option is explicitly passed when "git pull"
internally invokes "git fetch", so that those who configured their
"git fetch" not to write FETCH_HEAD would not be able to break the
cooperation between these two commands.  "git pull" must see what
"git fetch" got recorded in FETCH_HEAD to work correctly.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/fetch.txt  |  7 ++++++
 Documentation/fetch-options.txt | 10 +++++++++
 builtin/fetch.c                 | 19 +++++++++++++---
 builtin/pull.c                  |  3 ++-
 t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
 t/t5521-pull-options.sh         | 16 ++++++++++++++
 6 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
index b20394038d..0aaa05e8c0 100644
--- a/Documentation/config/fetch.txt
+++ b/Documentation/config/fetch.txt
@@ -91,3 +91,10 @@ fetch.writeCommitGraph::
 	merge and the write may take longer. Having an updated commit-graph
 	file helps performance of many Git commands, including `git merge-base`,
 	`git push -f`, and `git log --graph`. Defaults to false.
+
+fetch.writeFetchHEAD::
+	Setting it to false tells `git fetch` not to write the list
+	of remote refs fetched in the `FETCH_HEAD` file directly
+	under `$GIT_DIR`.  Can be countermanded from the command
+	line with the `--[no-]write-fetch-head` option.  Defaults to
+	true.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 495bc8ab5a..6972ad2522 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
 --dry-run::
 	Show what would be done, without making any changes.
 
+ifndef::git-pull[]
+--[no-]write-fetch-head::
+	Write the list of remote refs fetched in the `FETCH_HEAD`
+	file directly under `$GIT_DIR`.  This is the default unless
+	the configuration variable `fetch.writeFetchHEAD` is set to
+	false.  Passing `--no-write-fetch-head` from the command
+	line tells Git not to write the file.  Under the `--dry-run`
+	option, the file is never written.
+endif::git-pull[]
+
 -f::
 --force::
 	When 'git fetch' is used with `<src>:<dst>` refspec it may
diff --git a/builtin/fetch.c b/builtin/fetch.c
index c7c8ac0861..30ac57dcf6 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok;
+static int write_fetch_head = 1;
 static int verbosity, deepen_relative, set_upstream;
 static int progress = -1;
 static int enable_auto_gc = 1;
@@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.writefetchhead")) {
+		write_fetch_head = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
 		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
+	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
+		 N_("write fetched references to the FETCH_HEAD file")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
 	OPT_BOOL('u', "update-head-ok", &update_head_ok,
 		    N_("allow updating of HEAD ref")),
@@ -895,7 +902,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	const char *what, *kind;
 	struct ref *rm;
 	char *url;
-	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
+	const char *filename = (!write_fetch_head
+				? "/dev/null"
+				: git_path_fetch_head(the_repository));
 	int want_status;
 	int summary_width = transport_summary_width(ref_map);
 
@@ -1329,7 +1338,7 @@ static int do_fetch(struct transport *transport,
 	}
 
 	/* if not appending, truncate FETCH_HEAD */
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		retcode = truncate_fetch_head();
 		if (retcode)
 			goto cleanup;
@@ -1596,7 +1605,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
 	int i, result = 0;
 	struct strvec argv = STRVEC_INIT;
 
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		int errcode = truncate_fetch_head();
 		if (errcode)
 			return errcode;
@@ -1797,6 +1806,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (depth || deepen_since || deepen_not.nr)
 		deepen = 1;
 
+	/* FETCH_HEAD never gets updated in --dry-run mode */
+	if (dry_run)
+		write_fetch_head = 0;
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/builtin/pull.c b/builtin/pull.c
index 858b492af3..4c66db1468 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
 	struct strvec args = STRVEC_INIT;
 	int ret;
 
-	strvec_pushl(&args, "fetch", "--update-head-ok", NULL);
+	strvec_pushl(&args, "fetch", "--update-head-ok",
+		     "--write-fetch-head", NULL);
 
 	/* Shared options */
 	argv_push_verbosity(&args);
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index 9850ecde5d..31c91d0ed2 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
 
 '
 
-test_expect_success 'fetch --dry-run' '
-
+test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
 	rm -f .git/FETCH_HEAD &&
 	git fetch --dry-run . &&
 	! test -f .git/FETCH_HEAD
 '
 
+test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success '--write-fetch-head gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --dry-run --write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
+	test -f .git/FETCH_HEAD
+'
+
 test_expect_success "should be able to fetch with duplicate refspecs" '
 	mkdir dups &&
 	(
diff --git a/t/t5521-pull-options.sh b/t/t5521-pull-options.sh
index 159afa7ac8..1acae3b9a4 100755
--- a/t/t5521-pull-options.sh
+++ b/t/t5521-pull-options.sh
@@ -77,6 +77,7 @@ test_expect_success 'git pull -q -v --no-rebase' '
 	test_must_be_empty out &&
 	test -s err)
 '
+
 test_expect_success 'git pull --cleanup errors early on invalid argument' '
 	mkdir clonedcleanup &&
 	(cd clonedcleanup && git init &&
@@ -85,6 +86,21 @@ test_expect_success 'git pull --cleanup errors early on invalid argument' '
 	test -s err)
 '
 
+test_expect_success 'git pull --no-write-fetch-head fails' '
+	mkdir clonedwfh &&
+	(cd clonedwfh && git init &&
+	test_must_fail git pull --no-write-fetch-head "../parent" >out 2>err &&
+	test_must_be_empty out &&
+	test_i18ngrep "no-write-fetch-head" err)
+'
+
+test_expect_success 'git pull succeeds with fetch.writeFetchHEAD=false' '
+	mkdir clonedwfhconfig &&
+	(cd clonedwfhconfig && git init &&
+	git config fetch.writeFetchHEAD false &&
+	git pull "../parent" >out 2>err &&
+	grep FETCH_HEAD err)
+'
 
 test_expect_success 'git pull --force' '
 	mkdir clonedoldstyle &&
-- 
gitgitgadget



* [PATCH v3 09/20] maintenance: add prefetch task
  2020-07-30 22:24   ` [PATCH v3 00/20] " Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2020-07-30 22:24     ` [PATCH v3 08/20] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
@ 2020-07-30 22:24     ` Derrick Stolee via GitGitGadget
  2020-07-30 22:24     ` [PATCH v3 10/20] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
                       ` (13 subsequent siblings)
  22 siblings, 0 replies; 164+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-30 22:24 UTC (permalink / raw)
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, emilyshaffer, sluongng,
	jonathantanmy, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

The task is called "prefetch" because it is work done in advance
of a foreground fetch to make that 'git fetch' command much faster.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.
   While this looks confusing, this was documented and tested by
   b40a50264ac (fetch: document and test --refmap="", 2020-01-21),
   including this sentence in the documentation:

	Providing an empty `<refspec>` to the `--refmap` option
	causes Git to ignore the configured refspecs and rely
	entirely on the refspecs supplied as command-line arguments.

3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/prefetch/<remote> refs that no
   longer appear on the remote.

5. --no-write-fetch-head prevents updating FETCH_HEAD.
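
Put together, the options listed above form one 'git fetch' argument
list per remote. The following C sketch assembles that list for a given
remote name. It is an illustration only: prefetch_args() and
PREFETCH_ARGC are hypothetical names, and the sketch builds the argument
vector without running any command.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PREFETCH_ARGC 7

/*
 * Hypothetical helper: fill 'argv' (PREFETCH_ARGC entries plus a NULL
 * terminator) with the arguments for a background fetch of 'remote'.
 * 'refspec' is caller-provided storage for the generated refspec.
 * Returns 0 on success, -1 if 'refspec' is too small.
 */
static int prefetch_args(const char *remote, char *refspec, size_t len,
			 const char *argv[PREFETCH_ARGC + 1])
{
	int n;

	assert(remote && refspec && argv);
	n = snprintf(refspec, len, "+refs/heads/*:refs/prefetch/%s/*", remote);
	if (n < 0 || (size_t)n >= len)
		return -1;

	argv[0] = "fetch";
	argv[1] = remote;
	argv[2] = "--prune";               /* drop stale refs/prefetch refs */
	argv[3] = "--no-tags";             /* leave tags to foreground fetches */
	argv[4] = "--no-write-fetch-head"; /* do not touch FETCH_HEAD */
	argv[5] = "--refmap=";             /* ignore the configured refspec */
	argv[6] = refspec;                 /* store refs under refs/prefetch/ */
	argv[7] = NULL;
	return 0;
}
```

For a remote named "origin", the last argument becomes
"+refs/heads/*:refs/prefetch/origin/*", so neither refs/heads nor
refs/remotes is updated by the background fetch.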

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paired with --no-show-forced-updates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 15 +++++++++
 builtin/gc.c                      | 52 +++++++++++++++++++++++++++++++
 t/t7900-maintenance.sh            | 24 ++++++++++++++
 3 files changed, 91 insertions(+)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 9204762e21..d134192fa8 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -53,6 +53,21 @@ since it will not expire `.graph` files that were in the previous
 `commit-graph-chain` file. They will be deleted by a later run based on
 the expiration delay.
 
+prefetch::
+	The `prefetch` task updates the object directory with the latest
+	objects from all registered remotes. For each remote, a `git fetch`
+	command is run. The refmap is custom to avoid updating local or remote
+	branches (those in `refs/heads` or `refs/remotes`). Instead, the
+	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
+	not updated.
++
+This is done to avoid disrupting the remote-tracking branches. The end users
+expect these refs to stay unmoved unless they initiate a fetch.  With the
+prefetch task, however, the objects necessary to complete a later real fetch
+would already be obtained, so the real fetch would go faster.  In the ideal
+case, it will just become an update to a bunch of remote-tracking branches
+without any object transfer.
+
 gc::
 	Cleanup unnecessary files and optimize the local repository. "GC"
 	stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index b57bc7b0ff..1f20428286 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -28,6 +28,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "promisor-remote.h"
+#include "remote.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -769,6 +770,52 @@ static int maintenance_task_commit_graph(void)
 	return 1;
 }
 
+static int fetch_remote(const char *remote)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
+		     "--no-write-fetch-head", "--refmap=", NULL);
+
+	st