git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-04-05 15:14   ` Phillip Wood
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
  To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foregroudn 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.

3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/hidden/<remote> refs that no
   longer appear on the remote.

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paried with --no-show-forced-udpates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

RFC QUESTIONS:

1. One downside of the refs/hidden pattern is that 'git log' will
   decorate commits with twice as many refs if they appear at a
   remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
   there an easy way to exclude a refspace from decorations? Should
   we make refs/hidden/* a "special" refspace that is excluded from
   decorations?

2. This feature is designed for a desktop machine or equivalent
   that has a permanent wired network connection, and the machine
   stays on while the user is not present. For things like laptops
   with inconsistent WiFi connections (that may be metered) the
   feature can use the less stable connection more than the user
   wants. Of course, this feature is opt-in for Git, but in Scalar
   we have a "scalar pause" command [2] that pauses all maintenance
   for some amount of time. We should consider a similar mechanism
   for Git, but for the point of this series the user needs to set
   up the "background" part of these jobs manually.

[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
[3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 13 ++++-
 builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
 t/t7900-run-job.sh            | 22 +++++++++
 3 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 8bf2762d650..eb92e891915 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job commit-graph'
+'git run-job (commit-graph|fetch)'
 
 
 DESCRIPTION
@@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
 `commit-graph-chain` file. They will be deleted by a later run based on
 the expiration delay.
 
+'fetch'::
+
+The `fetch` job updates the object directory with the latest objects
+from all registered remotes. For each remote, a `git fetch` command is
+run. The refmap is custom to avoid updating local or remote branches
+(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
+stored in `refs/hidden/<remote>/`. Also, no tags are updated.
++
+This means that foreground fetches are still required to update the
+remote refs, but the users is notified when the branches and tags are
+updated on the remote.
 
 GIT
 ---
diff --git a/builtin/run-job.c b/builtin/run-job.c
index dd7709952d3..e59056b2918 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -7,7 +7,7 @@
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job commit-graph"),
+	N_("git run-job (commit-graph|fetch)"),
 	NULL
 };
 
@@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
 	return 1;
 }
 
+static int fetch_remote(const char *remote)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf refmap = STRBUF_INIT;
+
+	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
+			 "--no-tags", "--refmap=", NULL);
+
+	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
+	argv_array_push(&cmd, refmap.buf);
+
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	strbuf_release(&refmap);
+	return result;
+}
+
+static int fill_remotes(struct string_list *remotes)
+{
+	int result = 0;
+	FILE *proc_out;
+	struct strbuf line = STRBUF_INIT;
+	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
+
+	child_process_init(remote_proc);
+
+	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
+
+	remote_proc->out = -1;
+
+	if (start_command(remote_proc)) {
+		error(_("failed to start 'git remote' process"));
+		result = 1;
+		goto cleanup;
+	}
+
+	proc_out = xfdopen(remote_proc->out, "r");
+
+	/* if there is no line, leave the value as given */
+	while (!strbuf_getline(&line, proc_out))
+		string_list_append(remotes, line.buf);
+
+	strbuf_release(&line);
+
+	fclose(proc_out);
+
+	if (finish_command(remote_proc)) {
+		error(_("failed to finish 'git remote' process"));
+		result = 1;
+	}
+
+cleanup:
+	free(remote_proc);
+	return result;
+}
+
+static int run_fetch_job(void)
+{
+	int result = 0;
+	struct string_list_item *item;
+	struct string_list remotes = STRING_LIST_INIT_DUP;
+
+	if (fill_remotes(&remotes)) {
+		error(_("failed to fill remotes"));
+		result = 1;
+		goto cleanup;
+	}
+
+	/*
+	 * Do not modify the result based on the success of the 'fetch'
+	 * operation, as a loss of network could cause 'fetch' to fail
+	 * quickly. We do not want that to stop the rest of our
+	 * background operations.
+	 */
+	for (item = remotes.items;
+	     item && item < remotes.items + remotes.nr;
+	     item++)
+		fetch_remote(item->string);
+
+cleanup:
+	string_list_clear(&remotes, 0);
+	return result;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 	if (argc > 0) {
 		if (!strcmp(argv[0], "commit-graph"))
 			return run_commit_graph_job();
+		if (!strcmp(argv[0], "fetch"))
+			return run_fetch_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 18b9bd26b3a..d3faeba135b 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
 	)
 '
 
+test_expect_success 'fetch job' '
+	git clone "file://$(pwd)/server" client &&
+
+	# Before fetching, build a client commit-graph
+	git -C client run-job commit-graph &&
+	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
+	test_line_count = 1 $chain &&
+
+	git -C client branch -v --remotes >before-refs &&
+	test_commit -C server 24 &&
+
+	git -C client run-job fetch &&
+	git -C client branch -v --remotes >after-refs &&
+	test_cmp before-refs after-refs &&
+	test_cmp server/.git/refs/heads/master \
+		 client/.git/refs/hidden/origin/master &&
+
+	# the hidden ref should trigger a new layer in the commit-graph
+	git -C client run-job commit-graph &&
+	test_line_count = 2 $chain
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
@ 2020-04-05 15:14   ` Phillip Wood
  2020-04-06 12:48     ` Derrick Stolee
  2020-04-05 20:28   ` Junio C Hamano
  2020-05-20 19:08   ` Josh Steadmon
  2 siblings, 1 reply; 8+ messages in thread
From: Phillip Wood @ 2020-04-05 15:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, stolee, Derrick Stolee

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
> 
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>     the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>     refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>     we can ensure that we actually load the new values somewhere in
>     our refspace while not updating refs/heads or refs/remotes. By
>     storing these refs here, the commit-graph job will update the
>     commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>     longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> RFC QUESTIONS:
> 
> 1. One downside of the refs/hidden pattern is that 'git log' will
>     decorate commits with twice as many refs if they appear at a
>     remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>     there an easy way to exclude a refspace from decorations? Should
>     we make refs/hidden/* a "special" refspace that is excluded from
>     decorations?

Having some way to specify which refs outside of 
refs/{heads,remote,tags}/ to show or exclude from decorations would be 
useful I think. Fetching to a hidden ref is a good idea (as are the 
other steps you outline above) but as you say we don't want it to show 
up in the output of 'git log' etc.

> 2. This feature is designed for a desktop machine or equivalent
>     that has a permanent wired network connection, and the machine
>     stays on while the user is not present. For things like laptops
>     with inconsistent WiFi connections (that may be metered) the
>     feature can use the less stable connection more than the user
>     wants. Of course, this feature is opt-in for Git, but in Scalar
>     we have a "scalar pause" command [2] that pauses all maintenance
>     for some amount of time. We should consider a similar mechanism
>     for Git, but for the point of this series the user needs to set
>     up the "background" part of these jobs manually.
> 
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/git-run-job.txt | 13 ++++-
>   builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
>   t/t7900-run-job.sh            | 22 +++++++++
>   3 files changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>   SYNOPSIS
>   --------
>   [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>   
>   
>   DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
>   `commit-graph-chain` file. They will be deleted by a later run based on
>   the expiration delay.
>   
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
>   
>   GIT
>   ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
>   #include "run-command.h"
>   
>   static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job commit-graph"),
> +	N_("git run-job (commit-graph|fetch)"),
>   	NULL
>   };
>   
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
>   	return 1;
>   }
>   
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)

Isn't there a easy way to get this using the config api rather than 
forking 'git remote'?

Best Wishes

Phillip

> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
> +
> +	child_process_init(remote_proc);
> +
> +	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> +	remote_proc->out = -1;
> +
> +	if (start_command(remote_proc)) {
> +		error(_("failed to start 'git remote' process"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(remote_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	while (!strbuf_getline(&line, proc_out))
> +		string_list_append(remotes, line.buf);
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	if (finish_command(remote_proc)) {
> +		error(_("failed to finish 'git remote' process"));
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(remote_proc);
> +	return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (fill_remotes(&remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>   int cmd_run_job(int argc, const char **argv, const char *prefix)
>   {
>   	static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>   	if (argc > 0) {
>   		if (!strcmp(argv[0], "commit-graph"))
>   			return run_commit_graph_job();
> +		if (!strcmp(argv[0], "fetch"))
> +			return run_fetch_job();
>   	}
>   
>   	usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
>   	)
>   '
>   
> +test_expect_success 'fetch job' '
> +	git clone "file://$(pwd)/server" client &&
> +
> +	# Before fetching, build a client commit-graph
> +	git -C client run-job commit-graph &&
> +	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> +	test_line_count = 1 $chain &&
> +
> +	git -C client branch -v --remotes >before-refs &&
> +	test_commit -C server 24 &&
> +
> +	git -C client run-job fetch &&
> +	git -C client branch -v --remotes >after-refs &&
> +	test_cmp before-refs after-refs &&
> +	test_cmp server/.git/refs/heads/master \
> +		 client/.git/refs/hidden/origin/master &&
> +
> +	# the hidden ref should trigger a new layer in the commit-graph
> +	git -C client run-job commit-graph &&
> +	test_line_count = 2 $chain
> +'
> +
>   test_done
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
  2020-04-05 15:14   ` Phillip Wood
@ 2020-04-05 20:28   ` Junio C Hamano
  2020-04-06 12:46     ` Derrick Stolee
  2020-05-20 19:08   ` Josh Steadmon
  2 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2020-04-05 20:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations?

I do not think there is, but it makes sense to teach the decoration
machinery to either use only certain refs hierarchies or use all
hierarchies except for certain ones; if we want to make sure we
won't break existing workflows, we should by defautlt use all the
refs we currently use and nothing else, but over time we probably
would want to migrate the default to cover only the local and
remote-tracking branches and tags (and at that point, refs/hidden
would be outside the decoration source).

By the way, I have a moderately strong opinion against the use of
"refs/hidden" for the purpose of "prefetch in advance, without
disrupting refs/remotes".  There may be more than one reason why
some refs want to be "hidden", and depending on the purose, the
exact refs that one workflow (e.g. "decorate") wants to hide may be
the ones that want to be exposed.

If we rename it to "refs/prefetch/", would it make the purpose of
the hierarchy clearer without squatting on a vague (because it does
not tell why it is hidden) name "hidden" that other people might
want to use to hide their stuff for different reasons?

> Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?

See above.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-05 20:28   ` Junio C Hamano
@ 2020-04-06 12:46     ` Derrick Stolee
  0 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2020-04-06 12:46 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, Derrick Stolee

On 4/5/2020 4:28 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>    decorate commits with twice as many refs if they appear at a
>>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>    there an easy way to exclude a refspace from decorations?
> 
> I do not think there is, but it makes sense to teach the decoration
> machinery to either use only certain refs hierarchies or use all
> hierarchies except for certain ones; if we want to make sure we
> won't break existing workflows, we should by defautlt use all the
> refs we currently use and nothing else, but over time we probably
> would want to migrate the default to cover only the local and
> remote-tracking branches and tags (and at that point, refs/hidden
> would be outside the decoration source).

I'll see what I can do about adding config to remove some refs from
decorations. That is immediately useful for Scalar users, too.

> By the way, I have a moderately strong opinion against the use of
> "refs/hidden" for the purpose of "prefetch in advance, without
> disrupting refs/remotes".  There may be more than one reason why
> some refs want to be "hidden", and depending on the purose, the
> exact refs that one workflow (e.g. "decorate") wants to hide may be
> the ones that want to be exposed.
> 
> If we rename it to "refs/prefetch/", would it make the purpose of
> the hierarchy clearer without squatting on a vague (because it does
> not tell why it is hidden) name "hidden" that other people might
> want to use to hide their stuff for different reasons?

I like "refs/prefetch". That's more descriptive.

>> Should
>>    we make refs/hidden/* a "special" refspace that is excluded from
>>    decorations?
> 
> See above.
Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-05 15:14   ` Phillip Wood
@ 2020-04-06 12:48     ` Derrick Stolee
  0 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2020-04-06 12:48 UTC (permalink / raw)
  To: phillip.wood, Derrick Stolee via GitGitGadget, git
  Cc: peff, jrnieder, Derrick Stolee

On 4/5/2020 11:14 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>> RFC QUESTIONS:
>>
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>     decorate commits with twice as many refs if they appear at a
>>     remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>     there an easy way to exclude a refspace from decorations? Should
>>     we make refs/hidden/* a "special" refspace that is excluded from
>>     decorations?
> 
> Having some way to specify which refs outside of refs/{heads,remote,tags}/ to show or exclude from decorations would be useful I think. Fetching to a hidden ref is a good idea (as are the other steps you outline above) but as you say we don't want it to show up in the output of 'git log' etc.

I'll work on this first. It seems less controversial.

>> +static int fill_remotes(struct string_list *remotes)
> 
> Isn't there a easy way to get this using the config api rather than forking 'git remote'?

You're right. I should have used the config here. I've been overly
biased to how we do it in the C# layer of Scalar _plus_ how I wanted
the job-runner to run "outside" a Git repository. This could be much
simpler with the config API.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
@ 2020-04-13 13:15 Son Luong Ngoc
  2020-04-13 14:14 ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Son Luong Ngoc @ 2020-04-13 13:15 UTC (permalink / raw)
  To: gitgitgadget; +Cc: dstolee, git, jrnieder, peff, stolee

Hi Derrick,

First of all, thanks a ton for upstreaming this.
Despite multiple complaints about re-implementing cron in git,
I see this as a huge improvement to git UX and it is very much welcome change.

> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
Ideally I think we want to let user configure which refs they want to
prefetch with the default behavior being prefecting all HEADS
available from remote.
Using Facebook's Mercurial extension
[RemoteFileLog](https://www.mercurial-scm.org/repo/hg/file/tip/hgext/remotefilelog/__init__.py#l31)
as a UX reference,
users should only prefetch the refs that they actually care about.

> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations? Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?
In git-log, there is
[--decorate-refs-exclude](https://git-scm.com/docs/git-log#Documentation/git-log.txt---decorate-refs-excludeltpatterngt)
which I think we can move into git-config as
`log.decorate-refs-exclude`?
If you let the `prefetch refs` be configurable as I suggested above, I
think it make sense to have the git-log exclusions being configurable
as well.

Cheers,
Son Luong.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-13 13:15 [PATCH 03/15] run-job: implement fetch job Son Luong Ngoc
@ 2020-04-13 14:14 ` Derrick Stolee
  0 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2020-04-13 14:14 UTC (permalink / raw)
  To: Son Luong Ngoc, gitgitgadget; +Cc: dstolee, git, jrnieder, peff

On 4/13/2020 9:15 AM, Son Luong Ngoc wrote:
> Hi Derrick,
> 
> First of all, thanks a ton for upstreaming this.
> Despite multiple complaints about re-implementing cron in git,
> I see this as a huge improvement to git UX and it is very much welcome change.
> 
>> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>>    we can ensure that we actually load the new values somewhere in
>>    our refspace while not updating refs/heads or refs/remotes. By
>>    storing these refs here, the commit-graph job will update the
>>    commit-graph with the commits from these hidden refs.
> Ideally I think we want to let user configure which refs they want to
> prefetch with the default behavior being prefecting all HEADS
> available from remote.
> Using Facebook's Mercurial extension
> [RemoteFileLog](https://www.mercurial-scm.org/repo/hg/file/tip/hgext/remotefilelog/__init__.py#l31)
> as a UX reference,
> users should only prefetch the refs that they actually care about.

I will be sure to review this prior art before
submitting an update to this series.

>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>    decorate commits with twice as many refs if they appear at a
>>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>    there an easy way to exclude a refspace from decorations? Should
>>    we make refs/hidden/* a "special" refspace that is excluded from
>>    decorations?
> In git-log, there is
> [--decorate-refs-exclude](https://git-scm.com/docs/git-log#Documentation/git-log.txt---decorate-refs-excludeltpatterngt)
> which I think we can move into git-config as
> `log.decorate-refs-exclude`?
> If you let the `prefetch refs` be configurable as I suggested above, I
> think it make sense to have the git-log exclusions being configurable
> as well.

I was literally just working on exactly this when
your message arrived. I was going to name the config
option "log.decorateExclude".

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 03/15] run-job: implement fetch job
  2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
  2020-04-05 15:14   ` Phillip Wood
  2020-04-05 20:28   ` Junio C Hamano
@ 2020-05-20 19:08   ` Josh Steadmon
  2 siblings, 0 replies; 8+ messages in thread
From: Josh Steadmon @ 2020-05-20 19:08 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, jrnieder, stolee, Derrick Stolee

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.

It might be useful here to clarify the interaction between background
fetching & the expectations around --force-with-lease.


> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>    the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>    refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>    longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> RFC QUESTIONS:
> 
> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations? Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?
> 
> 2. This feature is designed for a desktop machine or equivalent
>    that has a permanent wired network connection, and the machine
>    stays on while the user is not present. For things like laptops
>    with inconsistent WiFi connections (that may be metered) the
>    feature can use the less stable connection more than the user
>    wants. Of course, this feature is opt-in for Git, but in Scalar
>    we have a "scalar pause" command [2] that pauses all maintenance
>    for some amount of time. We should consider a similar mechanism
>    for Git, but for the point of this series the user needs to set
>    up the "background" part of these jobs manually.
> 
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 13 ++++-
>  builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
>  t/t7900-run-job.sh            | 22 +++++++++
>  3 files changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>  
>  
>  DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
>  
>  GIT
>  ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
>  #include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job commit-graph"),
> +	N_("git run-job (commit-graph|fetch)"),
>  	NULL
>  };
>  
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
>  	return 1;
>  }
>  
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)
> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));

Shouldn't remote_proc just go on the stack here rather than getting
malloced?

> +
> +	child_process_init(remote_proc);
> +
> +	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> +	remote_proc->out = -1;
> +
> +	if (start_command(remote_proc)) {
> +		error(_("failed to start 'git remote' process"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(remote_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	while (!strbuf_getline(&line, proc_out))
> +		string_list_append(remotes, line.buf);
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	if (finish_command(remote_proc)) {
> +		error(_("failed to finish 'git remote' process"));
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(remote_proc);
> +	return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (fill_remotes(&remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  	if (argc > 0) {
>  		if (!strcmp(argv[0], "commit-graph"))
>  			return run_commit_graph_job();
> +		if (!strcmp(argv[0], "fetch"))
> +			return run_fetch_job();
>  	}
>  
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
>  	)
>  '
>  
> +test_expect_success 'fetch job' '
> +	git clone "file://$(pwd)/server" client &&
> +
> +	# Before fetching, build a client commit-graph
> +	git -C client run-job commit-graph &&
> +	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> +	test_line_count = 1 $chain &&
> +
> +	git -C client branch -v --remotes >before-refs &&
> +	test_commit -C server 24 &&
> +
> +	git -C client run-job fetch &&
> +	git -C client branch -v --remotes >after-refs &&
> +	test_cmp before-refs after-refs &&
> +	test_cmp server/.git/refs/heads/master \
> +		 client/.git/refs/hidden/origin/master &&
> +
> +	# the hidden ref should trigger a new layer in the commit-graph
> +	git -C client run-job commit-graph &&
> +	test_line_count = 2 $chain
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-05-20 19:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-13 13:15 [PATCH 03/15] run-job: implement fetch job Son Luong Ngoc
2020-04-13 14:14 ` Derrick Stolee
  -- strict thread matches above, loose matches on Subject: below --
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
2020-04-05 15:14   ` Phillip Wood
2020-04-06 12:48     ` Derrick Stolee
2020-04-05 20:28   ` Junio C Hamano
2020-04-06 12:46     ` Derrick Stolee
2020-05-20 19:08   ` Josh Steadmon

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).