From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: sandals@crustytoothpaste.net, steadmon@google.com,
jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com,
phillip.wood123@gmail.com, emilyshaffer@google.com,
sluongng@gmail.com, jonathantanmy@google.com,
Derrick Stolee <derrickstolee@github.com>,
Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v2 2/9] maintenance: add prefetch task
Date: Tue, 18 Aug 2020 14:25:23 +0000
Message-ID: <8779c6c20d7e25e13189074dbd57a86b49ec56e9.1597760730.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.696.v2.git.1597760730.gitgitgadget@gmail.com>
From: Derrick Stolee <dstolee@microsoft.com>
When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.
Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.
The task is called "prefetch" because it is work done in advance
of a foreground fetch to make that 'git fetch' command much faster.
However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.
When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:
1. --no-tags prevents getting a new tag when a user wants to see
the new tags appear in their foreground fetches.
2. --refmap= removes the configured refspec which usually updates
refs/remotes/<remote>/* with the refs advertised by the remote.
While this looks confusing, this was documented and tested by
b40a50264ac (fetch: document and test --refmap="", 2020-01-21),
including this sentence in the documentation:
Providing an empty `<refspec>` to the `--refmap` option
causes Git to ignore the configured refspecs and rely
entirely on the refspecs supplied as command-line arguments.
3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
we can ensure that we actually load the new values somewhere in
our refspace while not updating refs/heads or refs/remotes. By
storing these refs here, the commit-graph job will update the
commit-graph with the commits from these hidden refs.
4. --prune will delete the refs/prefetch/<remote> refs that no
longer appear on the remote.
5. --no-write-fetch-head prevents updating FETCH_HEAD.
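As an illustration only, the options above combine into a single command
line per remote. The sketch below prints that command rather than running
it. The helper name prefetch_command is hypothetical and not part of Git;
the argument order and the extra --recurse-submodules=no option are taken
from fetch_remote() in the patch below.

```shell
#!/bin/sh
# Sketch: print the 'git fetch' invocation that a prefetch would use
# for one remote. prefetch_command is a hypothetical helper for
# illustration; it does not contact any remote.
prefetch_command () {
	remote=$1
	# The refspec is passed as a quoted argument so the shell does
	# not try to expand the '*' characters as globs.
	printf 'git fetch %s --prune --no-tags --no-write-fetch-head --recurse-submodules=no --refmap= %s\n' \
		"$remote" "+refs/heads/*:refs/prefetch/$remote/*"
}

prefetch_command origin
```

For a remote named "origin", this prints:

git fetch origin --prune --no-tags --no-write-fetch-head --recurse-submodules=no --refmap= +refs/heads/*:refs/prefetch/origin/*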
We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paired with --no-show-forced-updates (through
fetch.showForcedUpdates=false).
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/git-maintenance.txt | 15 +++++++++
builtin/gc.c | 51 +++++++++++++++++++++++++++++++
t/t7900-maintenance.sh | 26 ++++++++++++++++
3 files changed, 92 insertions(+)
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 9af08c644f..e82799ccff 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -50,6 +50,21 @@ since it will not expire `.graph` files that were in the previous
`commit-graph-chain` file. They will be deleted by a later run based on
the expiration delay.
+prefetch::
+ The `prefetch` task updates the object directory with the latest
+ objects from all registered remotes. For each remote, a `git fetch`
+ command is run. The refmap is custom to avoid updating local or remote
+ branches (those in `refs/heads` or `refs/remotes`). Instead, the
+ remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
+ not updated.
++
+This is done to avoid disrupting the remote-tracking branches. The end users
+expect these refs to stay unmoved unless they initiate a fetch. With the
+prefetch task, however, the objects necessary to complete a later real fetch
+would already be obtained, so the real fetch would go faster. In the ideal
+case, it will just become an update to a bunch of remote-tracking branches
+without any object transfer.
+
gc::
Clean up unnecessary files and optimize the local repository. "GC"
stands for "garbage collection," but this task performs many
diff --git a/builtin/gc.c b/builtin/gc.c
index 3fdb08655c..2ac08cc740 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -29,6 +29,7 @@
#include "tree.h"
#include "promisor-remote.h"
#include "refs.h"
+#include "remote.h"
#define FAILED_RUN "failed to run %s"
@@ -843,6 +844,51 @@ static int maintenance_task_commit_graph(struct maintenance_opts *opts)
return 1;
}
+static int fetch_remote(const char *remote, struct maintenance_opts *opts)
+{
+ struct child_process child = CHILD_PROCESS_INIT;
+
+ child.git_cmd = 1;
+ strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
+ "--no-write-fetch-head", "--recurse-submodules=no",
+ "--refmap=", NULL);
+
+ if (opts->quiet)
+ strvec_push(&child.args, "--quiet");
+
+ strvec_pushf(&child.args, "+refs/heads/*:refs/prefetch/%s/*", remote);
+
+ return !!run_command(&child);
+}
+
+static int append_remote(struct remote *remote, void *cbdata)
+{
+ struct string_list *remotes = (struct string_list *)cbdata;
+
+ string_list_append(remotes, remote->name);
+ return 0;
+}
+
+static int maintenance_task_prefetch(struct maintenance_opts *opts)
+{
+ int result = 0;
+ struct string_list_item *item;
+ struct string_list remotes = STRING_LIST_INIT_DUP;
+
+ if (for_each_remote(append_remote, &remotes)) {
+ error(_("failed to fill remotes"));
+ result = 1;
+ goto cleanup;
+ }
+
+ for_each_string_list_item(item, &remotes)
+ result |= fetch_remote(item->string, opts);
+
+cleanup:
+ string_list_clear(&remotes, 0);
+ return result;
+}
+
static int maintenance_task_gc(struct maintenance_opts *opts)
{
struct child_process child = CHILD_PROCESS_INIT;
@@ -880,6 +926,7 @@ struct maintenance_task {
};
enum maintenance_task_label {
+ TASK_PREFETCH,
TASK_GC,
TASK_COMMIT_GRAPH,
@@ -888,6 +935,10 @@ enum maintenance_task_label {
};
static struct maintenance_task tasks[] = {
+ [TASK_PREFETCH] = {
+ "prefetch",
+ maintenance_task_prefetch,
+ },
[TASK_GC] = {
"gc",
maintenance_task_gc,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 4f6a04ddb1..0bade09c43 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -60,4 +60,30 @@ test_expect_success 'run --task duplicate' '
test_i18ngrep "cannot be selected multiple times" err
'
+test_expect_success 'run --task=prefetch with no remotes' '
+ git maintenance run --task=prefetch 2>err &&
+ test_must_be_empty err
+'
+
+test_expect_success 'prefetch multiple remotes' '
+ git clone . clone1 &&
+ git clone . clone2 &&
+ git remote add remote1 "file://$(pwd)/clone1" &&
+ git remote add remote2 "file://$(pwd)/clone2" &&
+ git -C clone1 switch -c one &&
+ git -C clone2 switch -c two &&
+ test_commit -C clone1 one &&
+ test_commit -C clone2 two &&
+ GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch 2>/dev/null &&
+ fetchargs="--prune --no-tags --no-write-fetch-head --recurse-submodules=no --refmap= --quiet" &&
+ test_subcommand git fetch remote1 $fetchargs +refs/heads/\\*:refs/prefetch/remote1/\\* <run-prefetch.txt &&
+ test_subcommand git fetch remote2 $fetchargs +refs/heads/\\*:refs/prefetch/remote2/\\* <run-prefetch.txt &&
+ test_path_is_missing .git/refs/remotes &&
+ git log prefetch/remote1/one &&
+ git log prefetch/remote2/two &&
+ git fetch --all &&
+ test_cmp_rev refs/remotes/remote1/one refs/prefetch/remote1/one &&
+ test_cmp_rev refs/remotes/remote2/two refs/prefetch/remote2/two
+'
+
test_done
--
gitgitgadget