From: Josh Steadmon <steadmon@google.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, peff@peff.net, jrnieder@google.com,
stolee@gmail.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 02/15] run-job: implement commit-graph job
Date: Wed, 20 May 2020 12:08:20 -0700 [thread overview]
Message-ID: <20200520190820.GB163566@google.com> (raw)
In-Reply-To: <cd3facd1e03923204bd2259ff0635a5bbf68d700.1585946894.git.gitgitgadget@gmail.com>
Thanks for this series, and sorry for the delayed (and partial) review.
I hope comments here are still useful. I don't have any comments on the
larger design that other people haven't already made, but I do have a
few small questions and suggestions, inline below.
On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The first job to implement in the 'git run-job' command is the
> 'commit-graph' job. It is based on the sequence of events in the
> 'commit-graph' job in Scalar [1]. This sequence is as follows:
>
> 1. git commit-graph write --reachable --split
> 2. git commit-graph verify --shallow
> 3. If the verify succeeds, stop.
> 4. Delete the commit-graph-chain file.
> 5. git commit-graph write --reachable --split
>
> By writing an incremental commit-graph file using the "--split"
> option we minimize the disruption from this operation. The default
> behavior is to merge layers until the new "top" layer is less than
> half the size of the layer below. This provides quick writes "most"
> of the time, with the longer writes following a power law
> distribution.
>
> Most importantly, concurrent Git processes only look at the
> commit-graph-chain file for a very short amount of time, so they
> will verly likely not be holding a handle to the file when we try
> to replace it. (This only matters on Windows.)
>
> If a concurrent process reads the old commit-graph-chain file, but
> our job expires some of the .graph files before they can be read,
> then those processes will see a warning message (but not fail).
> This could be avoided by a future update to use the --expire-time
> argument when writing the commit-graph.
>
> By using 'git commit-graph verify --shallow' we can ensure that
> the file we just wrote is valid. This is an extra safety precaution
> that is faster than our 'write' subcommand. In the rare situation
> that the newest layer of the commit-graph is corrupt, we can "fix"
> the corruption by deleting the commit-graph-chain file and rewrite
> the full commit-graph as a new one-layer commit graph. This does
> not completely prevent _that_ file from being corrupt, but it does
> recompute the commit-graph by parsing commits from the object
> database. In our use of this step in Scalar and VFS for Git, we
> have only seen this issue arise because our microsoft/git fork
> reverted 43d3561 ("commit-graph write: don't die if the existing
> graph is corrupt" 2019-03-25) for a while to keep commit-graph
> writes very fast. We dropped the revert when updating to v2.23.0.
> The verify still has potential for catching corrupt data across
> the layer boundary: if the new file has commit X with parent Y
> in an old file but the commit ID for Y in the old file had a
> bitswap, then we will notice that in the 'verify' command.
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs
>
> RFC QUESTIONS:
>
> 1. Are links like [1] helpful?
Yes, I appreciate being able to reference these. However, given that
these will show up in commit logs for ~forever, and ongoing work might
happen on Scalar, perhaps the links should be pinned to a specific
revision?
>
> 2. Can anyone think of a way to test the rewrite fallback?
> It requires corrupting the latest file between two steps of
> this one command, so that is a tricky spot to inject a fault.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-run-job.txt | 21 ++++++++++--
> builtin/run-job.c | 60 ++++++++++++++++++++++++++++++++++-
> commit-graph.c | 2 +-
> commit-graph.h | 1 +
> t/t7900-run-job.sh | 32 +++++++++++++++++++
> 5 files changed, 112 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 0627b3ed259..8bf2762d650 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
> SYNOPSIS
> --------
> [verse]
> -'git run-job <job-name>'
> +'git run-job commit-graph'
>
>
> DESCRIPTION
> @@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE.
> JOBS
> ----
>
> -TBD
> +'commit-graph'::
> +
> +The `commit-graph` job updates the `commit-graph` files incrementally,
> +then verifies that the written data is correct. If the new layer has an
> +issue, then the chain file is removed and the `commit-graph` is
> +rewritten from scratch.
> ++
> +The verification only checks the top layer of the `commit-graph` chain.
> +If the incremental write merged the new commits with at least one
> +existing layer, then there is potential for on-disk corruption being
> +carried forward into the new file. This will be noticed and the new
> +commit-graph file will be clean as Git reparses the commit data from
> +the object database.
> ++
> +The incremental write is safe to run alongside concurrent Git processes
> +since it will not expire `.graph` files that were in the previous
> +`commit-graph-chain` file. They will be deleted by a later run based on
> +the expiration delay.
>
>
> GIT
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index 2c78d053aa4..dd7709952d3 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,12 +1,65 @@
> #include "builtin.h"
> #include "config.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> #include "parse-options.h"
> +#include "repository.h"
> +#include "run-command.h"
>
> static char const * const builtin_run_job_usage[] = {
> - N_("git run-job"),
> + N_("git run-job commit-graph"),
> NULL
> };
>
> +static int run_write_commit_graph(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> + argv_array_pushl(&cmd, "commit-graph", "write",
> + "--split", "--reachable", "--no-progress",
> + NULL);
> +
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_verify_commit_graph(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> + argv_array_pushl(&cmd, "commit-graph", "verify",
> + "--shallow", "--no-progress", NULL);
> +
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_commit_graph_job(void)
> +{
> + char *chain_path;
> +
> + if (run_write_commit_graph()) {
> + error(_("failed to write commit-graph"));
> + return 1;
> + }
> +
> + if (!run_verify_commit_graph())
> + return 0;
> +
> + warning(_("commit-graph verify caught error, rewriting"));
> +
> + chain_path = get_chain_filename(the_repository->objects->odb);
Should we avoid new uses of `the_repository` and take a repo pointer
argument here instead?
> + if (unlink(chain_path)) {
> + UNLEAK(chain_path);
> + die(_("failed to remove commit-graph at %s"), chain_path);
> + }
> + free(chain_path);
> +
> + if (!run_write_commit_graph())
> + return 0;
Is there a reason why we don't verify the second write attempt?
> +
> + error(_("failed to rewrite commit-graph"));
> + return 1;
> +}
> +
> int cmd_run_job(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_run_job_options[] = {
> @@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
> builtin_run_job_usage,
> PARSE_OPT_KEEP_UNKNOWN);
>
> + if (argc > 0) {
> + if (!strcmp(argv[0], "commit-graph"))
> + return run_commit_graph_job();
> + }
> +
> usage_with_options(builtin_run_job_usage,
> builtin_run_job_options);
> }
> diff --git a/commit-graph.c b/commit-graph.c
> index f013a84e294..6867f92d04a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
> oid_hex);
> }
>
> -static char *get_chain_filename(struct object_directory *odb)
> +char *get_chain_filename(struct object_directory *odb)
> {
> return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
> }
> diff --git a/commit-graph.h b/commit-graph.h
> index e87a6f63600..8b7bd5a6cb1 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -13,6 +13,7 @@
> struct commit;
>
> char *get_commit_graph_filename(struct object_directory *odb);
> +char *get_chain_filename(struct object_directory *odb);
> int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
>
> /*
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 1eac80b7ed3..18b9bd26b3a 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -5,6 +5,8 @@ test_description='git run-job
> Testing the background jobs, in the foreground
> '
>
> +GIT_TEST_COMMIT_GRAPH=0
> +
> . ./test-lib.sh
>
> test_expect_success 'help text' '
> @@ -12,4 +14,34 @@ test_expect_success 'help text' '
> test_i18ngrep "usage: git run-job" err
> '
>
> +test_expect_success 'commit-graph job' '
> + git init server &&
> + (
> + cd server &&
> + chain=.git/objects/info/commit-graphs/commit-graph-chain &&
> + git checkout -b master &&
> + for i in $(test_seq 1 15)
> + do
> + test_commit $i || return 1
> + done &&
> + git run-job commit-graph 2>../err &&
> + test_must_be_empty ../err &&
> + test_line_count = 1 $chain &&
> + for i in $(test_seq 16 19)
> + do
> + test_commit $i || return 1
> + done &&
> + git run-job commit-graph 2>../err &&
> + test_must_be_empty ../err &&
> + test_line_count = 2 $chain &&
> + for i in $(test_seq 20 23)
> + do
> + test_commit $i || return 1
> + done &&
> + git run-job commit-graph 2>../err &&
> + test_must_be_empty ../err &&
> + test_line_count = 1 $chain
> + )
> +'
> +
> test_done
> --
> gitgitgadget
>
next prev parent reply other threads:[~2020-05-20 19:08 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
2020-04-05 15:10 ` Phillip Wood
2020-04-05 19:21 ` Junio C Hamano
2020-04-06 14:42 ` Derrick Stolee
2020-04-07 0:58 ` Danh Doan
2020-04-07 10:54 ` Derrick Stolee
2020-04-07 14:16 ` Danh Doan
2020-04-07 14:30 ` Johannes Schindelin
2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
2020-05-20 19:08 ` Josh Steadmon [this message]
2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
2020-04-05 15:14 ` Phillip Wood
2020-04-06 12:48 ` Derrick Stolee
2020-04-05 20:28 ` Junio C Hamano
2020-04-06 12:46 ` Derrick Stolee
2020-05-20 19:08 ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
2020-04-05 20:33 ` Junio C Hamano
2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
2020-05-27 22:17 ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
2020-04-05 15:18 ` Phillip Wood
2020-04-06 12:49 ` Derrick Stolee
2020-04-05 15:41 ` Phillip Wood
2020-04-06 12:57 ` Derrick Stolee
2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
2020-04-05 15:24 ` Phillip Wood
2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
2020-04-04 0:16 ` Derrick Stolee
2020-04-07 0:50 ` Danh Doan
2020-04-07 10:59 ` Derrick Stolee
2020-04-07 14:26 ` Danh Doan
2020-04-07 14:43 ` Johannes Schindelin
2020-04-07 1:48 ` brian m. carlson
2020-04-07 20:08 ` Junio C Hamano
2020-04-07 22:23 ` Johannes Schindelin
2020-04-08 0:01 ` brian m. carlson
2020-05-27 22:39 ` Josh Steadmon
2020-05-28 0:47 ` Junio C Hamano
2020-05-27 21:52 ` Johannes Schindelin
2020-05-28 14:48 ` Junio C Hamano
2020-05-28 14:50 ` Jonathan Nieder
2020-05-28 14:57 ` Junio C Hamano
2020-05-28 15:03 ` Jonathan Nieder
2020-05-28 15:30 ` Derrick Stolee
2020-05-28 4:39 ` Johannes Schindelin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200520190820.GB163566@google.com \
--to=steadmon@google.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=jrnieder@google.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).