git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
@ 2018-04-17 18:10 Derrick Stolee
  2018-04-17 18:10 ` [RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information Derrick Stolee
                   ` (13 more replies)
  0 siblings, 14 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
  but may not be required to start. The goal is to have this default to
  true eventually, but we may want to delay that until the feature is
  stable.

* I implement a "--reachable" option to 'git commit-graph write' that
  iterates over all refs. This does the same as 

	git show-ref -s | git commit-graph write --stdin-commits

  but I don't know how to pipe two child processes together inside of Git.
  Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
  fixup! commit-graph: always load commit-graph information
  commit-graph: add 'check' subcommand
  commit-graph: check file header information
  commit-graph: parse commit from chosen graph
  commit-graph: check fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: check commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/git-commit-graph.txt       |  15 +-
 Documentation/git-gc.txt                 |   4 +
 Documentation/technical/commit-graph.txt |   9 --
 builtin/commit-graph.c                   |  79 +++++++++-
 builtin/fsck.c                           |  13 ++
 builtin/gc.c                             |   8 +
 commit-graph.c                           | 178 ++++++++++++++++++++++-
 commit-graph.h                           |   2 +
 commit.c                                 |  14 +-
 commit.h                                 |   1 +
 t/t5318-commit-graph.sh                  |  15 ++
 11 files changed, 311 insertions(+), 27 deletions(-)

-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-17 18:10 ` [RFC PATCH 02/12] commit-graph: add 'check' subcommand Derrick Stolee
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..3f0c142603 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 		*pos = item->graph_pos;
 		return 1;
 	} else {
-		return bsearch_graph(commit_graph, &(item->object.oid), pos);
+		return bsearch_graph(g, &(item->object.oid), pos);
 	}
 }
 
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 03/12] commit-graph: check file header information
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  2018-04-17 18:10 ` [RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information Derrick Stolee
  2018-04-17 18:10 ` [RFC PATCH 02/12] commit-graph: add 'check' subcommand Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-19 15:58   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

During a run of 'git commit-graph check', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index cd0634bba0..c5e5a0f860 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir,
 	oids.nr = 0;
 }
 
+static int check_commit_graph_error;
+#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); }
+
 int check_commit_graph(struct commit_graph *g)
 {
-	return !g;
+	if (!g) {
+		graph_report(_("no commit-graph file loaded"));
+		return 1;
+	}
+
+	check_commit_graph_error = 0;
+
+	if (get_be32(g->data) != GRAPH_SIGNATURE)
+		graph_report(_("commit-graph file has incorrect header"));
+
+	if (*(g->data + 4) != 1)
+		graph_report(_("commit-graph file version is not 1"));
+	if (*(g->data + 5) != GRAPH_OID_VERSION)
+		graph_report(_("commit-graph OID version is not 1 (SHA1)"));
+
+	if (!g->chunk_oid_fanout)
+		graph_report(_("commit-graph is missing the OID Fanout chunk"));
+	if (!g->chunk_oid_lookup)
+		graph_report(_("commit-graph is missing the OID Lookup chunk"));
+	if (!g->chunk_commit_data)
+		graph_report(_("commit-graph is missing the Commit Data chunk"));
+	if (g->hash_len != GRAPH_OID_LEN)
+		graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len);
+
+	return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 02/12] commit-graph: add 'check' subcommand
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  2018-04-17 18:10 ` [RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-19 13:24   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 03/12] commit-graph: check file header information Derrick Stolee
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

If the commit-graph file becomes corrupt, we need a way to verify
its contents match the object database. In the manner of 'git fsck'
we will implement a 'git commit-graph check' subcommand to report
all issues with the file.

Add the 'check' subcommand to the 'commit-graph' builtin and its
documentation. Add a simple test that ensures the command returns
a zero error code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  7 +++++-
 builtin/commit-graph.c             | 38 ++++++++++++++++++++++++++++++
 commit-graph.c                     |  5 ++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            |  5 ++++
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 4c97b555cc..93c7841ba2 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 --------
 [verse]
+'git commit-graph check' [--object-dir <dir>]
 'git commit-graph read' [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
-
 DESCRIPTION
 -----------
 
@@ -52,6 +52,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'check'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 --------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..77c1a04932 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,11 +7,17 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
+	N_("git commit-graph check [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_check_usage[] = {
+	N_("git commit-graph check [--object-dir <objdir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
 	int append;
 } opts;
 
+
+static int graph_check(int argc, const char **argv)
+{
+	struct commit_graph *graph = 0;
+	char *graph_name;
+
+	static struct option builtin_commit_graph_check_options[] = {
+		OPT_STRING(0, "object-dir", &opts.obj_dir,
+			   N_("dir"),
+			   N_("The object directory to store the graph")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_check_options,
+			     builtin_commit_graph_check_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	graph_name = get_commit_graph_filename(opts.obj_dir);
+	graph = load_commit_graph_one(graph_name);
+
+	if (!graph)
+		die("graph file %s does not exist", graph_name);
+	FREE_AND_NULL(graph_name);
+
+	return check_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "check"))
+			return graph_check(argc, argv);
 		if (!strcmp(argv[0], "read"))
 			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 3f0c142603..cd0634bba0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir,
 	oids.alloc = 0;
 	oids.nr = 0;
 }
+
+int check_commit_graph(struct commit_graph *g)
+{
+	return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..e8c8d99dff 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
 			int nr_commits,
 			int append);
 
+int check_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..e91053271a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+test_expect_success 'git commit-graph check' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph check >output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 04/12] commit-graph: parse commit from chosen graph
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 03/12] commit-graph: check file header information Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-19 17:21   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 05/12] commit-graph: check fanout and lookup table Derrick Stolee
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

Before checking a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c5e5a0f860..6d0d303a7a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
 	uint32_t pos;
 
 	if (item->object.parsed)
-		return 0;
+		return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+		return fill_commit_in_graph(item, g, pos);
+
+	return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
 	if (!core_commit_graph)
 		return 0;
+
 	prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-		return fill_commit_in_graph(item, commit_graph, pos);
+	if (commit_graph)
+		return parse_commit_in_graph_one(commit_graph, item);
 	return 0;
 }
 
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 05/12] commit-graph: check fanout and lookup table
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (3 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20  7:27   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 06/12] commit: force commit to parse from object database Derrick Stolee
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

While running 'git commit-graph check', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

In anticipation of checking the commits in the commit-graph file
against the object database, parse the commits from that file in
advance. We perform this parse now to ensure the object cache contains
only commits from this commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 6d0d303a7a..6e3c08cd5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -835,6 +835,9 @@ static int check_commit_graph_error;
 
 int check_commit_graph(struct commit_graph *g)
 {
+	uint32_t i, cur_fanout_pos = 0;
+	struct object_id prev_oid, cur_oid;
+
 	if (!g) {
 		graph_report(_("no commit-graph file loaded"));
 		return 1;
@@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g)
 	if (g->hash_len != GRAPH_OID_LEN)
 		graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len);
 
+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report(_("commit-graph has incorrect oid order: %s then %s"),
+
+		oid_to_hex(&prev_oid),
+		oid_to_hex(&cur_oid));
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+
+		graph_commit = lookup_commit(&cur_oid);
+
+		if (!parse_commit_in_graph_one(g, graph_commit))
+			graph_report(_("failed to parse %s from commit-graph"), oid_to_hex(&cur_oid));
+
+		if (graph_commit->graph_pos != i)
+			graph_report(_("graph_pos for commit %s is %u != %u"), oid_to_hex(&cur_oid),
+				     graph_commit->graph_pos, i);
+	}
+
 	return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 06/12] commit: force commit to parse from object database
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (4 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 05/12] commit-graph: check fanout and lookup table Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 12:13   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

In anticipation of checking commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 14 ++++++++++----
 commit.h |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 9ef6f699bd..07752d8503 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
 	enum object_type type;
 	void *buffer;
@@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (parse_commit_in_graph(item))
+	if (use_commit_graph && parse_commit_in_graph(item))
 		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
 			error("Could not read %s",
-			     oid_to_hex(&item->object.oid));
+					oid_to_hex(&item->object.oid));
 	if (type != OBJ_COMMIT) {
 		free(buffer);
 		return error("Object %s not a commit",
-			     oid_to_hex(&item->object.oid));
+				oid_to_hex(&item->object.oid));
 	}
 	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
@@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 	return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+	return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
 	if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 07/12] commit-graph: load a root tree from specific graph
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (5 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 06/12] commit: force commit to parse from object database Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 12:18   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 08/12] commit-graph: verify commit contents against odb Derrick Stolee
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6e3c08cd5c..80a2ac2a6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -354,14 +354,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *
 	return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+						 const struct commit *c)
 {
 	if (c->maybe_tree)
 		return c->maybe_tree;
 	if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
 		BUG("get_commit_tree_in_graph called from non-commit-graph commit");
 
-	return load_tree_for_commit(commit_graph, (struct commit *)c);
+	return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+	return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 08/12] commit-graph: verify commit contents against odb
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (6 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 16:47   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 10/12] commit-graph: add '--reachable' option Derrick Stolee
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

When running 'git commit-graph check', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 80a2ac2a6a..b5bce2bac4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g)
 				     graph_commit->graph_pos, i);
 	}
 
+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit, *odb_commit;
+		struct commit_list *graph_parents, *odb_parents;
+		int num_parents = 0;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		graph_commit = lookup_commit(&cur_oid);
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0))
+			graph_report(_("failed to parse %s from object database"), oid_to_hex(&cur_oid));
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"),
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""),
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
+
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			num_parents++;
+
+			if (odb_parents == NULL)
+				graph_report(_("commit-graph parent list for commit %s is too long (%d)"),
+					     oid_to_hex(&cur_oid),
+					     num_parents);
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report(_("commit-graph parent for %s is %s != %s"),
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+					     oid_to_hex(&odb_parents->item->object.oid));
+
+			graph_parents = graph_parents->next;
+			odb_parents = odb_parents->next;
+		}
+
+		if (odb_parents != NULL)
+			graph_report(_("commit-graph parent list for commit %s terminates early"),
+				     oid_to_hex(&cur_oid));
+
+		if (graph_commit->generation) {
+			uint32_t max_generation = 0;
+			graph_parents = graph_commit->parents;
+
+			while (graph_parents) {
+				if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
+				    graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
+					graph_report(_("commit-graph has valid generation for %s but not its parent, %s"),
+						     oid_to_hex(&cur_oid),
+						     oid_to_hex(&graph_parents->item->object.oid));
+				if (graph_parents->item->generation > max_generation)
+					max_generation = graph_parents->item->generation;
+				graph_parents = graph_parents->next;
+			}
+
+			if (graph_commit->generation != max_generation + 1)
+				graph_report(_("commit-graph has incorrect generation for %s"),
+					     oid_to_hex(&cur_oid));
+		} else {
+			graph_parents = graph_commit->parents;
+
+			while (graph_parents) {
+				if (graph_parents->item->generation)
+					graph_report(_("commit-graph has generation ZERO for %s but not its parent, %s"),
+						     oid_to_hex(&cur_oid),
+						     oid_to_hex(&graph_parents->item->object.oid));
+				graph_parents = graph_parents->next;
+			}
+		}
+	}
+
 	return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 09/12] fsck: check commit-graph
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 10/12] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 16:59   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 11/12] gc: automatically write commit-graph files Derrick Stolee
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

If a commit-graph file exists, check its contents during 'git fsck'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/fsck.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..9712f230ba 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	}
 
 	check_connectivity();
+
+	if (core_commit_graph) {
+		struct child_process commit_graph_check = CHILD_PROCESS_INIT;
+		const char *check_argv[] = { "commit-graph", "check", NULL, NULL };
+		commit_graph_check.argv = check_argv;
+		commit_graph_check.git_cmd = 1;
+
+		if (run_command(&commit_graph_check))
+			errors_found |= ERROR_COMMIT_GRAPH;
+	}
+
 	return errors_found;
 }
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 10/12] commit-graph: add '--reachable' option
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (7 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 08/12] commit-graph: verify commit contents against odb Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 17:17   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 09/12] fsck: check commit-graph Derrick Stolee
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 ++++--
 builtin/commit-graph.c             | 41 +++++++++++++++++++++++++++---
 t/t5318-commit-graph.sh            | 10 ++++++++
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 93c7841ba2..1b14d40590 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 77c1a04932..a89285ada8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph check [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *obj_dir;
+	int reachable;
 	int stdin_packs;
 	int stdin_commits;
 	int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
 	return 0;
 }
 
+struct hex_list {
+	char **hex_strs;
+	int hex_nr;
+	int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+			   const struct object_id *oid,
+			   int flags, void *cb_data)
+{
+	struct hex_list *list = (struct hex_list*)cb_data;
+
+	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+	list->hex_nr++;
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
 			N_("dir"),
 			N_("The object directory to store the graph")),
+		OPT_BOOL(0, "reachable", &opts.reachable,
+			N_("start walk at all refs")),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			N_("scan pack-indexes listed by stdin for commits")),
 		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
 
-	if (opts.stdin_packs && opts.stdin_commits)
-		die(_("cannot use both --stdin-commits and --stdin-packs"));
+	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
 
@@ -164,6 +187,16 @@ static int graph_write(int argc, const char **argv)
 			commit_hex = lines;
 			commits_nr = lines_nr;
 		}
+	} else if (opts.reachable) {
+		struct hex_list list;
+		list.hex_nr = 0;
+		list.hex_alloc = 128;
+		ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
+
+		for_each_ref(add_ref_to_list, &list);
+
+		commit_hex = (const char **)list.hex_strs;
+		commits_nr = list.hex_nr;
 	}
 
 	write_commit_graph(opts.obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index e91053271a..ccadd22f57 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -200,6 +200,16 @@ test_expect_success 'build graph from commits with append' '
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
 
+test_expect_success 'build graph using --reachable' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph write --reachable &&
+	test_path_is_file $objdir/info/commit-graph &&
+	graph_read_expect "11" "large_edges"
+'
+
+graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
+graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd "$TRASH_DIRECTORY" &&
 	git clone --bare --no-local full bare &&
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 11/12] gc: automatically write commit-graph files
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (9 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 09/12] fsck: check commit-graph Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 17:34   ` Jakub Narebski
  2018-04-17 18:10 ` [RFC PATCH 12/12] commit-graph: update design document Derrick Stolee
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-gc.txt | 4 ++++
 builtin/gc.c             | 8 ++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..070f2a7a3d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
 	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	argv_array_pushl(&prune, "prune", "--expire", NULL);
 	argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", NULL);
 	argv_array_pushl(&rerere, "rerere", "gc", NULL);
+	argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL);
 
 	/* default expiry time, overwritten in gc_config */
 	gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();
 
+	if (gc_commit_graph)
+		if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, commit_graph.argv[0]);
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [RFC PATCH 12/12] commit-graph: update design document
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (10 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 11/12] gc: automatically write commit-graph files Derrick Stolee
@ 2018-04-17 18:10 ` Derrick Stolee
  2018-04-20 19:10   ` Jakub Narebski
  2018-04-17 18:50 ` [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
  13 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:10 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, Derrick Stolee

The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index d9f2713efa..d04657b781 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -142,12 +139,6 @@ Future Work
   such as "ensure_tree_loaded(commit)" that fully loads a tree before
   using commit->tree.
 
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.17.0.39.g685157f7fb


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (11 preceding siblings ...)
  2018-04-17 18:10 ` [RFC PATCH 12/12] commit-graph: update design document Derrick Stolee
@ 2018-04-17 18:50 ` Derrick Stolee
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
  13 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-04-17 18:50 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: peff, sbeller

On 4/17/2018 2:10 PM, Derrick Stolee wrote:
> The commit-graph feature is not useful to end users until the
> commit-graph file is maintained automatically by Git during normal
> upkeep operations. One natural place to trigger this write is during
> 'git gc'.
>
> Before automatically generating a commit-graph, we need to be able to
> verify the contents of a commit-graph file. Integrate commit-graph
> checks into 'fsck' that check the commit-graph contents against commits
> in the object database.
>
> Things to think about:
>
> * Are these the right integration points?
>
> * gc.commitGraph defaults to true right now for the purpose of testing,
>    but may not be required to start. The goal is to have this default to
>    true eventually, but we may want to delay that until the feature is
>    stable.
>
> * I implement a "--reachable" option to 'git commit-graph write' that
>    iterates over all refs. This does the same as
>
> 	git show-ref -s | git commit-graph write --stdin-commits
>
>    but I don't know how to pipe two child processes together inside of Git.
>    Perhaps this is a better solution, anyway.
>
> What other things should I be considering in this case? I'm unfamiliar
> with the inner-workings of 'fsck' and 'gc', so this is a new space for me.
>
> This RFC is based on v3 of ds/generation-numbers, and the first commit
> is a fixup! based on a bug in that version that I caught while prepping
> this series.
>
> Thanks,
> -Stolee
>
> Derrick Stolee (12):
>    fixup! commit-graph: always load commit-graph information
>    commit-graph: add 'check' subcommand
>    commit-graph: check file header information
>    commit-graph: parse commit from chosen graph
>    commit-graph: check fanout and lookup table
>    commit: force commit to parse from object database
>    commit-graph: load a root tree from specific graph
>    commit-graph: verify commit contents against odb
>    fsck: check commit-graph
>    commit-graph: add '--reachable' option
>    gc: automatically write commit-graph files
>    commit-graph: update design document
>
>   Documentation/git-commit-graph.txt       |  15 +-
>   Documentation/git-gc.txt                 |   4 +
>   Documentation/technical/commit-graph.txt |   9 --
>   builtin/commit-graph.c                   |  79 +++++++++-
>   builtin/fsck.c                           |  13 ++
>   builtin/gc.c                             |   8 +
>   commit-graph.c                           | 178 ++++++++++++++++++++++-
>   commit-graph.h                           |   2 +
>   commit.c                                 |  14 +-
>   commit.h                                 |   1 +
>   t/t5318-commit-graph.sh                  |  15 ++
>   11 files changed, 311 insertions(+), 27 deletions(-)

This RFC is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 02/12] commit-graph: add 'check' subcommand
  2018-04-17 18:10 ` [RFC PATCH 02/12] commit-graph: add 'check' subcommand Derrick Stolee
@ 2018-04-19 13:24   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-19 13:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff\, sbeller\

Derrick Stolee <dstolee@microsoft.com> writes:

> If the commit-graph file becomes corrupt, we need a way to verify
> its contents match the object database. In the manner of 'git fsck'
> we will implement a 'git commit-graph check' subcommand to report
> all issues with the file.

Bikeshed: should the subcommand be called 'check' or 'verify'?

>
> Add the 'check' subcommand to the 'commit-graph' builtin and its
> documentation. Add a simple test that ensures the command returns
> a zero error code.

It would be nice to have the information that the 'check' subcommand is
currently a [almost no-op] stub in the subject... but that might not
have been possible to fit.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt |  7 +++++-
>  builtin/commit-graph.c             | 38 ++++++++++++++++++++++++++++++
>  commit-graph.c                     |  5 ++++
>  commit-graph.h                     |  2 ++
>  t/t5318-commit-graph.sh            |  5 ++++
>  5 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 4c97b555cc..93c7841ba2 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files
>  SYNOPSIS
>  --------
>  [verse]
> +'git commit-graph check' [--object-dir <dir>]
>  'git commit-graph read' [--object-dir <dir>]
>  'git commit-graph write' <options> [--object-dir <dir>]

I still think that [--object-dir <dir>] should be the optional parameter
to the "git" wrapper, not to the "git commit-graph" command, i.e.

   'git [--object-dir=<dir>] commit-graph <command>'

But this can be done later, in a separate patch series.

>  
> -

Stray change.

>  DESCRIPTION
>  -----------
>  
> @@ -52,6 +52,11 @@ existing commit-graph file.
>  Read a graph file given by the commit-graph file and output basic
>  details about the graph file. Used for debugging purposes.
>  
> +'check'::
> +
> +Read the commit-graph file and verify its contents against the object
> +database. Used to check for corrupted data.
> +

I wonder if we should offer to verify file without checking against the
object database (which is the costly part, I think).  But this too can
be added later if needed.

>  
>  EXAMPLES
>  --------
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 37420ae0fd..77c1a04932 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -7,11 +7,17 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--object-dir <objdir>]"),
> +	N_("git commit-graph check [--object-dir <objdir>]"),
>  	N_("git commit-graph read [--object-dir <objdir>]"),
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
>  	NULL

Isn't the case that each command would support the
[--object-dir <objdir>] parameter?

>  };
>  
> +static const char * const builtin_commit_graph_check_usage[] = {
> +	N_("git commit-graph check [--object-dir <objdir>]"),
> +	NULL
> +};
> +

Looks good to me.

>  static const char * const builtin_commit_graph_read_usage[] = {
>  	N_("git commit-graph read [--object-dir <objdir>]"),
>  	NULL
> @@ -29,6 +35,36 @@ static struct opts_commit_graph {
>  	int append;
>  } opts;
>  
> +
> +static int graph_check(int argc, const char **argv)
> +{
> +	struct commit_graph *graph = 0;

This is NULL, isn't it?  Shouldn't it be stated as such?

> +	char *graph_name;
> +
> +	static struct option builtin_commit_graph_check_options[] = {
> +		OPT_STRING(0, "object-dir", &opts.obj_dir,
> +			   N_("dir"),
> +			   N_("The object directory to store the graph")),

This again is not the directory to _store_ the graph; this is the
directory to _read_ graph from, or directory where the commit graph is
_stored_.

> +		OPT_END(),
> +	};
> +
> +	argc = parse_options(argc, argv, NULL,
> +			     builtin_commit_graph_check_options,
> +			     builtin_commit_graph_check_usage, 0);
> +
> +	if (!opts.obj_dir)
> +		opts.obj_dir = get_object_directory();
> +
> +	graph_name = get_commit_graph_filename(opts.obj_dir);
> +	graph = load_commit_graph_one(graph_name);
> +
> +	if (!graph)
> +		die("graph file %s does not exist", graph_name);

Shouldn't we quote pathname?  Shouldn't this error message be marked for
translation?  Shouldn't we use "commit graph file" explicitly?

> +	FREE_AND_NULL(graph_name);
> +
> +	return check_commit_graph(graph);
> +}
> +
>  static int graph_read(int argc, const char **argv)
>  {
>  	struct commit_graph *graph = NULL;
> @@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			     PARSE_OPT_STOP_AT_NON_OPTION);
>  
>  	if (argc > 0) {
> +		if (!strcmp(argv[0], "check"))
> +			return graph_check(argc, argv);
>  		if (!strcmp(argv[0], "read"))
>  			return graph_read(argc, argv);
>  		if (!strcmp(argv[0], "write"))
> diff --git a/commit-graph.c b/commit-graph.c
> index 3f0c142603..cd0634bba0 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir,
>  	oids.alloc = 0;
>  	oids.nr = 0;
>  }
> +
> +int check_commit_graph(struct commit_graph *g)
> +{
> +	return !g;
> +}

I understand that it is just a start of implementing this feature, but
it looks a bit strange that 'read' command does more sanity checks that
the 'check' command...

> diff --git a/commit-graph.h b/commit-graph.h
> index 96cccb10f3..e8c8d99dff 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
>  			int nr_commits,
>  			int append);
>  
> +int check_commit_graph(struct commit_graph *g);
> +
>  #endif
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 77d85aefe7..e91053271a 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
>  	test_cmp expect output
>  '
>  
> +test_expect_success 'git commit-graph check' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit-graph check >output
> +'

There should also be negative check, that 'git commit-graph check' fails
if there is no commit-graph file, isn't it?

> +
>  test_done

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 03/12] commit-graph: check file header information
  2018-04-17 18:10 ` [RFC PATCH 03/12] commit-graph: check file header information Derrick Stolee
@ 2018-04-19 15:58   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-19 15:58 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> During a run of 'git commit-graph check', list the issues with the
> header information in the commit-graph file. Some of this information
> is inferred from the loaded 'struct commit_graph'.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cd0634bba0..c5e5a0f860 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir,
>  	oids.nr = 0;
>  }
>  
> +static int check_commit_graph_error;
> +#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); }

Shouldn't 'do { ... } while(0);' trick be used here, like e.g. for
trace_performance macro?

> +
>  int check_commit_graph(struct commit_graph *g)
>  {
> -	return !g;
> +	if (!g) {
> +		graph_report(_("no commit-graph file loaded"));
> +		return 1;
> +	}
> +
> +	check_commit_graph_error = 0;
> +

The load_commit_graph_one() function does its own checks, some of whose
are present below, and some of whose are missing.

If it is used, then why duplicate tests - you would not get here as you
would die earlier.

If it is not used, then some tests are missing.

> +	if (get_be32(g->data) != GRAPH_SIGNATURE)
> +		graph_report(_("commit-graph file has incorrect header"));

The load_commit_graph_one() shows more detailed information:

                     "graph signature %X does not match signature %X",
		      graph_signature, GRAPH_SIGNATURE)

Also, load_commit_graph_one() checks that the file is not too short, and
we actually can access whole header.

> +
> +	if (*(g->data + 4) != 1)
> +		graph_report(_("commit-graph file version is not 1"));

Again:

                     "graph version %X does not match version %X",
		      graph_version, GRAPH_VERSION

Also, here we hardcode the commit-graph file version to 1.

Accidentally, don't we offer backward compatibility, in that if git can
read commit-graph file version 2, it can also read commit-graph file
version 1?

> +	if (*(g->data + 5) != GRAPH_OID_VERSION)
> +		graph_report(_("commit-graph OID version is not 1 (SHA1)"));

In one part we use symbolic constant, on the other hardcoded values.  If
GRAPH_OID_VERSION changes, what then?

Also:

                     "hash version %X does not match version %X",
		      hash_version, GRAPH_OID_VERSION

> +
> +	if (!g->chunk_oid_fanout)
> +		graph_report(_("commit-graph is missing the OID Fanout chunk"));
> +	if (!g->chunk_oid_lookup)
> +		graph_report(_("commit-graph is missing the OID Lookup chunk"));
> +	if (!g->chunk_commit_data)
> +		graph_report(_("commit-graph is missing the Commit Data chunk"));

All right.

> +	if (g->hash_len != GRAPH_OID_LEN)
> +		graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len);

We could be more detailed in error report: what hash length should be,
then?

> +
> +	return check_commit_graph_error;
>  }

No tests of malformed commit-graph file?

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 04/12] commit-graph: parse commit from chosen graph
  2018-04-17 18:10 ` [RFC PATCH 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-04-19 17:21   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-19 17:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> Before checking a commit-graph file against the object database, we

Actually there is quite a few checks more that can be done without
accessing the object database... I'll take a look at later commits why
this one is that relatively early in the series.

> need to parse all commits from the given commit-graph file. Create
> parse_commit_in_graph_one() to target a given struct commit_graph.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index c5e5a0f860..6d0d303a7a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	}
>  }
>  
> -int parse_commit_in_graph(struct commit *item)
> +int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
>  {
>  	uint32_t pos;
>  
>  	if (item->object.parsed)
> -		return 0;
> +		return 1;

I am confused and befuddled by those apparent changes between returning
0 or returning 1 when object was parsed.

> +
> +	if (find_commit_in_graph(item, g, &pos))
> +		return fill_commit_in_graph(item, g, pos);
> +
> +	return 0;
> +}
> +
> +int parse_commit_in_graph(struct commit *item)
> +{
>  	if (!core_commit_graph)
>  		return 0;
> +
>  	prepare_commit_graph();
> -	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> -		return fill_commit_in_graph(item, commit_graph, pos);
> +	if (commit_graph)
> +		return parse_commit_in_graph_one(commit_graph, item);
>  	return 0;
>  }

Seems all right.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 05/12] commit-graph: check fanout and lookup table
  2018-04-17 18:10 ` [RFC PATCH 05/12] commit-graph: check fanout and lookup table Derrick Stolee
@ 2018-04-20  7:27   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20  7:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> While running 'git commit-graph check', verify that the object IDs
> are listed in lexicographic order and that the fanout table correctly
> navigates into that list of object IDs.

All right.  I think we can also sanity check the fanout table (for
example that it has 256 elements), see below.

>
> In anticipation of checking the commits in the commit-graph file
> against the object database, parse the commits from that file in
> advance. We perform this parse now to ensure the object cache contains
> only commits from this commit-graph file.

I guess this part could be a separate commit (a separate patch), because
it is not connected to the earlier part.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)

No tests that it detects broken commit-graph file (e.g. one that is
truncated)?

>
> diff --git a/commit-graph.c b/commit-graph.c
> index 6d0d303a7a..6e3c08cd5c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -835,6 +835,9 @@ static int check_commit_graph_error;
>  
>  int check_commit_graph(struct commit_graph *g)
>  {
> +	uint32_t i, cur_fanout_pos = 0;
> +	struct object_id prev_oid, cur_oid;
> +
>  	if (!g) {
>  		graph_report(_("no commit-graph file loaded"));
>  		return 1;
> @@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g)
>  	if (g->hash_len != GRAPH_OID_LEN)
>  		graph_report(_("commit-graph has incorrect hash length: %d"), g->hash_len);
>  
> +	for (i = 0; i < g->num_commits; i++) {
> +		struct commit *graph_commit;
> +
> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> +
> +		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
> +			graph_report(_("commit-graph has incorrect oid order: %s then %s"),

Good.  Reporting what problem is; we could have also reported the
position at which there is this problem.

> +
> +		oid_to_hex(&prev_oid),
> +		oid_to_hex(&cur_oid));
> +		oidcpy(&prev_oid, &cur_oid);
> +
> +		while (cur_oid.hash[0] > cur_fanout_pos) {
> +			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
> +			if (i != fanout_value)
> +				graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),

Good.  Reporting details of the problem.

> +					     cur_fanout_pos, fanout_value, i);
> +
> +			cur_fanout_pos++;
> +		}

One thing you don't check here is that fanout is closed, that is all the
rest of fanout data up to 256th element (if they are any) all points at
the same position past the last element of OID Lookup chunk.

> +
> +		graph_commit = lookup_commit(&cur_oid);
> +
> +		if (!parse_commit_in_graph_one(g, graph_commit))
> +			graph_report(_("failed to parse %s from commit-graph"), oid_to_hex(&cur_oid));

Doesn't whis check Commit Data (CDAT) chunk, and therefore should better
be in a separate commit?

> +
> +		if (graph_commit->graph_pos != i)
> +			graph_report(_("graph_pos for commit %s is %u != %u"), oid_to_hex(&cur_oid),
> +				     graph_commit->graph_pos, i);

Hmmm... it seems to me that the above does not check that commit-graph
file is correct, but that the parsing code is correct.

> +	}
> +
>  	return check_commit_graph_error;
>  }

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 06/12] commit: force commit to parse from object database
  2018-04-17 18:10 ` [RFC PATCH 06/12] commit: force commit to parse from object database Derrick Stolee
@ 2018-04-20 12:13   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 12:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> In anticipation of checking commit-graph file contents against the
> object database, create parse_commit_internal() to allow side-stepping
> the commit-graph file and parse directly from the object database.

Nitpick/Bikeshed painting: do we have any naming convention for such
functions (*_internal() here)?

>
> Due to the use of generation numbers, this method should not be called
> unless the intention is explicit in avoiding commits from the
> commit-graph file.

Looks good to me, except for some stray whitespace changes in the
patch.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 14 ++++++++++----
>  commit.h |  1 +
>  2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/commit.c b/commit.c
> index 9ef6f699bd..07752d8503 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>  	return 0;
>  }
>  
> -int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +

Stray empty line, though I think it may improve readability of the code
by using two empty lines between separate functions.

But to be consistent with the rest of the file, there shouldn't be this
extra empty line.

> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
>  {
>  	enum object_type type;
>  	void *buffer;
> @@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (parse_commit_in_graph(item))
> +	if (use_commit_graph && parse_commit_in_graph(item))
>  		return 0;

All right.

>  	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
>  	if (!buffer)
>  		return quiet_on_missing ? -1 :
>  			error("Could not read %s",
> -			     oid_to_hex(&item->object.oid));
> +					oid_to_hex(&item->object.oid));

Stray whitespace change (looks like spaces to tabs conversion).

>  	if (type != OBJ_COMMIT) {
>  		free(buffer);
>  		return error("Object %s not a commit",
> -			     oid_to_hex(&item->object.oid));
> +				oid_to_hex(&item->object.oid));

Stray whitespace change (looks like spaces to tabs conversion).

>  	}
>  	ret = parse_commit_buffer(item, buffer, size, 0);
>  	if (save_commit_buffer && !ret) {
> @@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  	return ret;
>  }
>  
> +int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +{
> +	return parse_commit_internal(item, quiet_on_missing, 1);
> +}

All right; if it is internal details of implementations, I don't mind
this slightly cryptic "1" as the value of last parameters.

> +
>  void parse_commit_or_die(struct commit *item)
>  {
>  	if (parse_commit(item))
> diff --git a/commit.h b/commit.h
> index b5afde1ae9..5fde74fcd7 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>  
>  int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
>  int parse_commit_gently(struct commit *item, int quiet_on_missing);
>  static inline int parse_commit(struct commit *item)
>  {

All right.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 07/12] commit-graph: load a root tree from specific graph
  2018-04-17 18:10 ` [RFC PATCH 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-04-20 12:18   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 12:18 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> When lazy-loading a tree for a commit, it will be important to select
> the tree from a specific struct commit_graph. Create a new method that
> specifies the commit-graph file and use that in
> get_commit_tree_in_graph().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Looks good to me.

> ---
>  commit-graph.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
[...]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 08/12] commit-graph: verify commit contents against odb
  2018-04-17 18:10 ` [RFC PATCH 08/12] commit-graph: verify commit contents against odb Derrick Stolee
@ 2018-04-20 16:47   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 16:47 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

One more check that could been done, and which do not require accessing
the object database, would be testing correctness of the Large Edge List
(EDGE) chunk.

For each commit in the commit-graph (in the Commit Data (CDAT) chunk),
check if it has more than two parents (if the value for second parent is
different from 0xffffffff but has the most significant bit set).  If
there is any such commit, then.

1. Check that EDGE chunk exists
2. For each octopus merge:
   - check that the index into EDGE array is less than number of its
     elements (sanity check the index)
   - for each parent in the EDGE list, check if the position is valid:
     is less than number of commits in the commit-graph
   - check that list of parents in EDGE terminates
3. If feasible, also check that
   - all entries in EDGE chunk are referenced directly or indirectly
   - number of parent list terminators (with most significant bit set)
     is equal to the number of octopus merges (no overlap of parent
     lists) -- if it is considered an error

> When running 'git commit-graph check', compare the contents of the
> commits that are loaded from the commit-graph file with commits that are
> loaded directly from the object database. This includes checking the
> root tree object ID, commit date, and parents.

All right, this part requires checking the object database.

>
> In addition, verify the generation number calculation is correct for all
> commits in the commit-graph file.

But if I understand the code correctly, this one does not require
accessing the object database.  This is entirely separate check, and in
my opinion it should be a separate commit (a separate patch).

Also, there might be a problem with handling GENERATION_NUMBER_MAX.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)

I guess testing for this would be hard - you would need to create
invalid commit-graph file...

>
> diff --git a/commit-graph.c b/commit-graph.c
> index 80a2ac2a6a..b5bce2bac4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g)
>  				     graph_commit->graph_pos, i);
>  	}
>  
> +	for (i = 0; i < g->num_commits; i++) {
> +		struct commit *graph_commit, *odb_commit;
> +		struct commit_list *graph_parents, *odb_parents;
> +		int num_parents = 0;
> +
> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> +
> +		graph_commit = lookup_commit(&cur_oid);
> +		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
> +		if (parse_commit_internal(odb_commit, 0, 0))
> +			graph_report(_("failed to parse %s from object database"), oid_to_hex(&cur_oid));

Is it an error to have commit in the commit-graph that is not present in
the object database?

It may happen (if commit-graph is not automatically updated on gc and
pruning), that since creating the commit-graph, the user have deleted
the branch and pruned object database -- then commit-graph can contain
objects that are not present in the object database.

> +
> +		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
> +			   get_commit_tree_oid(odb_commit)))
> +			graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"),
> +				     oid_to_hex(&cur_oid),
> +				     oid_to_hex(get_commit_tree_oid(graph_commit)),
> +				     oid_to_hex(get_commit_tree_oid(odb_commit)));

Looks good to me, nicely detailed error message.

> +
> +		if (graph_commit->date != odb_commit->date)
> +			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""),
> +				     oid_to_hex(&cur_oid),
> +				     graph_commit->date,
> +				     odb_commit->date);

Looks good to me, nicely detailed error message.
It is good to have the same format of the message.

> +
> +
> +		graph_parents = graph_commit->parents;
> +		odb_parents = odb_commit->parents;
> +
> +		while (graph_parents) {
> +			num_parents++;
> +
> +			if (odb_parents == NULL)
> +				graph_report(_("commit-graph parent list for commit %s is too long (%d)"),
> +					     oid_to_hex(&cur_oid),
> +					     num_parents);
> +
> +			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
> +				graph_report(_("commit-graph parent for %s is %s != %s"),
> +					     oid_to_hex(&cur_oid),
> +					     oid_to_hex(&graph_parents->item->object.oid),
> +					     oid_to_hex(&odb_parents->item->object.oid));
> +
> +			graph_parents = graph_parents->next;
> +			odb_parents = odb_parents->next;
> +		}
> +
> +		if (odb_parents != NULL)
> +			graph_report(_("commit-graph parent list for commit %s terminates early"),
> +				     oid_to_hex(&cur_oid));

All right.  We check that there are no parents in odb that are not
present in the commit-graph, and vice-versa (that there are no parents
in commit-graph that are not present in odb), and we check that all
parents match.

Looks good to me, nicely detailed error message.
It is good to have the same format of the message.

> +
> +		if (graph_commit->generation) {

All right, here we know that the generation number for the commit is not
GENERATION_NUMBER_ZERO because we checked for it, and we know that it is
not GENERATION_NUMBER_INFINITY because the commit was in commit-graph.

> +			uint32_t max_generation = 0;
> +			graph_parents = graph_commit->parents;
> +
> +			while (graph_parents) {
> +				if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
> +				    graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
> +					graph_report(_("commit-graph has valid generation for %s but not its parent, %s"),

What about the case of GENERATION_NUMBER_MAX?  It is an error if parent
has GENERATION_NUMBER_MAX and the commit does not, but it would be
caught by the following check, if with the less helpful error message.

But if commit has GENERATION_NUMBER_MAX, then its parents can have also
GENERATION_NUMBER_MAX, and the following check would fail event if it is
valid, isn't it?

> +						     oid_to_hex(&cur_oid),
> +						     oid_to_hex(&graph_parents->item->object.oid));
> +				if (graph_parents->item->generation > max_generation)
> +					max_generation = graph_parents->item->generation;
> +				graph_parents = graph_parents->next;
> +			}
> +
> +			if (graph_commit->generation != max_generation + 1)
> +				graph_report(_("commit-graph has incorrect generation for %s"),
> +					     oid_to_hex(&cur_oid));

I wonder if we might have to treat the case of parent-less commits (root
commits) special, but I guess that would complicate the code for no
reason.

Though perhaps adding " (root commit)" suffix would be a good idea.
Still complicates code, though.

> +		} else {
> +			graph_parents = graph_commit->parents;
> +
> +			while (graph_parents) {
> +				if (graph_parents->item->generation)
> +					graph_report(_("commit-graph has generation ZERO for %s but not its parent, %s"),

Good check.

> +						     oid_to_hex(&cur_oid),
> +						     oid_to_hex(&graph_parents->item->object.oid));
> +				graph_parents = graph_parents->next;
> +			}
> +		}
> +	}
> +
>  	return check_commit_graph_error;
>  }

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 09/12] fsck: check commit-graph
  2018-04-17 18:10 ` [RFC PATCH 09/12] fsck: check commit-graph Derrick Stolee
@ 2018-04-20 16:59   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 16:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> If a commit-graph file exists, check its contents during 'git fsck'.

Is it "if a commit-graph file exists", or is it core.commitGraph feature
is turned on?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/fsck.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index ef78c6c00c..9712f230ba 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -16,6 +16,7 @@
>  #include "streaming.h"
>  #include "decorate.h"
>  #include "packfile.h"
> +#include "run-command.h"

Couln't this be done internally, without run-command?  Or is it just
preliminary implementation?

>  
>  #define REACHABLE 0x0001
>  #define SEEN      0x0002
> @@ -45,6 +46,7 @@ static int name_objects;
>  #define ERROR_REACHABLE 02
>  #define ERROR_PACK 04
>  #define ERROR_REFS 010
> +#define ERROR_COMMIT_GRAPH 020

I see that these error status codes are not documented anywhere.  Still,
I would expect at least mentioning commit-graph in the git-fsck manpage.

>  
>  static const char *describe_object(struct object *obj)
>  {
> @@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>  	}
>  
>  	check_connectivity();
> +
> +	if (core_commit_graph) {
> +		struct child_process commit_graph_check = CHILD_PROCESS_INIT;
> +		const char *check_argv[] = { "commit-graph", "check", NULL, NULL };
> +		commit_graph_check.argv = check_argv;
> +		commit_graph_check.git_cmd = 1;
> +
> +		if (run_command(&commit_graph_check))
> +			errors_found |= ERROR_COMMIT_GRAPH;
> +	}
> +
>  	return errors_found;
>  }

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 10/12] commit-graph: add '--reachable' option
  2018-04-17 18:10 ` [RFC PATCH 10/12] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-04-20 17:17   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 17:17 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> When writing commit-graph files, it can be convenient to ask for all
> reachable commits (starting at the ref set) in the resulting file. This
> is particularly helpful when writing to stdin is complicated, such as a
> future integration with 'git gc' which will call
> 'git commit-graph write --reachable' after performing cleanup of the
> object database.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

For what it is worth, it looks good to me.

> ---
>  Documentation/git-commit-graph.txt |  8 ++++--
>  builtin/commit-graph.c             | 41 +++++++++++++++++++++++++++---
>  t/t5318-commit-graph.sh            | 10 ++++++++
>  3 files changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 93c7841ba2..1b14d40590 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in packfiles.
>  +
>  With the `--stdin-packs` option, generate the new commit graph by
>  walking objects only in the specified pack-indexes. (Cannot be combined
> -with --stdin-commits.)
> +with --stdin-commits or --reachable.)
>  +
>  With the `--stdin-commits` option, generate the new commit graph by
>  walking commits starting at the commits specified in stdin as a list
>  of OIDs in hex, one OID per line. (Cannot be combined with
> ---stdin-packs.)
> +--stdin-packs or --reachable.)
> ++
> +With the `--reachable` option, generate the new commit graph by walking
> +commits starting at all refs. (Cannot be combined with --stdin-commits
> +or --stind-packs.)

I wonder if this "cannot be combined" is sustainable... ;-)


[...]
> @@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
>  	return 0;
>  }
>  
> +struct hex_list {
> +	char **hex_strs;
> +	int hex_nr;
> +	int hex_alloc;
> +};
> +
> +static int add_ref_to_list(const char *refname,
> +			   const struct object_id *oid,
> +			   int flags, void *cb_data)
> +{
> +	struct hex_list *list = (struct hex_list*)cb_data;
> +
> +	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
> +	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
> +	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
> +	list->hex_nr++;
> +	return 0;
> +}
> +
>  static int graph_write(int argc, const char **argv)
>  {
>  	const char **pack_indexes = NULL;
> @@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
>  		OPT_STRING(0, "object-dir", &opts.obj_dir,
>  			N_("dir"),
>  			N_("The object directory to store the graph")),
> +		OPT_BOOL(0, "reachable", &opts.reachable,
> +			N_("start walk at all refs")),
>  		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
>  			N_("scan pack-indexes listed by stdin for commits")),
>  		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
> @@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
>  			     builtin_commit_graph_write_options,
>  			     builtin_commit_graph_write_usage, 0);
>  
> -	if (opts.stdin_packs && opts.stdin_commits)
> -		die(_("cannot use both --stdin-commits and --stdin-packs"));
> +	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
> +		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));

Nice trick, but is it worth it (in place of boolean expression)?  Though
it lines up with the error message, though...

>  	if (!opts.obj_dir)
>  		opts.obj_dir = get_object_directory();
>  
> @@ -164,6 +187,16 @@ static int graph_write(int argc, const char **argv)
>  			commit_hex = lines;
>  			commits_nr = lines_nr;
>  		}
> +	} else if (opts.reachable) {
> +		struct hex_list list;
> +		list.hex_nr = 0;
> +		list.hex_alloc = 128;
> +		ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
> +
> +		for_each_ref(add_ref_to_list, &list);
> +
> +		commit_hex = (const char **)list.hex_strs;
> +		commits_nr = list.hex_nr;

Nice trick!

>  	}
>  
>  	write_commit_graph(opts.obj_dir,
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index e91053271a..ccadd22f57 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -200,6 +200,16 @@ test_expect_success 'build graph from commits with append' '
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
>  graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
>  
> +test_expect_success 'build graph using --reachable' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit-graph write --reachable &&
> +	test_path_is_file $objdir/info/commit-graph &&
> +	graph_read_expect "11" "large_edges"
> +'
> +
> +graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> +graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
> +
>  test_expect_success 'setup bare repo' '
>  	cd "$TRASH_DIRECTORY" &&
>  	git clone --bare --no-local full bare &&

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 11/12] gc: automatically write commit-graph files
  2018-04-17 18:10 ` [RFC PATCH 11/12] gc: automatically write commit-graph files Derrick Stolee
@ 2018-04-20 17:34   ` Jakub Narebski
  2018-04-20 18:33     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 17:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph file is a very helpful feature for speeding up git
> operations. In order to make it more useful, write the commit-graph file
> by default during standard garbage collection operations.
>
> Add a 'gc.commitGraph' config setting that triggers writing a
> commit-graph file after any non-trivial 'git gc' command.

Other than the question if 'gc.commitGraph' and 'core.commitGraph'
should be independent config variables, and the exact wording of the
git-gc docs, it looks good to me.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-gc.txt | 4 ++++
>  builtin/gc.c             | 8 ++++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> index 571b5a7e3c..17dd654a59 100644
> --- a/Documentation/git-gc.txt
> +++ b/Documentation/git-gc.txt
> @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
>  it within all non-bare repos or it can be set to a boolean value.
>  This defaults to true.
>  
> +The optional configuration variable 'gc.commitGraph' determines if
> +'git gc' runs 'git commit-graph write'. This can be set to a boolean
> +value. This defaults to false.
> +
>  The optional configuration variable `gc.aggressiveWindow` controls how
>  much time is spent optimizing the delta compression of the objects in
>  the repository when the --aggressive option is specified.  The larger

[...]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 11/12] gc: automatically write commit-graph files
  2018-04-20 17:34   ` Jakub Narebski
@ 2018-04-20 18:33     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 149+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-04-20 18:33 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Derrick Stolee, git, Jeff King, Stefan Beller


On Fri, Apr 20 2018, Jakub Narebski wrote:

> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The commit-graph file is a very helpful feature for speeding up git
>> operations. In order to make it more useful, write the commit-graph file
>> by default during standard garbage collection operations.
>>
>> Add a 'gc.commitGraph' config setting that triggers writing a
>> commit-graph file after any non-trivial 'git gc' command.
>
> Other than the question if 'gc.commitGraph' and 'core.commitGraph'
> should be independent config variables, and the exact wording of the
> git-gc docs, it looks good to me.

Sans doc errors you pointed out in other places (you need to set
core.commitGraph so it's read at all), I think it's very useful to have
these split up. It's simliar to pack.useBitmaps & pack.writeBitmaps.

Makes it easy to start writing them, and then have a quick toggle to
turn it off if there's any issues rathen than go around deleting the
files or making sure they're not written out.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [RFC PATCH 12/12] commit-graph: update design document
  2018-04-17 18:10 ` [RFC PATCH 12/12] commit-graph: update design document Derrick Stolee
@ 2018-04-20 19:10   ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-04-20 19:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Stefan Beller

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph feature is now integrated with 'fsck' and 'gc',
> so remove those items from the "Future Work" section of the
> commit-graph design document.

See comments below, but this looks good to me.

What is missing from the "Future Work" section (and which was somewhat
implied by now removed "When this feature stabilizes enough to recommend
to most users") is safety against history [view] changing features:
git-replace, shallow clone and grafts file.  I wrote about this in
"Re: [PATCH v8 04/14] graph: add commit graph design document"
https://public-inbox.org/git/86vacsjdcg.fsf@gmail.com/

JN> The problem in my opinion lies in different direction, namely that
JN> commit grafts can change, changing the view of the history.  If we want
JN> commit-graph file to follow user-visible view of the history of the
JN> project, it needs to respect current version of commit grafts - but what
JN> if commit grafts changed since commit-graph file was generated?
JN> 
JN> Actually, there are currently three ways to affect the view of the
JN> history:
JN> 
JN> a. legacy commit grafts mechanism; it was first, but it is not safe,
JN>    cannot be transferred on fetch / push, and is now deprecated.
JN> 
JN> b. shallow clones, which are kind of specialized and limited grafts;
JN>    they used to limit available functionality, but restrictions are
JN>    being lifted (or perhaps even got lifted)
JN> 
JN> c. git-replace mechanism, where we can create an "overlay" of any
JN>    object, and is intended to be among others replacement for commit
JN>    grafts; safe, transferable, can be turned off with "git
JN>    --no-replace-objects <command>"
JN> 
JN> All those can change; some more likely than others.  The problem is if
JN> they change between writing commit-graph file (respecting old view of
JN> the history) and reading it (where we expect to see the new view).
JN> 
JN> a. grafts file can change: lines can be added, removed or changed
JN> 
JN> b. shallow clones can be deepened or shortened, or even make
JN>    not shallow
JN> 
JN> c. new replacements can be added, old removed, and existing edited
JN> 
JN> 
JN> There are, as far as I can see, two ways of handling the issue of Git
JN> features that can change the view of the project's history, namely:
JN> 
JN>  * Disable commit-graph reading when any of this features are used, and
JN>    always write full graph info.
JN> 
JN>    This may not matter much for shallow clones, where commit count
JN>    should be small anyway, but when using git-replace to stitch together
JN>    current repository with historical one, commit-graph would be
JN>    certainly useful.  Also, git-replace does not need to change history.
JN> 
JN>    On the other hand I think it is the easier solution.
JN> 
JN> Or
JN> 
JN>  * Detect somehow that the view of the history changed, and invalidate
JN>    commit-graph (perhaps even automatically regenerate it).
JN> 
JN>    For shallow clone changes I think one can still use the old
JN>    commit-graph file to generate the new one.  For other cases, the
JN>    metadata is simple to modify, but indices such as generation number
JN>    would need to be at least partially calculated anew.

Note that in all cases one can simply discard generation number
information (treating it as if it was ZERO), and use commit-graph as
values before applying history-changing feature: replacements, grafts or
shallow.

Well, at least for shallow you can do that for generation numbers: using
grafts (deprecated) or replacements to join repository with historical
one would mean that we are no longer have commit-graph transitively
closed under reachability.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 9 ---------
>  1 file changed, 9 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index d9f2713efa..d04657b781 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -118,9 +118,6 @@ Future Work
>  - The commit graph feature currently does not honor commit grafts. This can
>    be remedied by duplicating or refactoring the current graft logic.
>  
> -- The 'commit-graph' subcommand does not have a "verify" mode that is
> -  necessary for integration with fsck.
> -

All right (though "verify" mode is actually done via "check" command).

>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
> @@ -142,12 +139,6 @@ Future Work
>    such as "ensure_tree_loaded(commit)" that fully loads a tree before
>    using commit->tree.
>  
> -- The current design uses the 'commit-graph' subcommand to generate the graph.
> -  When this feature stabilizes enough to recommend to most users, we should
> -  add automatic graph writes to common operations that create many commits.
> -  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
> -  commands.
> -

I'm not sure if this paragraph should be deleted as a whole; it depends
on whether we decide that using git-gc to do automatic graph writes is
enough (git-gc should be ran automatically by git if we get many new
objects anyway, so this gives us almost "compute a graph on 'clone',
'fetch', or 'repack'").

>  - A server could provide a commit graph file as part of the network protocol
>    to avoid extra calculations by clients. This feature is only of benefit if
>    the user is willing to trust the file, because verifying the file is correct

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                   ` (12 preceding siblings ...)
  2018-04-17 18:50 ` [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
@ 2018-05-10 17:34 ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
                     ` (14 more replies)
  13 siblings, 15 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Thanks, Jakub, for the feedback on the RFC. I think there are still some
things to decide at a high-level before we dig too far into the review.
Specifically, the integration points in 'fsck', 'gc', and 'fetch' are
worth considering our alternatives.

For 'fsck', the current integration is minimal: if core.commitGraph is
true, then 'git fsck' calls 'git commit-graph verify --object-dir=X'
for the objects directory and every alternate. There are a few options
to consider here:

1. Keep this behavior: we should always check the commit-graph if it
   exists.

2. Add a --[no-]commit-graph argument to 'fsck' that toggles the
   commit-graph verification.

3. Remove all direct integration between 'fsck' and 'commit-graph' and
   instead rely on users checking 'git commit-graph verify' manually
   when they suspect a problem with the commit-graph file. While this
   option is worth considering, it is my least favorite since it requires
   more from users.

For 'gc' and 'fetch', these seemed like natural places to update the
commit-graph file. Relative to the other maintenance that occurs during
these commands, the 'git commit-graph write' command is fast, especially
for incremental updates; only the "new" commits are walked when computing
generation numbers and other metadata for the commit-graph file.

The behavior in this patch series does the following:

1. Near the end of 'git gc', run 'git commit-graph write'. The location
   of this code assumes that a 'git gc --auto' has not terminated early
   due to not meeting the auto threshold.

2. At the end of 'git fetch', run 'git commit-graph write'. This means
   that every reachable commit will be in the commit-graph after a
   a successful fetch, which seems a reasonable frequency. Then, the
   only times we would be missing a reachable commit is after creating
   one locally. There is a problem with the current patch, though: every
   'git fetch' call runs 'git commit-graph write', even if there were no
   ref updates or objects downloaded. Is there a simple way to detect if
   the fetch was non-trivial?

One obvious problem with this approach: if we compute this during 'gc'
AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers
two commit-graph writes. If I were to abandon one of these patches, it
would be the 'fetch' integration. A 'git gc' really wants to delete all
references to unreachable commits, and without updating the commit-graph
we may still have commit data in the commit-graph file that is not in
the object database. In fact, deleting commits from the object database
but not from the commit-graph will cause 'git commit-graph verify' to
fail!

I welcome discussion on these ideas, as we are venturing out of the
"pure data structure" world and into the "user experience" world. I am
less confident in my skills in this world, but the feature is worthless
if it does not improve the user experience.

Thanks,
-Stolee

Derrick Stolee (12):

Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
are ready for full, rigorous review.

  commit-graph: add 'verify' subcommand
  commit-graph: verify file header information
  commit-graph: parse commit from chosen graph
  commit-graph: verify fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb

Commit 08 integrates 'git commit-graph verify' into fsck. The work here
is the minimum integration possible. (See above discussion of options.)

  fsck: verify commit-graph

Commit 09 introduces a new '--reachable' option only to make the calls
from 'gc' and 'fetch' simpler. Commits 10-11 integrate writing the
commit-graph into 'gc' and 'fetch', respectively. (See above disucssion.)

  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  fetch: compute commit-graph by default

Commit 12 simply deletes sections from the "Future Work" section
of the commit-graph design document.

  commit-graph: update design document

 Documentation/config.txt                 |  10 ++
 Documentation/git-commit-graph.txt       |  14 ++-
 Documentation/git-fsck.txt               |   3 +
 Documentation/git-gc.txt                 |   4 +
 Documentation/technical/commit-graph.txt |  22 ----
 builtin/commit-graph.c                   |  79 +++++++++++++-
 builtin/fetch.c                          |  13 +++
 builtin/fsck.c                           |  21 ++++
 builtin/gc.c                             |   8 ++
 commit-graph.c                           | 175 ++++++++++++++++++++++++++++++-
 commit-graph.h                           |   2 +
 commit.c                                 |  13 ++-
 commit.h                                 |   1 +
 t/t5318-commit-graph.sh                  |  25 +++++
 14 files changed, 353 insertions(+), 37 deletions(-)


base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 01/12] commit-graph: add 'verify' subcommand
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 18:15     ` Martin Ågren
  2018-05-10 17:34   ` [PATCH 02/12] commit-graph: verify file header information Derrick Stolee
                     ` (13 subsequent siblings)
  14 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

In case the commit-graph file becomes corrupt, we need a way to
verify its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  6 ++++++
 builtin/commit-graph.c             | 38 ++++++++++++++++++++++++++++++++++++++
 commit-graph.c                     |  5 +++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            | 10 ++++++++++
 5 files changed, 61 insertions(+)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 4c97b555cc..1daefa7fb1 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git commit-graph read' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to verify for corrupted data.
+
 
 EXAMPLES
 --------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..f5d891f2b8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,11 +7,17 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+	N_("git commit-graph verify [--object-dir <objdir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
 	int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+	struct commit_graph *graph = 0;
+	char *graph_name;
+
+	static struct option builtin_commit_graph_verify_options[] = {
+		OPT_STRING(0, "object-dir", &opts.obj_dir,
+			   N_("dir"),
+			   N_("The object directory to store the graph")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_verify_options,
+			     builtin_commit_graph_verify_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	graph_name = get_commit_graph_filename(opts.obj_dir);
+	graph = load_commit_graph_one(graph_name);
+
+	if (!graph)
+		return 0;
+	FREE_AND_NULL(graph_name);
+
+	return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "verify"))
+			return graph_verify(argc, argv);
 		if (!strcmp(argv[0], "read"))
 			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..b25aaed128 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -817,3 +817,8 @@ void write_commit_graph(const char *obj_dir,
 	oids.alloc = 0;
 	oids.nr = 0;
 }
+
+int verify_commit_graph(struct commit_graph *g)
+{
+	return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..71a39c5a57 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
 			int nr_commits,
 			int append);
 
+int verify_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..6ca451dfd2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
 	objdir=".git/objects"
 '
 
+test_expect_success 'verify graph with no graph file' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph verify
+'
+
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
@@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+test_expect_success 'git commit-graph verify' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph verify >output
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 02/12] commit-graph: verify file header information
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
  2018-05-10 17:34   ` [PATCH 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 18:21     ` Martin Ågren
  2018-05-10 17:34   ` [PATCH 03/12] commit-graph: parse commit from chosen graph Derrick Stolee
                     ` (12 subsequent siblings)
  14 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

During a run of 'git commit-graph verify', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'. Some header
information is checked as part of load_commit_graph_one().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b25aaed128..c3b8716c14 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -818,7 +818,28 @@ void write_commit_graph(const char *obj_dir,
 	oids.nr = 0;
 }
 
+static int verify_commit_graph_error;
+#define graph_report(...) \
+	do {\
+		verify_commit_graph_error = 1;\
+		printf(__VA_ARGS__);\
+	} while (0);
+
 int verify_commit_graph(struct commit_graph *g)
 {
-	return !g;
+	if (!g) {
+		graph_report(_("no commit-graph file loaded"));
+		return 1;
+	}
+
+	verify_commit_graph_error = 0;
+
+	if (!g->chunk_oid_fanout)
+		graph_report(_("commit-graph is missing the OID Fanout chunk"));
+	if (!g->chunk_oid_lookup)
+		graph_report(_("commit-graph is missing the OID Lookup chunk"));
+	if (!g->chunk_commit_data)
+		graph_report(_("commit-graph is missing the Commit Data chunk"));
+
+	return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 03/12] commit-graph: parse commit from chosen graph
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
  2018-05-10 17:34   ` [PATCH 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
  2018-05-10 17:34   ` [PATCH 02/12] commit-graph: verify file header information Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 04/12] commit-graph: verify fanout and lookup table Derrick Stolee
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c3b8716c14..ce11af1d20 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
 	uint32_t pos;
 
@@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item)
 		return 0;
 	if (item->object.parsed)
 		return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+		return fill_commit_in_graph(item, g, pos);
+
+	return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commit_graph)
+		return 0;
+
 	prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-		return fill_commit_in_graph(item, commit_graph, pos);
+	if (commit_graph)
+		return parse_commit_in_graph_one(commit_graph, item);
 	return 0;
 }
 
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 04/12] commit-graph: verify fanout and lookup table
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (2 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 03/12] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 18:29     ` Martin Ågren
  2018-05-10 17:34   ` [PATCH 05/12] commit: force commit to parse from object database Derrick Stolee
                     ` (10 subsequent siblings)
  14 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

While running 'git commit-graph verify', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index ce11af1d20..b4c146c423 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -839,6 +839,9 @@ static int verify_commit_graph_error;
 
 int verify_commit_graph(struct commit_graph *g)
 {
+	uint32_t i, cur_fanout_pos = 0;
+	struct object_id prev_oid, cur_oid;
+
 	if (!g) {
 		graph_report(_("no commit-graph file loaded"));
 		return 1;
@@ -853,5 +856,35 @@ int verify_commit_graph(struct commit_graph *g)
 	if (!g->chunk_commit_data)
 		graph_report(_("commit-graph is missing the Commit Data chunk"));
 
+	for (i = 0; i < g->num_commits; i++) {
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report(_("commit-graph has incorrect oid order: %s then %s"),
+
+		oid_to_hex(&prev_oid),
+		oid_to_hex(&cur_oid));
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+	}
+
+	while (cur_fanout_pos < 256) {
+		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+		if (g->num_commits != fanout_value)
+			graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
+				     cur_fanout_pos, fanout_value, i);
+
+		cur_fanout_pos++;
+	}
+
 	return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 05/12] commit: force commit to parse from object database
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (3 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 04/12] commit-graph: verify fanout and lookup table Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 06/12] commit-graph: load a root tree from specific graph Derrick Stolee
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Most consumers expect that if a commit exists in the commit-graph, then
the commit was loaded from the commit-graph so values such as generation
numbers are loaded. Hence, this method should not be called unless the
intention is explicit in avoiding commits from the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 13 +++++++++----
 commit.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..7c92350373 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
 	enum object_type type;
 	void *buffer;
@@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (parse_commit_in_graph(item))
+	if (use_commit_graph && parse_commit_in_graph(item))
 		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
 			error("Could not read %s",
-			     oid_to_hex(&item->object.oid));
+					oid_to_hex(&item->object.oid));
 	if (type != OBJ_COMMIT) {
 		free(buffer);
 		return error("Object %s not a commit",
-			     oid_to_hex(&item->object.oid));
+				oid_to_hex(&item->object.oid));
 	}
 	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 	return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+	return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
 	if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 06/12] commit-graph: load a root tree from specific graph
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (4 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 05/12] commit: force commit to parse from object database Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 07/12] commit-graph: verify commit contents against odb Derrick Stolee
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b4c146c423..8c636abba9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -357,14 +357,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *
 	return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+						 const struct commit *c)
 {
 	if (c->maybe_tree)
 		return c->maybe_tree;
 	if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
 		BUG("get_commit_tree_in_graph called from non-commit-graph commit");
 
-	return load_tree_for_commit(commit_graph, (struct commit *)c);
+	return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+	return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 07/12] commit-graph: verify commit contents against odb
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (5 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 06/12] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 08/12] fsck: verify commit-graph Derrick Stolee
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

When running 'git commit-graph verify', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the
object IDs to guarantee we parse from the commit-graph file.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 8c636abba9..24f5031f3e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -863,6 +863,8 @@ int verify_commit_graph(struct commit_graph *g)
 		graph_report(_("commit-graph is missing the Commit Data chunk"));
 
 	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit;
+
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
 		if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -880,6 +882,10 @@ int verify_commit_graph(struct commit_graph *g)
 
 			cur_fanout_pos++;
 		}
+
+		graph_commit = lookup_commit(&cur_oid);
+		if (!parse_commit_in_graph_one(g, graph_commit))
+			graph_report(_("failed to parse %s from commit-graph"), oid_to_hex(&cur_oid));
 	}
 
 	while (cur_fanout_pos < 256) {
@@ -892,5 +898,87 @@ int verify_commit_graph(struct commit_graph *g)
 		cur_fanout_pos++;
 	}
 
+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit, *odb_commit;
+		struct commit_list *graph_parents, *odb_parents;
+		int num_parents = 0;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		graph_commit = lookup_commit(&cur_oid);
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0))
+			graph_report(_("failed to parse %s from object database"), oid_to_hex(&cur_oid));
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report(_("root tree object ID for commit %s in commit-graph is %s != %s"),
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime""),
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
+
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			num_parents++;
+
+			if (odb_parents == NULL)
+				graph_report(_("commit-graph parent list for commit %s is too long (%d)"),
+					     oid_to_hex(&cur_oid),
+					     num_parents);
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report(_("commit-graph parent for %s is %s != %s"),
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+					     oid_to_hex(&odb_parents->item->object.oid));
+
+			graph_parents = graph_parents->next;
+			odb_parents = odb_parents->next;
+		}
+
+		if (odb_parents != NULL)
+			graph_report(_("commit-graph parent list for commit %s terminates early"),
+				     oid_to_hex(&cur_oid));
+
+		if (graph_commit->generation) {
+			uint32_t max_generation = 0;
+			graph_parents = graph_commit->parents;
+
+			while (graph_parents) {
+				if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
+				    graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
+					graph_report(_("commit-graph has valid generation for %s but not its parent, %s"),
+						     oid_to_hex(&cur_oid),
+						     oid_to_hex(&graph_parents->item->object.oid));
+				if (graph_parents->item->generation > max_generation)
+					max_generation = graph_parents->item->generation;
+				graph_parents = graph_parents->next;
+			}
+
+			if (graph_commit->generation != max_generation + 1)
+				graph_report(_("commit-graph has incorrect generation for %s"),
+					     oid_to_hex(&cur_oid));
+		} else {
+			graph_parents = graph_commit->parents;
+
+			while (graph_parents) {
+				if (graph_parents->item->generation)
+					graph_report(_("commit-graph has generation ZERO for %s but not its parent, %s"),
+						     oid_to_hex(&cur_oid),
+						     oid_to_hex(&graph_parents->item->object.oid));
+				graph_parents = graph_parents->next;
+			}
+		}
+	}
+
 	return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 08/12] fsck: verify commit-graph
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (6 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 07/12] commit-graph: verify commit contents against odb Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 09/12] commit-graph: add '--reachable' option Derrick Stolee
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

If core.commitGraph is true, verify the contents of the commit-graph
during 'git fsck' using the 'git commit-graph verify' subcommand. Run
this check on all alternates, as well.

We use a new process for two reasons:

1. The subcommand decouples the details of loading and verifying a
   commit-graph file from the other fsck details.

2. The commit-graph verification requires the commits to be loaded
   in a specific order to guarantee we parse from the commit-graph
   file for some objects and from the object database for others.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-fsck.txt |  3 +++
 builtin/fsck.c             | 21 +++++++++++++++++++++
 t/t5318-commit-graph.sh    |  5 +++++
 3 files changed, 29 insertions(+)

diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
index b9f060e3b2..ab9a93fb9b 100644
--- a/Documentation/git-fsck.txt
+++ b/Documentation/git-fsck.txt
@@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives
 (i.e., you can just remove them and do an 'rsync' with some other site in
 the hopes that somebody else has the object you have corrupted).
 
+If core.commitGraph is true, the commit-graph file will also be inspected
+using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
+
 Extracted Diagnostics
 ---------------------
 
diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..a6d5045b77 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	}
 
 	check_connectivity();
+
+	if (core_commit_graph) {
+		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
+		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL };
+		commit_graph_verify.argv = verify_argv;
+		commit_graph_verify.git_cmd = 1;
+
+		if (run_command(&commit_graph_verify))
+			errors_found |= ERROR_COMMIT_GRAPH;
+
+		prepare_alt_odb();
+		for (alt = alt_odb_list; alt; alt = alt->next) {
+			verify_argv[2] = "--object-dir";
+			verify_argv[3] = alt->path;
+			if (run_command(&commit_graph_verify))
+				errors_found |= ERROR_COMMIT_GRAPH;
+		}
+	}
+
 	return errors_found;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..9680e6e6e0 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -240,4 +240,9 @@ test_expect_success 'git commit-graph verify' '
 	git commit-graph verify >output
 '
 
+test_expect_success 'git fsck (checks commit-graph)' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git fsck
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 09/12] commit-graph: add '--reachable' option
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (7 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 08/12] fsck: verify commit-graph Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 10/12] gc: automatically write commit-graph files Derrick Stolee
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 ++++++--
 builtin/commit-graph.c             | 41 ++++++++++++++++++++++++++++++++++----
 t/t5318-commit-graph.sh            | 10 ++++++++++
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 1daefa7fb1..fbab3feba1 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f5d891f2b8..4c9328cdf2 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *obj_dir;
+	int reachable;
 	int stdin_packs;
 	int stdin_commits;
 	int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
 	return 0;
 }
 
+struct hex_list {
+	char **hex_strs;
+	int hex_nr;
+	int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+			   const struct object_id *oid,
+			   int flags, void *cb_data)
+{
+	struct hex_list *list = (struct hex_list*)cb_data;
+
+	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+	list->hex_nr++;
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
 			N_("dir"),
 			N_("The object directory to store the graph")),
+		OPT_BOOL(0, "reachable", &opts.reachable,
+			N_("start walk at all refs")),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			N_("scan pack-indexes listed by stdin for commits")),
 		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
 
-	if (opts.stdin_packs && opts.stdin_commits)
-		die(_("cannot use both --stdin-commits and --stdin-packs"));
+	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
 
@@ -164,6 +187,16 @@ static int graph_write(int argc, const char **argv)
 			commit_hex = lines;
 			commits_nr = lines_nr;
 		}
+	} else if (opts.reachable) {
+		struct hex_list list;
+		list.hex_nr = 0;
+		list.hex_alloc = 128;
+		ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
+
+		for_each_ref(add_ref_to_list, &list);
+
+		commit_hex = (const char **)list.hex_strs;
+		commits_nr = list.hex_nr;
 	}
 
 	write_commit_graph(opts.obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 9680e6e6e0..82f95eb11f 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' '
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
 
+test_expect_success 'build graph using --reachable' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph write --reachable &&
+	test_path_is_file $objdir/info/commit-graph &&
+	graph_read_expect "11" "large_edges"
+'
+
+graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
+graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd "$TRASH_DIRECTORY" &&
 	git clone --bare --no-local full bare &&
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 10/12] gc: automatically write commit-graph files
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (8 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 09/12] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 11/12] fetch: compute commit-graph by default Derrick Stolee
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to turn this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt | 6 ++++++
 Documentation/git-gc.txt | 4 ++++
 builtin/gc.c             | 8 ++++++++
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
 	if the system supports it. Default is true.
 
+gc.commitGraph::
+	If true, then gc will rewrite the commit-graph file after any
+	change to the object database. If '--auto' is used, then the
+	commit-graph will not be updated unless the threshold is met.
+	See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
 	If the file gc.log exists, then `git gc --auto` won't run
 	unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..8403445738 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
 	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	argv_array_pushl(&prune, "prune", "--expire", NULL);
 	argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", NULL);
 	argv_array_pushl(&rerere, "rerere", "gc", NULL);
+	argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL);
 
 	/* default expiry time, overwritten in gc_config */
 	gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();
 
+	if (gc_commit_graph)
+		if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, commit_graph.argv[0]);
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 11/12] fetch: compute commit-graph by default
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (9 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 10/12] gc: automatically write commit-graph files Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 17:34   ` [PATCH 12/12] commit-graph: update design document Derrick Stolee
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

During a call to 'git fetch', we expect new commits and updated refs.
Use these updated refs to add the new commits to the commit-graph file,
automatically providing performance benefits in other calls.

Use 'fetch.commitGraph' config setting to enable or disable this
behavior. Defaults to false while the commit-graph feature matures.
Specifically, we do not want this on by default until the commit-graph
feature integrates with history-modifying features such as shallow
clones.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt |  4 ++++
 builtin/fetch.c          | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 9a3abd87e7..3d8225600a 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1409,6 +1409,10 @@ fetch.output::
 	`full` and `compact`. Default value is `full`. See section
 	OUTPUT in linkgit:git-fetch[1] for detail.
 
+fetch.commitGraph::
+	If true, fetch will automatically update the commit-graph file.
+	See linkgit:git-commit-graph[1].
+
 format.attach::
 	Enable multipart/mixed attachments as the default for
 	'format-patch'.  The value can also be a double quoted string
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 8ee998ea2e..254f6ecfb6 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -38,6 +38,7 @@ enum {
 static int fetch_prune_config = -1; /* unspecified */
 static int prune = -1; /* unspecified */
 #define PRUNE_BY_DEFAULT 0 /* do we prune by default? */
+static int fetch_commit_graph = 0;
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok, verbosity, deepen_relative;
 static int progress = -1;
@@ -66,6 +67,11 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.commitGraph")) {
+		fetch_commit_graph = git_config_bool(k, v);
+		return 0;
+	}
+
 	if (!strcmp(k, "submodule.recurse")) {
 		int r = git_config_bool(k, v) ?
 			RECURSE_SUBMODULES_ON : RECURSE_SUBMODULES_OFF;
@@ -1462,6 +1468,13 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 		result = fetch_multiple(&list);
 	}
 
+	if (!result && fetch_commit_graph) {
+		struct argv_array commit_graph = ARGV_ARRAY_INIT;
+		argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL);
+		if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+			result = 1;
+	}
+
 	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
 		struct argv_array options = ARGV_ARRAY_INIT;
 
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH 12/12] commit-graph: update design document
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (10 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 11/12] fetch: compute commit-graph by default Derrick Stolee
@ 2018-05-10 17:34   ` Derrick Stolee
  2018-05-10 19:05   ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Martin Ågren
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-10 17:34 UTC (permalink / raw)
  To: git; +Cc: peff, sbeller, jnareb, stolee, Derrick Stolee

The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Also remove the section on lazy-loading trees, as that was completed
in an earlier patch series.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 22 ----------------------
 1 file changed, 22 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index e1a883eb46..c664acbd76 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -130,25 +127,6 @@ Future Work
     - 'log --topo-order'
     - 'tag --merged'
 
-- Currently, parse_commit_gently() requires filling in the root tree
-  object for a commit. This passes through lookup_tree() and consequently
-  lookup_object(). Also, it calls lookup_commit() when loading the parents.
-  These method calls check the ODB for object existence, even if the
-  consumer does not need the content. For example, we do not need the
-  tree contents when computing merge bases. Now that commit parsing is
-  removed from the computation time, these lookup operations are the
-  slowest operations keeping graph walks from being fast. Consider
-  loading these objects without verifying their existence in the ODB and
-  only loading them fully when consumers need them. Consider a method
-  such as "ensure_tree_loaded(commit)" that fully loads a tree before
-  using commit->tree.
-
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 01/12] commit-graph: add 'verify' subcommand
  2018-05-10 17:34   ` [PATCH 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
@ 2018-05-10 18:15     ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-10 18:15 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, sbeller, jnareb, stolee

On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:
> In case the commit-graph file becomes corrupt, we need a way to
> verify its contents match the object database. In the manner of

s/verify its/verify that its/ might read better.

> 'git fsck' we will implement a 'git commit-graph verify' subcommand
> to report all issues with the file.
>
> Add the 'verify' subcommand to the 'commit-graph' builtin and its
> documentation. The subcommand is currently a no-op except for
> loading the commit-graph into memory, which may trigger run-time
> errors that would be caught by normal use. Add a simple test that
> ensures the command returns a zero error code.
>
> If no commit-graph file exists, this is an acceptable state. Do
> not report any errors.

This all makes sense to me.

> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt

> +'verify'::
> +
> +Read the commit-graph file and verify its contents against the object
> +database. Used to verify for corrupted data.

s/verify for/check for/?

> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -7,11 +7,17 @@
>
>  static char const * const builtin_commit_graph_usage[] = {
>         N_("git commit-graph [--object-dir <objdir>]"),
> +       N_("git commit-graph verify [--object-dir <objdir>]"),
>         N_("git commit-graph read [--object-dir <objdir>]"),
>         N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),

Minor nit: In the man-page, you added verify after read, which makes
more sense I think (r < v < w).

(I also note that the man-page synopsis doesn't give the no-subcommand
usage.)

> +static int graph_verify(int argc, const char **argv)
> +{
> +       struct commit_graph *graph = 0;
> +       char *graph_name;
> +
> +       static struct option builtin_commit_graph_verify_options[] = {
> +               OPT_STRING(0, "object-dir", &opts.obj_dir,
> +                          N_("dir"),
> +                          N_("The object directory to store the graph")),
> +               OPT_END(),
> +       };
> +
> +       argc = parse_options(argc, argv, NULL,
> +                            builtin_commit_graph_verify_options,
> +                            builtin_commit_graph_verify_usage, 0);
> +
> +       if (!opts.obj_dir)
> +               opts.obj_dir = get_object_directory();
> +
> +       graph_name = get_commit_graph_filename(opts.obj_dir);
> +       graph = load_commit_graph_one(graph_name);
> +
> +       if (!graph)
> +               return 0;
> +       FREE_AND_NULL(graph_name);

Maybe the FREE_AND_NULL could go immediately after the call to
`load_commit_graph_one()`. It makes it more obvious that you're done
with the name, and -- perhaps more importantly -- means that throwing a
leak-checker at this won't complain if we take the early return.

> +
> +       return verify_commit_graph(graph);

A leak-checker would still complain about leaking `graph`. I think it
would be ok to just UNLEAK it before calling `verify_commit_graph()`.
This is IMHO close enough to returning from `cmd_commit_graph()` to make
UNLEAK an acceptable, or even the correct, solution.

I realize that `graph_read()` is doing something similar to this patch
already, so what you have here is certainly the most consistent code.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 02/12] commit-graph: verify file header information
  2018-05-10 17:34   ` [PATCH 02/12] commit-graph: verify file header information Derrick Stolee
@ 2018-05-10 18:21     ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-10 18:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, sbeller, jnareb, stolee

On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:
> During a run of 'git commit-graph verify', list the issues with the
> header information in the commit-graph file. Some of this information
> is inferred from the loaded 'struct commit_graph'. Some header
> information is checked as part of load_commit_graph_one().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index b25aaed128..c3b8716c14 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -818,7 +818,28 @@ void write_commit_graph(const char *obj_dir,
>         oids.nr = 0;
>  }
>
> +static int verify_commit_graph_error;
> +#define graph_report(...) \
> +       do {\
> +               verify_commit_graph_error = 1;\
> +               printf(__VA_ARGS__);\
> +       } while (0);
> +

It seems to me that other users of __VA_ARGS__ are protected with a
check for HAVE_VARIADIC_MACROS and provide an alternative
non-__VA_ARGS__-implementation. Or maybe I've missed something in my
grepping and we are actually (slowly) moving towards assuming
__VA_ARGS__ is always available?

>  int verify_commit_graph(struct commit_graph *g)
>  {
> -       return !g;
> +       if (!g) {
> +               graph_report(_("no commit-graph file loaded"));
> +               return 1;
> +       }
> +
> +       verify_commit_graph_error = 0;
> +
> +       if (!g->chunk_oid_fanout)
> +               graph_report(_("commit-graph is missing the OID Fanout chunk"));
> +       if (!g->chunk_oid_lookup)
> +               graph_report(_("commit-graph is missing the OID Lookup chunk"));
> +       if (!g->chunk_commit_data)
> +               graph_report(_("commit-graph is missing the Commit Data chunk"));
> +
> +       return verify_commit_graph_error;

If you can't rely on __VA_ARGS__, maybe bite the bullet and introduce
braces... The expanded code wouldn't be too horrible, albeit a bit
repetitive.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 04/12] commit-graph: verify fanout and lookup table
  2018-05-10 17:34   ` [PATCH 04/12] commit-graph: verify fanout and lookup table Derrick Stolee
@ 2018-05-10 18:29     ` Martin Ågren
  2018-05-11 15:17       ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Martin Ågren @ 2018-05-10 18:29 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, sbeller, jnareb, stolee

On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:
> While running 'git commit-graph verify', verify that the object IDs
> are listed in lexicographic order and that the fanout table correctly
> navigates into that list of object IDs.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index ce11af1d20..b4c146c423 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -839,6 +839,9 @@ static int verify_commit_graph_error;
>
>  int verify_commit_graph(struct commit_graph *g)
>  {
> +       uint32_t i, cur_fanout_pos = 0;
> +       struct object_id prev_oid, cur_oid;
> +
>         if (!g) {
>                 graph_report(_("no commit-graph file loaded"));
>                 return 1;
> @@ -853,5 +856,35 @@ int verify_commit_graph(struct commit_graph *g)
>         if (!g->chunk_commit_data)
>                 graph_report(_("commit-graph is missing the Commit Data chunk"));
>
> +       for (i = 0; i < g->num_commits; i++) {
> +               hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> +
> +               if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
> +                       graph_report(_("commit-graph has incorrect oid order: %s then %s"),

Minor: I think our style would prefer s/i > 0/i/.

I suppose the second check should be s/>=/>/, but it's not like it
should matter. ;-)

I wonder if this is a message that would virtually never make sense to a
user, but more to a developer. Leave it untranslated to make sure any
bug reports to the list are readable to us?

> +
> +               oid_to_hex(&prev_oid),
> +               oid_to_hex(&cur_oid));

Hmm, these two lines do not actually achieve anything?

> +               oidcpy(&prev_oid, &cur_oid);
> +
> +               while (cur_oid.hash[0] > cur_fanout_pos) {
> +                       uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
> +                       if (i != fanout_value)
> +                               graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
> +                                            cur_fanout_pos, fanout_value, i);

Same though re `_()`, even more so because of the more technical
notation.

> +
> +                       cur_fanout_pos++;
> +               }
> +       }
> +
> +       while (cur_fanout_pos < 256) {
> +               uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
> +
> +               if (g->num_commits != fanout_value)
> +                       graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
> +                                    cur_fanout_pos, fanout_value, i);

Same here. Or maybe these should just give a translated user-readable
basic idea of what is wrong and skip the details?

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (11 preceding siblings ...)
  2018-05-10 17:34   ` [PATCH 12/12] commit-graph: update design document Derrick Stolee
@ 2018-05-10 19:05   ` Martin Ågren
  2018-05-10 19:22     ` Stefan Beller
  2018-05-10 19:17   ` Ævar Arnfjörð Bjarmason
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
  14 siblings, 1 reply; 149+ messages in thread
From: Martin Ågren @ 2018-05-10 19:05 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, sbeller, jnareb, stolee

On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:

> Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
> are ready for full, rigorous review.

I don't know about "full" and "rigorous", but I tried to offer my thoughts.

A lingering feeling I have is that users could possibly benefit from
seeing "the commit-graph has a bad foo" a bit more than just "the
commit-graph is bad". But adding "the bar is baz, should have been
frotz" might not bring that much. Maybe you could keep the translatable
string somewhat simple, then, if the more technical data could be useful
to Git developers, dump it in a non-translated format. (I guess it could
be hidden behind a debug switch, but let's take one step at a time..)
This is nothing I feel strongly about.

>  t/t5318-commit-graph.sh                  |  25 +++++

I wonder about tests. Some tests seem to use `dd` to corrupt files and
check that it gets caught. Doing this in a a hash-agnostic way could be
tricky, but maybe not impossible. I guess we could do something
probabilistic, like "set the first two bytes of the very last OID to
zero -- surely all OIDs can't start with 16 zero bits". Hmm, that might
still require knowing the size of the OIDs...

I hope to find time to do some more hands-on testing of this to see that
errors actually do get caught.

I saw you redirect stdout to a file "output", and anticipated later
commits to actually look into it. I never saw that though. (I did not
apply the patches, so I could have missed something.)

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (12 preceding siblings ...)
  2018-05-10 19:05   ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Martin Ågren
@ 2018-05-10 19:17   ` Ævar Arnfjörð Bjarmason
  2018-05-11 17:23     ` Derrick Stolee
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
  14 siblings, 1 reply; 149+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-10 19:17 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, peff\, sbeller\, jnareb\, stolee\


On Thu, May 10 2018, Derrick Stolee wrote:

> The behavior in this patch series does the following:
>
> 1. Near the end of 'git gc', run 'git commit-graph write'. The location
>    of this code assumes that a 'git gc --auto' has not terminated early
>    due to not meeting the auto threshold.
>
> 2. At the end of 'git fetch', run 'git commit-graph write'. This means
>    that every reachable commit will be in the commit-graph after a
>    a successful fetch, which seems a reasonable frequency. Then, the
>    only times we would be missing a reachable commit is after creating
>    one locally. There is a problem with the current patch, though: every
>    'git fetch' call runs 'git commit-graph write', even if there were no
>    ref updates or objects downloaded. Is there a simple way to detect if
>    the fetch was non-trivial?
>
> One obvious problem with this approach: if we compute this during 'gc'
> AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers
> two commit-graph writes. If I were to abandon one of these patches, it
> would be the 'fetch' integration. A 'git gc' really wants to delete all
> references to unreachable commits, and without updating the commit-graph
> we may still have commit data in the commit-graph file that is not in
> the object database. In fact, deleting commits from the object database
> but not from the commit-graph will cause 'git commit-graph verify' to
> fail!
>
> I welcome discussion on these ideas, as we are venturing out of the
> "pure data structure" world and into the "user experience" world. I am
> less confident in my skills in this world, but the feature is worthless
> if it does not improve the user experience.

I really like #1 here, but I wonder why #2 is necessary.

I.e. is it critical for the performance of the commit graph feature that
it be kept really up-to-date, moreso than other things that rely on gc
--auto (e.g. the optional bitmap index)?

Even if that's the case, I think something that does this via gc --auto
is a much better option. I.e. now we have gc.auto & gc.autoPackLimit, if
the answer to my question above is "yes" this could also be accomplished
by introducing a new graph-specific gc.* setting, and --auto would just
update the graph more often, but leave the rest.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-05-10 19:05   ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Martin Ågren
@ 2018-05-10 19:22     ` Stefan Beller
  2018-05-11 17:23       ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Stefan Beller @ 2018-05-10 19:22 UTC (permalink / raw)
  To: Martin Ågren; +Cc: Derrick Stolee, git, peff, jnareb, stolee

On Thu, May 10, 2018 at 12:05 PM, Martin Ågren <martin.agren@gmail.com> wrote:
> On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:
>
>> Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
>> are ready for full, rigorous review.
>
> I don't know about "full" and "rigorous", but I tried to offer my thoughts.
>
> A lingering feeling I have is that users could possibly benefit from
> seeing "the commit-graph has a bad foo" a bit more than just "the
> commit-graph is bad". But adding "the bar is baz, should have been
> frotz" might not bring that much. Maybe you could keep the translatable
> string somewhat simple, then, if the more technical data could be useful
> to Git developers, dump it in a non-translated format. (I guess it could
> be hidden behind a debug switch, but let's take one step at a time..)
> This is nothing I feel strongly about.
>
>>  t/t5318-commit-graph.sh                  |  25 +++++
>
> I wonder about tests. Some tests seem to use `dd` to corrupt files and
> check that it gets caught. Doing this in a a hash-agnostic way could be
> tricky, but maybe not impossible. I guess we could do something
> probabilistic, like "set the first two bytes of the very last OID to
> zero -- surely all OIDs can't start with 16 zero bits". Hmm, that might
> still require knowing the size of the OIDs...
>
> I hope to find time to do some more hands-on testing of this to see that
> errors actually do get caught.

Given that the commit graph is secondary data, the users work around
to quickly get back to a well working repository is most likely to remove
the file and regenerate it.
As a developer who wants to fix the bug, a stacktrace/datadump and the
history of git commands might be most valuable, but I agree we should
hide that behind a debug flag.

Packfiles and loose objects are primary data, which means that those
need a more advanced way to diagnose and repair them, so I would imagine
the commit graph fsck is closer to bitmaps fsck, which I would have suspected
to be found in t5310, but a quick read doesn't reveal many tests that are
checking for integrity. So I guess the test coverage here is ok, (although we
should always ask for more)

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 04/12] commit-graph: verify fanout and lookup table
  2018-05-10 18:29     ` Martin Ågren
@ 2018-05-11 15:17       ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 15:17 UTC (permalink / raw)
  To: Martin Ågren, Derrick Stolee; +Cc: git, peff, sbeller, jnareb

On 5/10/2018 2:29 PM, Martin Ågren wrote:
> On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:
>> While running 'git commit-graph verify', verify that the object IDs
>> are listed in lexicographic order and that the fanout table correctly
>> navigates into that list of object IDs.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c | 33 +++++++++++++++++++++++++++++++++
>>   1 file changed, 33 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index ce11af1d20..b4c146c423 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -839,6 +839,9 @@ static int verify_commit_graph_error;
>>
>>   int verify_commit_graph(struct commit_graph *g)
>>   {
>> +       uint32_t i, cur_fanout_pos = 0;
>> +       struct object_id prev_oid, cur_oid;
>> +
>>          if (!g) {
>>                  graph_report(_("no commit-graph file loaded"));
>>                  return 1;
>> @@ -853,5 +856,35 @@ int verify_commit_graph(struct commit_graph *g)
>>          if (!g->chunk_commit_data)
>>                  graph_report(_("commit-graph is missing the Commit Data chunk"));
>>
>> +       for (i = 0; i < g->num_commits; i++) {
>> +               hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>> +
>> +               if (i > 0 && oidcmp(&prev_oid, &cur_oid) >= 0)
>> +                       graph_report(_("commit-graph has incorrect oid order: %s then %s"),
> Minor: I think our style would prefer s/i > 0/i/.
>
> I suppose the second check should be s/>=/>/, but it's not like it
> should matter. ;-)

It shouldn't, but a duplicate commit is still an incorrect oid order.

> I wonder if this is a message that would virtually never make sense to a
> user, but more to a developer. Leave it untranslated to make sure any
> bug reports to the list are readable to us?

Yeah, I'll make all of the errors untranslated since they are for 
developer debugging, not end-user information.

>> +
>> +               oid_to_hex(&prev_oid),
>> +               oid_to_hex(&cur_oid));
> Hmm, these two lines do not actually achieve anything?

It's a formatting bug: They complete the statement starting with 
'graph_report(' above.

>
>> +               oidcpy(&prev_oid, &cur_oid);
>> +
>> +               while (cur_oid.hash[0] > cur_fanout_pos) {
>> +                       uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
>> +                       if (i != fanout_value)
>> +                               graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
>> +                                            cur_fanout_pos, fanout_value, i);
> Same though re `_()`, even more so because of the more technical
> notation.
>
>> +
>> +                       cur_fanout_pos++;
>> +               }
>> +       }
>> +
>> +       while (cur_fanout_pos < 256) {
>> +               uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
>> +
>> +               if (g->num_commits != fanout_value)
>> +                       graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
>> +                                    cur_fanout_pos, fanout_value, i);
> Same here. Or maybe these should just give a translated user-readable
> basic idea of what is wrong and skip the details?

As someone who is responsible for probably inserting these problems, I 
think the details are important.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-05-10 19:17   ` Ævar Arnfjörð Bjarmason
@ 2018-05-11 17:23     ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 17:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee
  Cc: git, peff, sbeller, jnareb

On 5/10/2018 3:17 PM, Ævar Arnfjörð Bjarmason wrote:
> On Thu, May 10 2018, Derrick Stolee wrote:
>
>> The behavior in this patch series does the following:
>>
>> 1. Near the end of 'git gc', run 'git commit-graph write'. The location
>>     of this code assumes that a 'git gc --auto' has not terminated early
>>     due to not meeting the auto threshold.
>>
>> 2. At the end of 'git fetch', run 'git commit-graph write'. This means
>>     that every reachable commit will be in the commit-graph after a
>>     a successful fetch, which seems a reasonable frequency. Then, the
>>     only times we would be missing a reachable commit is after creating
>>     one locally. There is a problem with the current patch, though: every
>>     'git fetch' call runs 'git commit-graph write', even if there were no
>>     ref updates or objects downloaded. Is there a simple way to detect if
>>     the fetch was non-trivial?
>>
>> One obvious problem with this approach: if we compute this during 'gc'
>> AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers
>> two commit-graph writes. If I were to abandon one of these patches, it
>> would be the 'fetch' integration. A 'git gc' really wants to delete all
>> references to unreachable commits, and without updating the commit-graph
>> we may still have commit data in the commit-graph file that is not in
>> the object database. In fact, deleting commits from the object database
>> but not from the commit-graph will cause 'git commit-graph verify' to
>> fail!
>>
>> I welcome discussion on these ideas, as we are venturing out of the
>> "pure data structure" world and into the "user experience" world. I am
>> less confident in my skills in this world, but the feature is worthless
>> if it does not improve the user experience.
> I really like #1 here, but I wonder why #2 is necessary.
>
> I.e. is it critical for the performance of the commit graph feature that
> it be kept really up-to-date, moreso than other things that rely on gc
> --auto (e.g. the optional bitmap index)?

It is not critical. The feature has been designed to have recent commits 
not in the file. For simplicity, it is probably best to limit ourselves 
to writing after a non-trivial 'gc'.

> Even if that's the case, I think something that does this via gc --auto
> is a much better option. I.e. now we have gc.auto & gc.autoPackLimit, if
> the answer to my question above is "yes" this could also be accomplished
> by introducing a new graph-specific gc.* setting, and --auto would just
> update the graph more often, but leave the rest.

This is an excellent idea for a follow-up series, if we find we want the 
commit-graph written more frequently. For now, I'm satisfied with one 
place where it is automatically computed.

I'll drop the fetch integration in my v2 series.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-05-10 19:22     ` Stefan Beller
@ 2018-05-11 17:23       ` Derrick Stolee
  2018-05-11 17:30         ` Martin Ågren
  0 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 17:23 UTC (permalink / raw)
  To: Stefan Beller, Martin Ågren; +Cc: Derrick Stolee, git, peff, jnareb

On 5/10/2018 3:22 PM, Stefan Beller wrote:
> On Thu, May 10, 2018 at 12:05 PM, Martin Ågren <martin.agren@gmail.com> wrote:
>> On 10 May 2018 at 19:34, Derrick Stolee <dstolee@microsoft.com> wrote:
>>
>>> Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
>>> are ready for full, rigorous review.
>> I don't know about "full" and "rigorous", but I tried to offer my thoughts.
>>
>> A lingering feeling I have is that users could possibly benefit from
>> seeing "the commit-graph has a bad foo" a bit more than just "the
>> commit-graph is bad". But adding "the bar is baz, should have been
>> frotz" might not bring that much. Maybe you could keep the translatable
>> string somewhat simple, then, if the more technical data could be useful
>> to Git developers, dump it in a non-translated format. (I guess it could
>> be hidden behind a debug switch, but let's take one step at a time..)
>> This is nothing I feel strongly about.
>>
>>>   t/t5318-commit-graph.sh                  |  25 +++++
>> I wonder about tests. Some tests seem to use `dd` to corrupt files and
>> check that it gets caught. Doing this in a a hash-agnostic way could be
>> tricky, but maybe not impossible. I guess we could do something
>> probabilistic, like "set the first two bytes of the very last OID to
>> zero -- surely all OIDs can't start with 16 zero bits". Hmm, that might
>> still require knowing the size of the OIDs...
>>
>> I hope to find time to do some more hands-on testing of this to see that
>> errors actually do get caught.
> Given that the commit graph is secondary data, the users work around
> to quickly get back to a well working repository is most likely to remove
> the file and regenerate it.
> As a developer who wants to fix the bug, a stacktrace/datadump and the
> history of git commands might be most valuable, but I agree we should
> hide that behind a debug flag.
>
> Packfiles and loose objects are primary data, which means that those
> need a more advanced way to diagnose and repair them, so I would imagine
> the commit graph fsck is closer to bitmaps fsck, which I would have suspected
> to be found in t5310, but a quick read doesn't reveal many tests that are
> checking for integrity. So I guess the test coverage here is ok, (although we
> should always ask for more)

My main goal is to help developers figure out what is wrong with a file, 
and then we can use other diagnostic tools to discover how it got into 
that state.

Martin's initial test cases are wonderful. I've adapted them to test the 
other conditions in the verify_commit_graph() method and caught some 
interesting behavior in the process. I'm preparing v2 so we can 
investigate the direction of the tests.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch
  2018-05-11 17:23       ` Derrick Stolee
@ 2018-05-11 17:30         ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-11 17:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Stefan Beller, Derrick Stolee, git, peff, jnareb

On 11 May 2018 at 19:23, Derrick Stolee <stolee@gmail.com> wrote:

> Martin's initial test cases are wonderful. I've adapted them to test the
> other conditions in the verify_commit_graph() method and caught some
> interesting behavior in the process. I'm preparing v2 so we can investigate
> the direction of the tests.

Cool, I'm glad you found them useful. One thought I had was that you
could possibly write the tests such that you introduce errors from the
back of the file. That might enable you to do less of the "backup
commit-graph file and restore it"-dance. Just a thought.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 00/12] Integrate commit-graph into fsck and gc
  2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
                     ` (13 preceding siblings ...)
  2018-05-10 19:17   ` Ævar Arnfjörð Bjarmason
@ 2018-05-11 21:15   ` Derrick Stolee
  2018-05-11 21:15     ` [PATCH v2 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
                       ` (12 more replies)
  14 siblings, 13 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

I'm sending this v2 re-roll rather quickly after the previous version
because Martin provided a framework to add tests to the 'verify'
subcommand. I took that framework and added tests for the other checks
of the commit-graph data. This also found some interesting things about
the command:

1. There were some segfaults because we were not checking for bad data
   carefully enough.

2. To avoid segfaults, we will now terminate the check early if we find
   problems earlier in the file, such as in the header, or OID lookup.

3. We were not writing newlines between reports. This now happens by
   default in graph_report().

The integration into 'fetch' is dropped (thanks Ævar!).

Derrick Stolee (12):
  commit-graph: add 'verify' subcommand
  commit-graph: verify file header information
  commit-graph: test that 'verify' finds corruption
  commit-graph: parse commit from chosen graph
  commit-graph: verify fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: verify commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/config.txt                 |   6 +
 Documentation/git-commit-graph.txt       |  14 ++-
 Documentation/git-fsck.txt               |   3 +
 Documentation/git-gc.txt                 |   4 +
 Documentation/technical/commit-graph.txt |  22 ----
 builtin/commit-graph.c                   |  81 ++++++++++++-
 builtin/fsck.c                           |  21 ++++
 builtin/gc.c                             |   8 ++
 commit-graph.c                           | 199 ++++++++++++++++++++++++++++++-
 commit-graph.h                           |   2 +
 commit.c                                 |  13 +-
 commit.h                                 |   1 +
 t/t5318-commit-graph.sh                  | 173 +++++++++++++++++++++++++++
 13 files changed, 509 insertions(+), 38 deletions(-)


base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 01/12] commit-graph: add 'verify' subcommand
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 13:31       ` Martin Ågren
  2018-05-20 12:10       ` Jakub Narebski
  2018-05-11 21:15     ` [PATCH v2 02/12] commit-graph: verify file header information Derrick Stolee
                       ` (11 subsequent siblings)
  12 siblings, 2 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

If the commit-graph file becomes corrupt, we need a way to verify
that its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

During review, we noticed that a FREE_AND_NULL(graph_name) was
placed after a possible 'return', and this pattern was also in
graph_read(). Fix that case, too.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  6 ++++++
 builtin/commit-graph.c             | 40 +++++++++++++++++++++++++++++++++++++-
 commit-graph.c                     |  5 +++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            | 10 ++++++++++
 5 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git commit-graph read' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 --------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..af3101291f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+	N_("git commit-graph verify [--object-dir <objdir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
 	int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+	struct commit_graph *graph = 0;
+	char *graph_name;
+
+	static struct option builtin_commit_graph_verify_options[] = {
+		OPT_STRING(0, "object-dir", &opts.obj_dir,
+			   N_("dir"),
+			   N_("The object directory to store the graph")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_verify_options,
+			     builtin_commit_graph_verify_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	graph_name = get_commit_graph_filename(opts.obj_dir);
+	graph = load_commit_graph_one(graph_name);
+	FREE_AND_NULL(graph_name);
+
+	if (!graph)
+		return 0;
+
+	return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -50,10 +86,10 @@ static int graph_read(int argc, const char **argv)
 
 	graph_name = get_commit_graph_filename(opts.obj_dir);
 	graph = load_commit_graph_one(graph_name);
+	FREE_AND_NULL(graph_name);
 
 	if (!graph)
 		die("graph file %s does not exist", graph_name);
-	FREE_AND_NULL(graph_name);
 
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "verify"))
+			return graph_verify(argc, argv);
 		if (!strcmp(argv[0], "read"))
 			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..b25aaed128 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -817,3 +817,8 @@ void write_commit_graph(const char *obj_dir,
 	oids.alloc = 0;
 	oids.nr = 0;
 }
+
+int verify_commit_graph(struct commit_graph *g)
+{
+	return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..71a39c5a57 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
 			int nr_commits,
 			int append);
 
+int verify_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..6ca451dfd2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
 	objdir=".git/objects"
 '
 
+test_expect_success 'verify graph with no graph file' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph verify
+'
+
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
@@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+test_expect_success 'git commit-graph verify' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph verify >output
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 02/12] commit-graph: verify file header information
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
  2018-05-11 21:15     ` [PATCH v2 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 13:35       ` Martin Ågren
  2018-05-20 20:00       ` Jakub Narebski
  2018-05-11 21:15     ` [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption Derrick Stolee
                       ` (10 subsequent siblings)
  12 siblings, 2 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

During a run of 'git commit-graph verify', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'. Some header
information is checked as part of load_commit_graph_one().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index b25aaed128..d2db20e49a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -818,7 +818,37 @@ void write_commit_graph(const char *obj_dir,
 	oids.nr = 0;
 }
 
+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+	va_list ap;
+	struct strbuf sb = STRBUF_INIT;
+	verify_commit_graph_error = 1;
+
+	va_start(ap, fmt);
+	strbuf_vaddf(&sb, fmt, ap);
+
+	fprintf(stderr, "%s\n", sb.buf);
+	strbuf_release(&sb);
+	va_end(ap);
+}
+
 int verify_commit_graph(struct commit_graph *g)
 {
-	return !g;
+	if (!g) {
+		graph_report("no commit-graph file loaded");
+		return 1;
+	}
+
+	verify_commit_graph_error = 0;
+
+	if (!g->chunk_oid_fanout)
+		graph_report("commit-graph is missing the OID Fanout chunk");
+	if (!g->chunk_oid_lookup)
+		graph_report("commit-graph is missing the OID Lookup chunk");
+	if (!g->chunk_commit_data)
+		graph_report("commit-graph is missing the Commit Data chunk");
+
+	return verify_commit_graph_error;
 }
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
  2018-05-11 21:15     ` [PATCH v2 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
  2018-05-11 21:15     ` [PATCH v2 02/12] commit-graph: verify file header information Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 13:43       ` Martin Ågren
  2018-05-21 18:53       ` Jakub Narebski
  2018-05-11 21:15     ` [PATCH v2 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
                       ` (9 subsequent siblings)
  12 siblings, 2 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

Add test cases to t5318-commit-graph.sh that corrupt the commit-graph
file and check that the 'git commit-graph verify' command fails. These
tests verify the header and chunk information is checked carefully.

Helped-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5318-commit-graph.sh | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..0cb88232fa 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -240,4 +240,57 @@ test_expect_success 'git commit-graph verify' '
 	git commit-graph verify >output
 '
 
+# usage: corrupt_data <file> <pos> [<data>]
+corrupt_data() {
+	file=$1
+	pos=$2
+	data="${3:-\0}"
+	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
+}
+
+test_expect_success 'detect bad signature' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 0 "\0" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "graph signature" verify-errors
+'
+
+test_expect_success 'detect bad version number' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 4 "\02" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "graph version" verify-errors
+'
+
+test_expect_success 'detect bad hash version' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 5 "\02" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "hash version" verify-errors
+'
+
+test_expect_success 'detect too small chunk-count' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 6 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 2 verify-errors &&
+	grep "missing the OID Lookup chunk" verify-errors &&
+	grep "missing the Commit Data chunk" verify-errors
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 04/12] commit-graph: parse commit from chosen graph
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (2 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 20:50       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 05/12] commit-graph: verify fanout and lookup table Derrick Stolee
                       ` (8 subsequent siblings)
  12 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2db20e49a..4dfff7e752 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -309,7 +309,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
 	uint32_t pos;
 
@@ -317,9 +317,21 @@ int parse_commit_in_graph(struct commit *item)
 		return 0;
 	if (item->object.parsed)
 		return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+		return fill_commit_in_graph(item, g, pos);
+
+	return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commit_graph)
+		return 0;
+
 	prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-		return fill_commit_in_graph(item, commit_graph, pos);
+	if (commit_graph)
+		return parse_commit_in_graph_one(commit_graph, item);
 	return 0;
 }
 
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 05/12] commit-graph: verify fanout and lookup table
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (3 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-11 21:15     ` [PATCH v2 06/12] commit: force commit to parse from object database Derrick Stolee
                       ` (7 subsequent siblings)
  12 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

While running 'git commit-graph verify', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

Add tests that check these corruptions are caught by the verify
subcommand. Most of the tests check the full output matches the exact
error we inserted, but since our OID order test triggers incorrect
fanout values (with possibly different numbers of output lines) we
focus only that the correct error is written in that case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 36 ++++++++++++++++++++++++++++++++++++
 t/t5318-commit-graph.sh | 31 +++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 4dfff7e752..b0fd1d5320 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -848,6 +848,9 @@ static void graph_report(const char *fmt, ...)
 
 int verify_commit_graph(struct commit_graph *g)
 {
+	uint32_t i, cur_fanout_pos = 0;
+	struct object_id prev_oid, cur_oid;
+
 	if (!g) {
 		graph_report("no commit-graph file loaded");
 		return 1;
@@ -862,5 +865,38 @@ int verify_commit_graph(struct commit_graph *g)
 	if (!g->chunk_commit_data)
 		graph_report("commit-graph is missing the Commit Data chunk");
 
+	if (verify_commit_graph_error)
+		return 1;
+
+	for (i = 0; i < g->num_commits; i++) {
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report("commit-graph has incorrect oid order: %s then %s",
+				     oid_to_hex(&prev_oid),
+				     oid_to_hex(&cur_oid));
+
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+	}
+
+	while (cur_fanout_pos < 256) {
+		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+		if (g->num_commits != fanout_value)
+			graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+				     cur_fanout_pos, fanout_value, i);
+
+		cur_fanout_pos++;
+	}
+
 	return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 0cb88232fa..6fb306b0da 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -293,4 +293,35 @@ test_expect_success 'detect too small chunk-count' '
 	grep "missing the Commit Data chunk" verify-errors
 '
 
+test_expect_success 'detect incorrect chunk lookup value' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 25 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "improper chunk offset" verify-errors
+'
+
+test_expect_success 'detect incorrect fanout' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 128 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "fanout value" verify-errors
+'
+
+test_expect_success 'detect incorrect OID order' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1272 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep "incorrect oid order" err
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 06/12] commit: force commit to parse from object database
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (4 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 05/12] commit-graph: verify fanout and lookup table Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 20:54       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
                       ` (6 subsequent siblings)
  12 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 13 +++++++++----
 commit.h |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..7c92350373 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
 	enum object_type type;
 	void *buffer;
@@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (parse_commit_in_graph(item))
+	if (use_commit_graph && parse_commit_in_graph(item))
 		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
 		return quiet_on_missing ? -1 :
 			error("Could not read %s",
-			     oid_to_hex(&item->object.oid));
+					oid_to_hex(&item->object.oid));
 	if (type != OBJ_COMMIT) {
 		free(buffer);
 		return error("Object %s not a commit",
-			     oid_to_hex(&item->object.oid));
+				oid_to_hex(&item->object.oid));
 	}
 	ret = parse_commit_buffer(item, buffer, size, 0);
 	if (save_commit_buffer && !ret) {
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 	return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+	return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
 	if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 07/12] commit-graph: load a root tree from specific graph
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (5 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 06/12] commit: force commit to parse from object database Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 20:55       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 08/12] commit-graph: verify commit contents against odb Derrick Stolee
                       ` (5 subsequent siblings)
  12 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b0fd1d5320..5bb93e533c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -357,14 +357,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *
 	return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+						 const struct commit *c)
 {
 	if (c->maybe_tree)
 		return c->maybe_tree;
 	if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
 		BUG("get_commit_tree_in_graph called from non-commit-graph commit");
 
-	return load_tree_for_commit(commit_graph, (struct commit *)c);
+	return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+	return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 08/12] commit-graph: verify commit contents against odb
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (6 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-12 21:17       ` Martin Ågren
  2018-05-15 21:12       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 09/12] fsck: verify commit-graph Derrick Stolee
                       ` (4 subsequent siblings)
  12 siblings, 2 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

When running 'git commit-graph verify', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

Parse the commit from the graph during the initial loop through the
object IDs to guarantee we parse from the commit-graph file.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

While testing, we discovered that mutating the integer value for a
parent to be outside the accepted range causes a segmentation fault. Add
a new check in insert_parent_or_die() that prevents this fault. Check
for that error during the test, both in the typical parents and in the
list of parents for octopus merges.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
 t/t5318-commit-graph.sh |  64 +++++++++++++++++++++++++++++++
 2 files changed, 164 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 5bb93e533c..a15ad9710d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -237,6 +237,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 {
 	struct commit *c;
 	struct object_id oid;
+
+	if (pos >= g->num_commits)
+		die("invalide parent position %"PRIu64, pos);
+
 	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
 	c = lookup_commit(&oid);
 	if (!c)
@@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
 		return 1;
 
 	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit;
+
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
 		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g)
 
 			cur_fanout_pos++;
 		}
+
+		graph_commit = lookup_commit(&cur_oid);
+		if (!parse_commit_in_graph_one(g, graph_commit))
+			graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid));
 	}
 
 	while (cur_fanout_pos < 256) {
@@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g)
 		cur_fanout_pos++;
 	}
 
+	if (verify_commit_graph_error)
+		return 1;
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit, *odb_commit;
+		struct commit_list *graph_parents, *odb_parents;
+		int num_parents = 0;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		graph_commit = lookup_commit(&cur_oid);
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0)) {
+			graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid));
+			continue;
+		}
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report("root tree object ID for commit %s in commit-graph is %s != %s",
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"",
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
+
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			num_parents++;
+
+			if (odb_parents == NULL)
+				graph_report("commit-graph parent list for commit %s is too long (%d)",
+					     oid_to_hex(&cur_oid),
+					     num_parents);
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report("commit-graph parent for %s is %s != %s",
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+					     oid_to_hex(&odb_parents->item->object.oid));
+
+			graph_parents = graph_parents->next;
+			odb_parents = odb_parents->next;
+		}
+
+		if (odb_parents != NULL)
+			graph_report("commit-graph parent list for commit %s terminates early",
+				     oid_to_hex(&cur_oid));
+
+		if (graph_commit->generation) {
+			uint32_t max_generation = 0;
+			graph_parents = graph_commit->parents;
+
+			while (graph_parents) {
+				if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
+				    graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
+					graph_report("commit-graph has valid generation for %s but not its parent, %s",
+						     oid_to_hex(&cur_oid),
+						     oid_to_hex(&graph_parents->item->object.oid));
+				if (graph_parents->item->generation > max_generation)
+					max_generation = graph_parents->item->generation;
+				graph_parents = graph_parents->next;
+			}
+
+			if (max_generation == GENERATION_NUMBER_MAX)
+				max_generation--;
+
+			if (graph_commit->generation != max_generation + 1)
+				graph_report("commit-graph has incorrect generation for %s",
+					     oid_to_hex(&cur_oid));
+		} else {
+			graph_parents = graph_commit->parents;
+
+			while (graph_parents) {
+				if (graph_parents->item->generation)
+					graph_report("commit-graph has generation ZERO for %s but not its parent, %s",
+						     oid_to_hex(&cur_oid),
+						     oid_to_hex(&graph_parents->item->object.oid));
+				graph_parents = graph_parents->next;
+			}
+		}
+	}
+
 	return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6fb306b0da..5ab268a024 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -235,8 +235,15 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+# the verify tests below expect the commit-graph to contain all
+# commits from the 'full' repo to be in the commit-graph file.
+# If the file changes the set of commits in the list, then the
+# offsets into the binary file will result in different edits
+# and the tests will likely break.
+
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
+	git show-ref -s | git commit-graph write --stdin-commits &&
 	git commit-graph verify >output
 '
 
@@ -324,4 +331,61 @@ test_expect_success 'detect incorrect OID order' '
 	grep "incorrect oid order" err
 '
 
+test_expect_success 'detect OID not in object database' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1134 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 3 verify-errors &&
+	grep "Could not read" verify-errors &&
+	grep "parent" verify-errors &&
+	grep "from object database" verify-errors
+'
+
+test_expect_success 'detect incorrect tree OID' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1312 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "root tree object ID for commit " verify-errors
+'
+
+test_expect_success 'detect incorrect parent id' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1332 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "parent " verify-errors
+'
+
+test_expect_success 'detect incorrect parent id in large edges' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1712 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep -v "^\+" err > verify-errors &&
+	test_line_count = 1 verify-errors &&
+	grep "parent " verify-errors
+'
+
+test_expect_success 'detect incorrect commit date and generation number' '
+	cd "$TRASH_DIRECTORY/full" &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	corrupt_data $objdir/info/commit-graph 1340 "\01" &&
+	corrupt_data $objdir/info/commit-graph 1344 "\01" &&
+	test_must_fail git commit-graph verify 2>err &&
+	grep "incorrect generation" err &&
+	grep "commit date" err
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 09/12] fsck: verify commit-graph
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (7 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 08/12] commit-graph: verify commit contents against odb Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-17 18:13       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 10/12] commit-graph: add '--reachable' option Derrick Stolee
                       ` (3 subsequent siblings)
  12 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

If core.commitGraph is true, verify the contents of the commit-graph
during 'git fsck' using the 'git commit-graph verify' subcommand. Run
this check on all alternates, as well.

We use a new process for two reasons:

1. The subcommand decouples the details of loading and verifying a
   commit-graph file from the other fsck details.

2. The commit-graph verification requires the commits to be loaded
   in a specific order to guarantee we parse from the commit-graph
   file for some objects and from the object database for others.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-fsck.txt |  3 +++
 builtin/fsck.c             | 21 +++++++++++++++++++++
 t/t5318-commit-graph.sh    | 21 ++++++++++++++++++---
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
index b9f060e3b2..ab9a93fb9b 100644
--- a/Documentation/git-fsck.txt
+++ b/Documentation/git-fsck.txt
@@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives
 (i.e., you can just remove them and do an 'rsync' with some other site in
 the hopes that somebody else has the object you have corrupted).
 
+If core.commitGraph is true, the commit-graph file will also be inspected
+using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
+
 Extracted Diagnostics
 ---------------------
 
diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..a6d5045b77 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	}
 
 	check_connectivity();
+
+	if (core_commit_graph) {
+		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
+		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL };
+		commit_graph_verify.argv = verify_argv;
+		commit_graph_verify.git_cmd = 1;
+
+		if (run_command(&commit_graph_verify))
+			errors_found |= ERROR_COMMIT_GRAPH;
+
+		prepare_alt_odb();
+		for (alt = alt_odb_list; alt; alt = alt->next) {
+			verify_argv[2] = "--object-dir";
+			verify_argv[3] = alt->path;
+			if (run_command(&commit_graph_verify))
+				errors_found |= ERROR_COMMIT_GRAPH;
+		}
+	}
+
 	return errors_found;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 5ab268a024..91c8406d97 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' '
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
 
+test_expect_success 'build graph using --reachable' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph write --reachable &&
+	test_path_is_file $objdir/info/commit-graph &&
+	graph_read_expect "11" "large_edges"
+'
+
+graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
+graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd "$TRASH_DIRECTORY" &&
 	git clone --bare --no-local full bare &&
@@ -335,7 +345,7 @@ test_expect_success 'detect OID not in object database' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cp $objdir/info/commit-graph commit-graph-backup &&
 	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
-	corrupt_data $objdir/info/commit-graph 1134 "\01" &&
+	corrupt_data $objdir/info/commit-graph 1134 "\00" &&
 	test_must_fail git commit-graph verify 2>err &&
 	grep -v "^\+" err > verify-errors &&
 	test_line_count = 3 verify-errors &&
@@ -348,7 +358,7 @@ test_expect_success 'detect incorrect tree OID' '
 	cd "$TRASH_DIRECTORY/full" &&
 	cp $objdir/info/commit-graph commit-graph-backup &&
 	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
-	corrupt_data $objdir/info/commit-graph 1312 "\01" &&
+	corrupt_data $objdir/info/commit-graph 1312 "\00" &&
 	test_must_fail git commit-graph verify 2>err &&
 	grep -v "^\+" err > verify-errors &&
 	test_line_count = 1 verify-errors &&
@@ -382,10 +392,15 @@ test_expect_success 'detect incorrect commit date and generation number' '
 	cp $objdir/info/commit-graph commit-graph-backup &&
 	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
 	corrupt_data $objdir/info/commit-graph 1340 "\01" &&
-	corrupt_data $objdir/info/commit-graph 1344 "\01" &&
+	corrupt_data $objdir/info/commit-graph 1344 "\00" &&
 	test_must_fail git commit-graph verify 2>err &&
 	grep "incorrect generation" err &&
 	grep "commit date" err
 '
 
+test_expect_success 'git fsck (checks commit-graph)' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git fsck
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 10/12] commit-graph: add '--reachable' option
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (8 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 09/12] fsck: verify commit-graph Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-17 18:16       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 11/12] gc: automatically write commit-graph files Derrick Stolee
                       ` (2 subsequent siblings)
  12 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 ++++++--
 builtin/commit-graph.c             | 41 ++++++++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index a222cfab08..cc1715a823 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index af3101291f..7cb94a4813 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *obj_dir;
+	int reachable;
 	int stdin_packs;
 	int stdin_commits;
 	int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
 	return 0;
 }
 
+struct hex_list {
+	char **hex_strs;
+	int hex_nr;
+	int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+			   const struct object_id *oid,
+			   int flags, void *cb_data)
+{
+	struct hex_list *list = (struct hex_list*)cb_data;
+
+	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+	list->hex_nr++;
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
 			N_("dir"),
 			N_("The object directory to store the graph")),
+		OPT_BOOL(0, "reachable", &opts.reachable,
+			N_("start walk at all refs")),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			N_("scan pack-indexes listed by stdin for commits")),
 		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
 
-	if (opts.stdin_packs && opts.stdin_commits)
-		die(_("cannot use both --stdin-commits and --stdin-packs"));
+	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
 
@@ -164,6 +187,16 @@ static int graph_write(int argc, const char **argv)
 			commit_hex = lines;
 			commits_nr = lines_nr;
 		}
+	} else if (opts.reachable) {
+		struct hex_list list;
+		list.hex_nr = 0;
+		list.hex_alloc = 128;
+		ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
+
+		for_each_ref(add_ref_to_list, &list);
+
+		commit_hex = (const char **)list.hex_strs;
+		commits_nr = list.hex_nr;
 	}
 
 	write_commit_graph(opts.obj_dir,
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 11/12] gc: automatically write commit-graph files
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (9 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 10/12] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-17 18:20       ` Martin Ågren
  2018-05-11 21:15     ` [PATCH v2 12/12] commit-graph: update design document Derrick Stolee
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  12 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to turn this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt | 6 ++++++
 Documentation/git-gc.txt | 4 ++++
 builtin/gc.c             | 8 ++++++++
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
 	if the system supports it. Default is true.
 
+gc.commitGraph::
+	If true, then gc will rewrite the commit-graph file after any
+	change to the object database. If '--auto' is used, then the
+	commit-graph will not be updated unless the threshold is met.
+	See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
 	If the file gc.log exists, then `git gc --auto` won't run
 	unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..8403445738 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
 	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	argv_array_pushl(&prune, "prune", "--expire", NULL);
 	argv_array_pushl(&prune_worktrees, "worktree", "prune", "--expire", NULL);
 	argv_array_pushl(&rerere, "rerere", "gc", NULL);
+	argv_array_pushl(&commit_graph, "commit-graph", "write", "--reachable", NULL);
 
 	/* default expiry time, overwritten in gc_config */
 	gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();
 
+	if (gc_commit_graph)
+		if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, commit_graph.argv[0]);
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v2 12/12] commit-graph: update design document
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (10 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 11/12] gc: automatically write commit-graph files Derrick Stolee
@ 2018-05-11 21:15     ` Derrick Stolee
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  12 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-11 21:15 UTC (permalink / raw)
  To: git; +Cc: jnareb, avarab, martin.agren, peff, Derrick Stolee

The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Also remove the section on lazy-loading trees, as that was completed
in an earlier patch series.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 22 ----------------------
 1 file changed, 22 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index e1a883eb46..c664acbd76 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -130,25 +127,6 @@ Future Work
     - 'log --topo-order'
     - 'tag --merged'
 
-- Currently, parse_commit_gently() requires filling in the root tree
-  object for a commit. This passes through lookup_tree() and consequently
-  lookup_object(). Also, it calls lookup_commit() when loading the parents.
-  These method calls check the ODB for object existence, even if the
-  consumer does not need the content. For example, we do not need the
-  tree contents when computing merge bases. Now that commit parsing is
-  removed from the computation time, these lookup operations are the
-  slowest operations keeping graph walks from being fast. Consider
-  loading these objects without verifying their existence in the ODB and
-  only loading them fully when consumers need them. Consider a method
-  such as "ensure_tree_loaded(commit)" that fully loads a tree before
-  using commit->tree.
-
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand
  2018-05-11 21:15     ` [PATCH v2 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
@ 2018-05-12 13:31       ` Martin Ågren
  2018-05-14 13:27         ` Derrick Stolee
  2018-05-20 12:10       ` Jakub Narebski
  1 sibling, 1 reply; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 13:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:

>         graph_name = get_commit_graph_filename(opts.obj_dir);
>         graph = load_commit_graph_one(graph_name);
> +       FREE_AND_NULL(graph_name);
>
>         if (!graph)
>                 die("graph file %s does not exist", graph_name);
> -       FREE_AND_NULL(graph_name);

This is probably because of something I said, but this does not look
correct. The `die()` would typically print "(null)" or segfault. If the
`die()` means we don't free `graph_name`, that should be fine.

You're still leaking `graph` here (possibly nothing this patch should
worry about) and in `graph_verify()`. UNLEAK-ing it immediately before
calling `verify_commit_graph()` should be ok. I also think punting on
this UNLEAK-business entirely would be ok; I was just a bit surprised to
see one variable getting freed and the other one ignored.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 02/12] commit-graph: verify file header information
  2018-05-11 21:15     ` [PATCH v2 02/12] commit-graph: verify file header information Derrick Stolee
@ 2018-05-12 13:35       ` Martin Ågren
  2018-05-14 13:31         ` Derrick Stolee
  2018-05-20 20:00       ` Jakub Narebski
  1 sibling, 1 reply; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 13:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
> During a run of 'git commit-graph verify', list the issues with the
> header information in the commit-graph file. Some of this information
> is inferred from the loaded 'struct commit_graph'. Some header
> information is checked as part of load_commit_graph_one().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index b25aaed128..d2db20e49a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -818,7 +818,37 @@ void write_commit_graph(const char *obj_dir,
>         oids.nr = 0;
>  }
>
> +static int verify_commit_graph_error;
> +
> +static void graph_report(const char *fmt, ...)
> +{
> +       va_list ap;
> +       struct strbuf sb = STRBUF_INIT;
> +       verify_commit_graph_error = 1;
> +
> +       va_start(ap, fmt);
> +       strbuf_vaddf(&sb, fmt, ap);
> +
> +       fprintf(stderr, "%s\n", sb.buf);
> +       strbuf_release(&sb);
> +       va_end(ap);
> +}

Right, so this replaces the macro-trickery from v1, and we print a
newline after each error.

>  int verify_commit_graph(struct commit_graph *g)
>  {
> -       return !g;
> +       if (!g) {
> +               graph_report("no commit-graph file loaded");
> +               return 1;
> +       }

> +
> +       return verify_commit_graph_error;
>  }

Not sure it matters much: I suppose you could introduce the parts that I
have quoted here in the previous patch. Or maybe not.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
  2018-05-11 21:15     ` [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption Derrick Stolee
@ 2018-05-12 13:43       ` Martin Ågren
  2018-05-21 18:53       ` Jakub Narebski
  1 sibling, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 13:43 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:

> +test_expect_success 'detect bad signature' '
> +       cd "$TRASH_DIRECTORY/full" &&

I was a bit surprised at the "cd outside subshell", but then realized
that this file already does that. It will only be a problem if later
tests think they're somewhere else. Let's read on.

> +       cp $objdir/info/commit-graph commit-graph-backup &&
> +       test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +       corrupt_data $objdir/info/commit-graph 0 "\0" &&
> +       test_must_fail git commit-graph verify 2>err &&
> +       grep -v "^\+" err > verify-errors &&
> +       test_line_count = 1 verify-errors &&
> +       grep "graph signature" verify-errors
> +'
> +
> +test_expect_success 'detect bad version number' '
> +       cd "$TRASH_DIRECTORY/full" &&
> +       cp $objdir/info/commit-graph commit-graph-backup &&
> +       test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +       corrupt_data $objdir/info/commit-graph 4 "\02" &&
> +       test_must_fail git commit-graph verify 2>err &&
> +       grep -v "^\+" err > verify-errors &&
> +       test_line_count = 1 verify-errors &&
> +       grep "graph version" verify-errors
> +'
> +
> +test_expect_success 'detect bad hash version' '
> +       cd "$TRASH_DIRECTORY/full" &&
> +       cp $objdir/info/commit-graph commit-graph-backup &&
> +       test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +       corrupt_data $objdir/info/commit-graph 5 "\02" &&
> +       test_must_fail git commit-graph verify 2>err &&
> +       grep -v "^\+" err > verify-errors &&
> +       test_line_count = 1 verify-errors &&
> +       grep "hash version" verify-errors
> +'

These look a bit boiler-platey. Maybe not too bad though.

> +test_expect_success 'detect too small chunk-count' '

s/too small/bad/?

To be honest, I wrote this title without thinking too hard about the
problem. In general, it would be quite hard for `git commit-graph
verify` to say "*this* is wrong in your file" ("the number of chunks is
too small") -- it should be much easier to say "*something* is wrong".

> +       cd "$TRASH_DIRECTORY/full" &&
> +       cp $objdir/info/commit-graph commit-graph-backup &&
> +       test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +       corrupt_data $objdir/info/commit-graph 6 "\01" &&
> +       test_must_fail git commit-graph verify 2>err &&
> +       grep -v "^\+" err > verify-errors &&
> +       test_line_count = 2 verify-errors &&
> +       grep "missing the OID Lookup chunk" verify-errors &&
> +       grep "missing the Commit Data chunk" verify-errors

Maybe these tests could go with the previous patch(es). IMVHO I would
prefer reading the test with the implementation. A separate commit for
the tests might make sense if they're really tricky and need some
explaining, but I don't think that's the case here.

All of these comments are just minor nits, or not even that. I will
continue with the other patches at another time.

Thank you, I'm really looking forward to Git with commit-graph magic.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 04/12] commit-graph: parse commit from chosen graph
  2018-05-11 21:15     ` [PATCH v2 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-05-12 20:50       ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 20:50 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

> -int parse_commit_in_graph(struct commit *item)
> +int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)

I think this function should be static.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 06/12] commit: force commit to parse from object database
  2018-05-11 21:15     ` [PATCH v2 06/12] commit: force commit to parse from object database Derrick Stolee
@ 2018-05-12 20:54       ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 20:54 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:

> -int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
>  {
>         enum object_type type;
>         void *buffer;
> @@ -403,17 +403,17 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>                 return -1;
>         if (item->object.parsed)
>                 return 0;
> -       if (parse_commit_in_graph(item))
> +       if (use_commit_graph && parse_commit_in_graph(item))
>                 return 0;

Right, this is where we check the graph. It's the only place we need to
consider the new flag.

>         buffer = read_sha1_file(item->object.oid.hash, &type, &size);
>         if (!buffer)
>                 return quiet_on_missing ? -1 :
>                         error("Could not read %s",
> -                            oid_to_hex(&item->object.oid));
> +                                       oid_to_hex(&item->object.oid));
>         if (type != OBJ_COMMIT) {
>                 free(buffer);
>                 return error("Object %s not a commit",
> -                            oid_to_hex(&item->object.oid));
> +                               oid_to_hex(&item->object.oid));

Some spurious indentation reshuffling going on in two lines here.

> --- a/commit.h
> +++ b/commit.h
> @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>
>  int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);

Unlike my comment on a previous patch, this one is meant for external
use. That's why it's not marked as static above. Ok.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 07/12] commit-graph: load a root tree from specific graph
  2018-05-11 21:15     ` [PATCH v2 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-05-12 20:55       ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 20:55 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:

> -struct tree *get_commit_tree_in_graph(const struct commit *c)
> +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
> +                                                const struct commit *c)
>  {
>         if (c->maybe_tree)
>                 return c->maybe_tree;
>         if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
>                 BUG("get_commit_tree_in_graph called from non-commit-graph commit");

Update the function name in the BUG? Not that it will ever matter. ;-)

(This one is now static, ok.)

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb
  2018-05-11 21:15     ` [PATCH v2 08/12] commit-graph: verify commit contents against odb Derrick Stolee
@ 2018-05-12 21:17       ` Martin Ågren
  2018-05-14 13:44         ` Derrick Stolee
  2018-05-15 21:12       ` Martin Ågren
  1 sibling, 1 reply; 149+ messages in thread
From: Martin Ågren @ 2018-05-12 21:17 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
> When running 'git commit-graph verify', compare the contents of the
> commits that are loaded from the commit-graph file with commits that are
> loaded directly from the object database. This includes checking the
> root tree object ID, commit date, and parents.
>
> Parse the commit from the graph during the initial loop through the
> object IDs to guarantee we parse from the commit-graph file.
>
> In addition, verify the generation number calculation is correct for all
> commits in the commit-graph file.
>
> While testing, we discovered that mutating the integer value for a
> parent to be outside the accepted range causes a segmentation fault. Add
> a new check in insert_parent_or_die() that prevents this fault. Check
> for that error during the test, both in the typical parents and in the
> list of parents for octopus merges.

This paragraph and the corresponding fix and test feel like a separate
patch to me. (The commit message of it could be "To test the next patch,
we threw invalid data at `git commit-graph verify, and it crashed in
pre-existing code, so let's fix that first" -- there is definitely a
connection.) Is this important enough to fast-track to master in time
for 2.18? My guess would be no.

> +
> +       if (pos >= g->num_commits)
> +               die("invalide parent position %"PRIu64, pos);

s/invalide/invalid/

> @@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
>                 return 1;
>
>         for (i = 0; i < g->num_commits; i++) {
> +               struct commit *graph_commit;
> +
>                 hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>
>                 if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
> @@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g)
>
>                         cur_fanout_pos++;
>                 }
> +
> +               graph_commit = lookup_commit(&cur_oid);
> +               if (!parse_commit_in_graph_one(g, graph_commit))
> +                       graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid));
>         }

Could this end up giving ridiculous amounts of output? It would depend
on the input, I guess.

>         while (cur_fanout_pos < 256) {
> @@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g)
>                 cur_fanout_pos++;
>         }
>
> +       if (verify_commit_graph_error)
> +               return 1;

Well, here we give up before running into *too* much problem.

> +       for (i = 0; i < g->num_commits; i++) {
> +               struct commit *graph_commit, *odb_commit;
> +               struct commit_list *graph_parents, *odb_parents;
> +               int num_parents = 0;
> +
> +               hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> +
> +               graph_commit = lookup_commit(&cur_oid);
> +               odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
> +               if (parse_commit_internal(odb_commit, 0, 0)) {
> +                       graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid));
> +                       continue;
> +               }
> +
> +               if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
> +                          get_commit_tree_oid(odb_commit)))
> +                       graph_report("root tree object ID for commit %s in commit-graph is %s != %s",
> +                                    oid_to_hex(&cur_oid),
> +                                    oid_to_hex(get_commit_tree_oid(graph_commit)),
> +                                    oid_to_hex(get_commit_tree_oid(odb_commit)));
> +
> +               if (graph_commit->date != odb_commit->date)
> +                       graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"",
> +                                    oid_to_hex(&cur_oid),
> +                                    graph_commit->date,
> +                                    odb_commit->date);
> +
> +
> +               graph_parents = graph_commit->parents;
> +               odb_parents = odb_commit->parents;
> +
> +               while (graph_parents) {
> +                       num_parents++;
> +
> +                       if (odb_parents == NULL)
> +                               graph_report("commit-graph parent list for commit %s is too long (%d)",
> +                                            oid_to_hex(&cur_oid),
> +                                            num_parents);
> +
> +                       if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
> +                               graph_report("commit-graph parent for %s is %s != %s",
> +                                            oid_to_hex(&cur_oid),
> +                                            oid_to_hex(&graph_parents->item->object.oid),
> +                                            oid_to_hex(&odb_parents->item->object.oid));
> +
> +                       graph_parents = graph_parents->next;
> +                       odb_parents = odb_parents->next;
> +               }
> +
> +               if (odb_parents != NULL)
> +                       graph_report("commit-graph parent list for commit %s terminates early",
> +                                    oid_to_hex(&cur_oid));
> +
> +               if (graph_commit->generation) {
> +                       uint32_t max_generation = 0;
> +                       graph_parents = graph_commit->parents;
> +
> +                       while (graph_parents) {
> +                               if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
> +                                   graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
> +                                       graph_report("commit-graph has valid generation for %s but not its parent, %s",
> +                                                    oid_to_hex(&cur_oid),
> +                                                    oid_to_hex(&graph_parents->item->object.oid));
> +                               if (graph_parents->item->generation > max_generation)
> +                                       max_generation = graph_parents->item->generation;
> +                               graph_parents = graph_parents->next;
> +                       }
> +
> +                       if (max_generation == GENERATION_NUMBER_MAX)
> +                               max_generation--;

I'm not too familiar with these concepts. Is this a trick in preparation
for this:

> +
> +                       if (graph_commit->generation != max_generation + 1)

Any way that could give a false negative? (I'm not sure it would matter
much.) Maybe "if (!MAX && generation != max + 1)".

> +                               graph_report("commit-graph has incorrect generation for %s",
> +                                            oid_to_hex(&cur_oid));
> +               } else {
> +                       graph_parents = graph_commit->parents;
> +
> +                       while (graph_parents) {
> +                               if (graph_parents->item->generation)
> +                                       graph_report("commit-graph has generation ZERO for %s but not its parent, %s",
> +                                                    oid_to_hex(&cur_oid),
> +                                                    oid_to_hex(&graph_parents->item->object.oid));
> +                               graph_parents = graph_parents->next;
> +                       }
> +               }
> +       }
> +
>         return verify_commit_graph_error;
>  }

At this point, I should admit that I went through the above thinking
"right, makes sense, ok, sure". I was not really going "hmm, I wonder
..." This looks like the real meat of "verify", and I'll try to look it
over with a fresh pair of eyes tomorrow.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh

> +       corrupt_data $objdir/info/commit-graph 1134 "\01" &&

> +       corrupt_data $objdir/info/commit-graph 1312 "\01" &&

> +       corrupt_data $objdir/info/commit-graph 1332 "\01" &&

> +       corrupt_data $objdir/info/commit-graph 1712 "\01" &&

> +       corrupt_data $objdir/info/commit-graph 1340 "\01" &&
> +       corrupt_data $objdir/info/commit-graph 1344 "\01" &&

Could you document these numbers somehow? (Maybe even calculate them
from constant inputs, although that might be a form of premature
optimization.) When some poor soul has to derive the corresponding
numbers for a commit-graph with NewHash, they will thank you.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand
  2018-05-12 13:31       ` Martin Ågren
@ 2018-05-14 13:27         ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-14 13:27 UTC (permalink / raw)
  To: Martin Ågren, Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 5/12/2018 9:31 AM, Martin Ågren wrote:
> On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
>
>>          graph_name = get_commit_graph_filename(opts.obj_dir);
>>          graph = load_commit_graph_one(graph_name);
>> +       FREE_AND_NULL(graph_name);
>>
>>          if (!graph)
>>                  die("graph file %s does not exist", graph_name);
>> -       FREE_AND_NULL(graph_name);
> This is probably because of something I said, but this does not look
> correct. The `die()` would typically print "(null)" or segfault. If the
> `die()` means we don't free `graph_name`, that should be fine.
>
> You're still leaking `graph` here (possibly nothing this patch should
> worry about) and in `graph_verify()`. UNLEAK-ing it immediately before
> calling `verify_commit_graph()` should be ok. I also think punting on
> this UNLEAK-business entirely would be ok; I was just a bit surprised to
> see one variable getting freed and the other one ignored.

Thanks, Martin. I was just blindly searching for FREE_AND_NULL() and 
shouldn't have been so careless.

-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 02/12] commit-graph: verify file header information
  2018-05-12 13:35       ` Martin Ågren
@ 2018-05-14 13:31         ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-14 13:31 UTC (permalink / raw)
  To: Martin Ågren, Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 5/12/2018 9:35 AM, Martin Ågren wrote:
> +static int verify_commit_graph_error;
> +
> +static void graph_report(const char *fmt, ...)
> +{
> +       va_list ap;
> +       struct strbuf sb = STRBUF_INIT;
> +       verify_commit_graph_error = 1;
> +
> +       va_start(ap, fmt);
> +       strbuf_vaddf(&sb, fmt, ap);
> +
> +       fprintf(stderr, "%s\n", sb.buf);
> +       strbuf_release(&sb);
> +       va_end(ap);
> +}

That's a good idea. Makes that patch a bit less trivial and this one a 
bit less difficult.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb
  2018-05-12 21:17       ` Martin Ågren
@ 2018-05-14 13:44         ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-14 13:44 UTC (permalink / raw)
  To: Martin Ågren, Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 5/12/2018 5:17 PM, Martin Ågren wrote:
> On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
>> When running 'git commit-graph verify', compare the contents of the
>> commits that are loaded from the commit-graph file with commits that are
>> loaded directly from the object database. This includes checking the
>> root tree object ID, commit date, and parents.
>>
>> Parse the commit from the graph during the initial loop through the
>> object IDs to guarantee we parse from the commit-graph file.
>>
>> In addition, verify the generation number calculation is correct for all
>> commits in the commit-graph file.
>>
>> While testing, we discovered that mutating the integer value for a
>> parent to be outside the accepted range causes a segmentation fault. Add
>> a new check in insert_parent_or_die() that prevents this fault. Check
>> for that error during the test, both in the typical parents and in the
>> list of parents for octopus merges.
> This paragraph and the corresponding fix and test feel like a separate
> patch to me. (The commit message of it could be "To test the next patch,
> we threw invalid data at `git commit-graph verify, and it crashed in
> pre-existing code, so let's fix that first" -- there is definitely a
> connection.) Is this important enough to fast-track to master in time
> for 2.18? My guess would be no.
>
>> +
>> +       if (pos >= g->num_commits)
>> +               die("invalide parent position %"PRIu64, pos);
> s/invalide/invalid/
>
>> @@ -875,6 +879,8 @@ int verify_commit_graph(struct commit_graph *g)
>>                  return 1;
>>
>>          for (i = 0; i < g->num_commits; i++) {
>> +               struct commit *graph_commit;
>> +
>>                  hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>>
>>                  if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
>> @@ -892,6 +898,10 @@ int verify_commit_graph(struct commit_graph *g)
>>
>>                          cur_fanout_pos++;
>>                  }
>> +
>> +               graph_commit = lookup_commit(&cur_oid);
>> +               if (!parse_commit_in_graph_one(g, graph_commit))
>> +                       graph_report("failed to parse %s from commit-graph", oid_to_hex(&cur_oid));
>>          }
> Could this end up giving ridiculous amounts of output? It would depend
> on the input, I guess.
>
>>          while (cur_fanout_pos < 256) {
>> @@ -904,5 +914,95 @@ int verify_commit_graph(struct commit_graph *g)
>>                  cur_fanout_pos++;
>>          }
>>
>> +       if (verify_commit_graph_error)
>> +               return 1;
> Well, here we give up before running into *too* much problem.
>
>> +       for (i = 0; i < g->num_commits; i++) {
>> +               struct commit *graph_commit, *odb_commit;
>> +               struct commit_list *graph_parents, *odb_parents;
>> +               int num_parents = 0;
>> +
>> +               hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>> +
>> +               graph_commit = lookup_commit(&cur_oid);
>> +               odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
>> +               if (parse_commit_internal(odb_commit, 0, 0)) {
>> +                       graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid));
>> +                       continue;
>> +               }
>> +
>> +               if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
>> +                          get_commit_tree_oid(odb_commit)))
>> +                       graph_report("root tree object ID for commit %s in commit-graph is %s != %s",
>> +                                    oid_to_hex(&cur_oid),
>> +                                    oid_to_hex(get_commit_tree_oid(graph_commit)),
>> +                                    oid_to_hex(get_commit_tree_oid(odb_commit)));
>> +
>> +               if (graph_commit->date != odb_commit->date)
>> +                       graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"",
>> +                                    oid_to_hex(&cur_oid),
>> +                                    graph_commit->date,
>> +                                    odb_commit->date);
>> +
>> +
>> +               graph_parents = graph_commit->parents;
>> +               odb_parents = odb_commit->parents;
>> +
>> +               while (graph_parents) {
>> +                       num_parents++;
>> +
>> +                       if (odb_parents == NULL)
>> +                               graph_report("commit-graph parent list for commit %s is too long (%d)",
>> +                                            oid_to_hex(&cur_oid),
>> +                                            num_parents);
>> +
>> +                       if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
>> +                               graph_report("commit-graph parent for %s is %s != %s",
>> +                                            oid_to_hex(&cur_oid),
>> +                                            oid_to_hex(&graph_parents->item->object.oid),
>> +                                            oid_to_hex(&odb_parents->item->object.oid));
>> +
>> +                       graph_parents = graph_parents->next;
>> +                       odb_parents = odb_parents->next;
>> +               }
>> +
>> +               if (odb_parents != NULL)
>> +                       graph_report("commit-graph parent list for commit %s terminates early",
>> +                                    oid_to_hex(&cur_oid));
>> +
>> +               if (graph_commit->generation) {
>> +                       uint32_t max_generation = 0;
>> +                       graph_parents = graph_commit->parents;
>> +
>> +                       while (graph_parents) {
>> +                               if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
>> +                                   graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
>> +                                       graph_report("commit-graph has valid generation for %s but not its parent, %s",
>> +                                                    oid_to_hex(&cur_oid),
>> +                                                    oid_to_hex(&graph_parents->item->object.oid));
>> +                               if (graph_parents->item->generation > max_generation)
>> +                                       max_generation = graph_parents->item->generation;
>> +                               graph_parents = graph_parents->next;
>> +                       }
>> +
>> +                       if (max_generation == GENERATION_NUMBER_MAX)
>> +                               max_generation--;
> I'm not too familiar with these concepts. Is this a trick in preparation
> for this:
>
>> +
>> +                       if (graph_commit->generation != max_generation + 1)
> Any way that could give a false negative? (I'm not sure it would matter
> much.) Maybe "if (!MAX && generation != max + 1)".

You're right that this is confusing. Seems worth a comment.

When we have a commit-graph with computed generation numbers, the 
generation number for a commit is exactly one more than the maximum 
generation number of its parents EXCEPT in the case that we would have a 
generation number too large to store in the commit-graph. In this case 
(that is not currently possible with any repo in existence) we "squash" 
the generation number at GENERATION_NUMBER_MAX, so we have equality 
between the generation at the commit and the generation of a parent.

>
>> +                               graph_report("commit-graph has incorrect generation for %s",
>> +                                            oid_to_hex(&cur_oid));
>> +               } else {
>> +                       graph_parents = graph_commit->parents;
>> +
>> +                       while (graph_parents) {
>> +                               if (graph_parents->item->generation)
>> +                                       graph_report("commit-graph has generation ZERO for %s but not its parent, %s",
>> +                                                    oid_to_hex(&cur_oid),
>> +                                                    oid_to_hex(&graph_parents->item->object.oid));
>> +                               graph_parents = graph_parents->next;
>> +                       }
>> +               }
>> +       }
>> +
>>          return verify_commit_graph_error;
>>   }
> At this point, I should admit that I went through the above thinking
> "right, makes sense, ok, sure". I was not really going "hmm, I wonder
> ..." This looks like the real meat of "verify", and I'll try to look it
> over with a fresh pair of eyes tomorrow.

I appreciate the level of inspection you are giving this series!

>
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> +       corrupt_data $objdir/info/commit-graph 1134 "\01" &&
>> +       corrupt_data $objdir/info/commit-graph 1312 "\01" &&
>> +       corrupt_data $objdir/info/commit-graph 1332 "\01" &&
>> +       corrupt_data $objdir/info/commit-graph 1712 "\01" &&
>> +       corrupt_data $objdir/info/commit-graph 1340 "\01" &&
>> +       corrupt_data $objdir/info/commit-graph 1344 "\01" &&
> Could you document these numbers somehow? (Maybe even calculate them
> from constant inputs, although that might be a form of premature
> optimization.) When some poor soul has to derive the corresponding
> numbers for a commit-graph with NewHash, they will thank you.

Yeah, this is a bit of a mess. The arithmetic to get these numbers is 
not hard, so I could do that math in the script (at run time instead of 
at dev time).

I also realize that the way I write the commit-graph to be "fixed" 
before these tests is not so fixed: if a new ref is added, then the 
tests break. I'll update the commit-graph to be reachable from commits/8 
so we have a more concrete example:

git rev-parse commits/8 | git commit-graph write --stdin-commits

(This already changes the written commit-graph because of the 'git pull' 
test, so it will "help" me be sure my math is correct when recomputing 
the new offsets for the corruption tests.)

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 08/12] commit-graph: verify commit contents against odb
  2018-05-11 21:15     ` [PATCH v2 08/12] commit-graph: verify commit contents against odb Derrick Stolee
  2018-05-12 21:17       ` Martin Ågren
@ 2018-05-15 21:12       ` Martin Ågren
  1 sibling, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-15 21:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:

I finally sat down today and familiarized myself with the commit-graph
code a little. My biggest surprise was when I noticed that there is a
hash checksum at the end of the commit-graph-file. That in combination
with the tests where you flip some bytes...

It turns out, if my reading is right, that the hash value is written as
the commit-graph is generated, but that it is not verified as the
commit-graph is later read. I could not find any mention of your plans
here -- I understand why you would not want to verify the hash in
`load_commit_graph_one()`, at least not in every run. Anyway, this is
just an observation. Verifying the hash would affect the tests this
series adds. They might need to rewrite the hash or set some magic
environment variable. :-/ But that's for another day.

> +       for (i = 0; i < g->num_commits; i++) {
> +               struct commit *graph_commit, *odb_commit;
> +               struct commit_list *graph_parents, *odb_parents;
> +               int num_parents = 0;
> +               hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);

`num_commits` was derived as the commit-graph was loaded. It was derived
from offsets which were verified to be in the mmap-ed memory. So this
source address is guaranteed to be so, as well. Ok.

(Once brian's latest series hits master, this could use `oidread(...)`.)

> +               graph_commit = lookup_commit(&cur_oid);

Do we know this comes from the graph? Even with a more-or-less-messed-up
commit graph? See below.

> +               odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
> +               if (parse_commit_internal(odb_commit, 0, 0)) {
> +                       graph_report("failed to parse %s from object database", oid_to_hex(&cur_oid));
> +                       continue;
> +               }
> +
> +               if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
> +                          get_commit_tree_oid(odb_commit)))

`get_commit_tree_in_graph_one()` will BUG rather than return NULL. So
this will not dereference NULL. Good. But might it hit the BUG? That is,
can we trust the commit coming out of `lookup_commit()` not to have
`graph_pos == COMMIT_NOT_FROM_GRAPH`?

> +                       graph_report("root tree object ID for commit %s in commit-graph is %s != %s",
> +                                    oid_to_hex(&cur_oid),
> +                                    oid_to_hex(get_commit_tree_oid(graph_commit)),
> +                                    oid_to_hex(get_commit_tree_oid(odb_commit)));
> +
> +               if (graph_commit->date != odb_commit->date)
> +                       graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime"",
> +                                    oid_to_hex(&cur_oid),
> +                                    graph_commit->date,
> +                                    odb_commit->date);
> +
> +

(Extra blank line?)

> +               graph_parents = graph_commit->parents;
> +               odb_parents = odb_commit->parents;
> +
> +               while (graph_parents) {
> +                       num_parents++;
> +
> +                       if (odb_parents == NULL)
> +                               graph_report("commit-graph parent list for commit %s is too long (%d)",
> +                                            oid_to_hex(&cur_oid),
> +                                            num_parents);
> +
> +                       if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
> +                               graph_report("commit-graph parent for %s is %s != %s",
> +                                            oid_to_hex(&cur_oid),
> +                                            oid_to_hex(&graph_parents->item->object.oid),
> +                                            oid_to_hex(&odb_parents->item->object.oid));
> +
> +                       graph_parents = graph_parents->next;
> +                       odb_parents = odb_parents->next;
> +               }
> +
> +               if (odb_parents != NULL)
> +                       graph_report("commit-graph parent list for commit %s terminates early",
> +                                    oid_to_hex(&cur_oid));

Ok, ensure the lists are equally long and compare the entries.

> +
> +               if (graph_commit->generation) {

If the commit has a generation number (not an old commit graph)...

> +                       uint32_t max_generation = 0;
> +                       graph_parents = graph_commit->parents;
> +
> +                       while (graph_parents) {
> +                               if (graph_parents->item->generation == GENERATION_NUMBER_ZERO ||
> +                                   graph_parents->item->generation == GENERATION_NUMBER_INFINITY)
> +                                       graph_report("commit-graph has valid generation for %s but not its parent, %s",
> +                                                    oid_to_hex(&cur_oid),
> +                                                    oid_to_hex(&graph_parents->item->object.oid));

... then it's odd if a parent has no generation number or is not in the
graph file at all.

> +                               if (graph_parents->item->generation > max_generation)
> +                                       max_generation = graph_parents->item->generation;
> +                               graph_parents = graph_parents->next;
> +                       }
> +
> +                       if (max_generation == GENERATION_NUMBER_MAX)
> +                               max_generation--;
> +
> +                       if (graph_commit->generation != max_generation + 1)
> +                               graph_report("commit-graph has incorrect generation for %s",
> +                                            oid_to_hex(&cur_oid));

Ok. Thanks for considering adding a comment.

> +               } else {
> +                       graph_parents = graph_commit->parents;
> +
> +                       while (graph_parents) {
> +                               if (graph_parents->item->generation)
> +                                       graph_report("commit-graph has generation ZERO for %s but not its parent, %s",
> +                                                    oid_to_hex(&cur_oid),
> +                                                    oid_to_hex(&graph_parents->item->object.oid));
> +                               graph_parents = graph_parents->next;
> +                       }

Right.

> +               }
> +       }
> +
>         return verify_commit_graph_error;
>  }

Nice. My question about `get_commit_tree_in_graph_one()` might be me
making things much harder than they are. I sort of suspect that your
usage will be quite obviously safe in retrospect.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 09/12] fsck: verify commit-graph
  2018-05-11 21:15     ` [PATCH v2 09/12] fsck: verify commit-graph Derrick Stolee
@ 2018-05-17 18:13       ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-17 18:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
> If core.commitGraph is true, verify the contents of the commit-graph
> during 'git fsck' using the 'git commit-graph verify' subcommand. Run
> this check on all alternates, as well.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 5ab268a024..91c8406d97 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' '
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
>  graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
>
> +test_expect_success 'build graph using --reachable' '
> +       cd "$TRASH_DIRECTORY/full" &&
> +       git commit-graph write --reachable &&
> +       test_path_is_file $objdir/info/commit-graph &&
> +       graph_read_expect "11" "large_edges"
> +'

This should be in the next patch.

> +graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> +graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2

(Possibly the same here.)

>  test_expect_success 'setup bare repo' '
>         cd "$TRASH_DIRECTORY" &&
>         git clone --bare --no-local full bare &&
> @@ -335,7 +345,7 @@ test_expect_success 'detect OID not in object database' '
>         cd "$TRASH_DIRECTORY/full" &&
>         cp $objdir/info/commit-graph commit-graph-backup &&
>         test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> -       corrupt_data $objdir/info/commit-graph 1134 "\01" &&
> +       corrupt_data $objdir/info/commit-graph 1134 "\00" &&

This and two similar ones as well, I guess.

Actually, I can drop them altogether and the tests still pass. Rebase
mishap?

> +test_expect_success 'git fsck (checks commit-graph)' '
> +       cd "$TRASH_DIRECTORY/full" &&
> +       git fsck
> +'

Maybe inject an error and verify that `git fsck` does indeed catch it,
i.e., it does call out to check the commit-graph.

Maybe also a run with `-c core.commitGraph=no` where the error should
not be found because the commit-graph should not be checked?

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 10/12] commit-graph: add '--reachable' option
  2018-05-11 21:15     ` [PATCH v2 10/12] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-05-17 18:16       ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-17 18:16 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
> When writing commit-graph files, it can be convenient to ask for all
> reachable commits (starting at the ref set) in the resulting file. This
> is particularly helpful when writing to stdin is complicated, such as a
> future integration with 'git gc' which will call
> 'git commit-graph write --reachable' after performing cleanup of the
> object database.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt |  8 ++++++--
>  builtin/commit-graph.c             | 41 ++++++++++++++++++++++++++++++++++----
>  2 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index a222cfab08..cc1715a823 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
>  +
>  With the `--stdin-packs` option, generate the new commit graph by
>  walking objects only in the specified pack-indexes. (Cannot be combined
> -with --stdin-commits.)
> +with --stdin-commits or --reachable.)

You could enclose --reachable in `...` for nicer rendering and fix
--stdin-commits as well while you're here.

>  With the `--stdin-commits` option, generate the new commit graph by
>  walking commits starting at the commits specified in stdin as a list
>  of OIDs in hex, one OID per line. (Cannot be combined with
> ---stdin-packs.)
> +--stdin-packs or --reachable.)

Ditto.

> +With the `--reachable` option, generate the new commit graph by walking
> +commits starting at all refs. (Cannot be combined with --stdin-commits
> +or --stind-packs.)

Ditto. Also, s/stind/stdin/.

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 11/12] gc: automatically write commit-graph files
  2018-05-11 21:15     ` [PATCH v2 11/12] gc: automatically write commit-graph files Derrick Stolee
@ 2018-05-17 18:20       ` Martin Ågren
  0 siblings, 0 replies; 149+ messages in thread
From: Martin Ågren @ 2018-05-17 18:20 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, jnareb, avarab, peff

On 11 May 2018 at 23:15, Derrick Stolee <dstolee@microsoft.com> wrote:
> The commit-graph file is a very helpful feature for speeding up git
> operations. In order to make it more useful, write the commit-graph file
> by default during standard garbage collection operations.

So does it really write by default...

> Add a 'gc.commitGraph' config setting that triggers writing a
> commit-graph file after any non-trivial 'git gc' command. Defaults to
> false while the commit-graph feature matures. We specifically do not

or not...? I guess the first paragraph has simply been there since
before you changed your mind about the default?

> want to turn this on by default until the commit-graph feature is fully
> integrated with history-modifying features like shallow clones.

So if someone would turn this on with a shallow clone, ... Do we want
some note (warning?) around that in the user documentation?

Martin

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 01/12] commit-graph: add 'verify' subcommand
  2018-05-11 21:15     ` [PATCH v2 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
  2018-05-12 13:31       ` Martin Ågren
@ 2018-05-20 12:10       ` Jakub Narebski
  1 sibling, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-05-20 12:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, avarab\, martin.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> If the commit-graph file becomes corrupt, we need a way to verify
> that its contents match the object database. In the manner of
> 'git fsck' we will implement a 'git commit-graph verify' subcommand
> to report all issues with the file.
>
> Add the 'verify' subcommand to the 'commit-graph' builtin and its
> documentation. The subcommand is currently a no-op except for
> loading the commit-graph into memory, which may trigger run-time
> errors that would be caught by normal use. Add a simple test that
> ensures the command returns a zero error code.
>
> If no commit-graph file exists, this is an acceptable state. Do
> not report any errors.

All right.  Nice introductory patch.

>
> During review, we noticed that a FREE_AND_NULL(graph_name) was
> placed after a possible 'return', and this pattern was also in
> graph_read(). Fix that case, too.

This should probably be a separate [micro-]patch.  Especially as Martin
Ågren noticed it is not correct...

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt |  6 ++++++
>  builtin/commit-graph.c             | 40 +++++++++++++++++++++++++++++++++++++-
>  commit-graph.c                     |  5 +++++
>  commit-graph.h                     |  2 ++
>  t/t5318-commit-graph.sh            | 10 ++++++++++
>  5 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 4c97b555cc..a222cfab08 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -10,6 +10,7 @@ SYNOPSIS
>  --------
>  [verse]
>  'git commit-graph read' [--object-dir <dir>]
> +'git commit-graph verify' [--object-dir <dir>]
>  'git commit-graph write' <options> [--object-dir <dir>]
>  
>  
> @@ -52,6 +53,11 @@ existing commit-graph file.
>  Read a graph file given by the commit-graph file and output basic
>  details about the graph file. Used for debugging purposes.
>  
> +'verify'::
> +
> +Read the commit-graph file and verify its contents against the object
> +database. Used to check for corrupted data.

I wonder if it would be useful to have an option to verify commit-graph
file without accessing the object database, checking just that it is
well formed.

Anyway, it could be added later, if needed.

> +
>  
>  EXAMPLES
>  --------
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 37420ae0fd..af3101291f 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -8,10 +8,16 @@
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--object-dir <objdir>]"),
>  	N_("git commit-graph read [--object-dir <objdir>]"),
> +	N_("git commit-graph verify [--object-dir <objdir>]"),
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
>  	NULL
>  };
>  
> +static const char * const builtin_commit_graph_verify_usage[] = {
> +	N_("git commit-graph verify [--object-dir <objdir>]"),
> +	NULL
> +};
> +
>  static const char * const builtin_commit_graph_read_usage[] = {
>  	N_("git commit-graph read [--object-dir <objdir>]"),
>  	NULL
> @@ -29,6 +35,36 @@ static struct opts_commit_graph {
>  	int append;
>  } opts;
>  
> +
> +static int graph_verify(int argc, const char **argv)

A reminder for myself: exit code 0 means no errors.

> +{
> +	struct commit_graph *graph = 0;
> +	char *graph_name;
> +
> +	static struct option builtin_commit_graph_verify_options[] = {
> +		OPT_STRING(0, "object-dir", &opts.obj_dir,
> +			   N_("dir"),
> +			   N_("The object directory to store the graph")),
> +		OPT_END(),
> +	};
> +
> +	argc = parse_options(argc, argv, NULL,
> +			     builtin_commit_graph_verify_options,
> +			     builtin_commit_graph_verify_usage, 0);
> +
> +	if (!opts.obj_dir)
> +		opts.obj_dir = get_object_directory();

All right, simple handling of a subcommand and its options.


I still think that '--object-dir=<path>' should be a git wrapper option,
like '--git-dir=<path>' and '--work-tree=<path>' (and
'--namespace=<path>') are.  It would be command-line option equivalent
to the GIT_OBJECT_DIRECTORY environment variable, just like
--git-dir=<path> is for GIT_DIR, and --work-tree=<path> is for
GIT_WORK_TREE, etc.

This way the code would be implemented once for all commands, and there
would be no duplicated code for each git-commit-graph subcommand.

But that may be a matter of a separate patch.

> +
> +	graph_name = get_commit_graph_filename(opts.obj_dir);

This returns full path of commit-graph file, allocating it.

> +	graph = load_commit_graph_one(graph_name);

This reads the file (no checking if core.commitGraph is set, no handling
of alternatives), and verifies that:
 - file is not too small, i.e. smaller than GRAPH_MIN_SIZE
 - it has correct signature
 - it has correct graph version
 - it has correct hash version
 - chunk [offsets] fit within file
 - that OID Fanout, OID Lookup, Commit Data and Large Edges chunks are
   not repeated, though not that all required chunks are present

> +	FREE_AND_NULL(graph_name);

All right, graph_name is not used further, so we free it and set it to
NULL.

Note however that if load_commit_graph_one() finds errors, it would exit
with error code 1... but then exiting the process would free its
resources anyway.

> +
> +	if (!graph)
> +		return 0;

The only case where load_commit_graph_one() returns NULL instead of
exiting is if the file cannot be opened for reading (e.g. it does not
exists, or cannot be opened for reading), or its status cannot be read.

Side question: this is nice defensive programming, but could it happen
that file can be opened for reading, but its status cannot be read?

> +
> +	return verify_commit_graph(graph);

Here we leak graph->fd (it is neither unmapped, nor closed) but that may
not matter as we are exiting anyway.  Well, at least until it gets
libified, then maybe.

> +}
> +
>  static int graph_read(int argc, const char **argv)
>  {
>  	struct commit_graph *graph = NULL;
> @@ -50,10 +86,10 @@ static int graph_read(int argc, const char **argv)
>  
>  	graph_name = get_commit_graph_filename(opts.obj_dir);
>  	graph = load_commit_graph_one(graph_name);
> +	FREE_AND_NULL(graph_name);
>  
>  	if (!graph)
>  		die("graph file %s does not exist", graph_name);
> -	FREE_AND_NULL(graph_name);

Here we use graph_name, which has been just freed and set to NULL.

This should probably be (but I may be wrong):


	if (!graph) {
		UNLEAK(graph_name);
		die("graph file %s does not exist", graph_name);
	}
	FREE_AND_NULL(graph_name);


>  
>  	printf("header: %08x %d %d %d %d\n",
>  		ntohl(*(uint32_t*)graph->data),
> @@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			     PARSE_OPT_STOP_AT_NON_OPTION);
>  
>  	if (argc > 0) {
> +		if (!strcmp(argv[0], "verify"))
> +			return graph_verify(argc, argv);
>  		if (!strcmp(argv[0], "read"))
>  			return graph_read(argc, argv);
>  		if (!strcmp(argv[0], "write"))

All right, straightforward adding of a new subcommand.

> diff --git a/commit-graph.c b/commit-graph.c
> index a8c337dd77..b25aaed128 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -817,3 +817,8 @@ void write_commit_graph(const char *obj_dir,
>  	oids.alloc = 0;
>  	oids.nr = 0;
>  }
> +
> +int verify_commit_graph(struct commit_graph *g)
> +{
> +	return !g;
> +}

All right, nice placeholder.

> diff --git a/commit-graph.h b/commit-graph.h
> index 96cccb10f3..71a39c5a57 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
>  			int nr_commits,
>  			int append);
>  
> +int verify_commit_graph(struct commit_graph *g);
> +
>  #endif
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 77d85aefe7..6ca451dfd2 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
>  	objdir=".git/objects"
>  '
>  
> +test_expect_success 'verify graph with no graph file' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit-graph verify
> +'

Nice to have this test here, but as it is now it is not an independent
test.  It depends on the fact that other earlier tests did not generate
graph file.

Perhaps it would be better to have a separate directory with repository
without commit graph file, and separate directory with repository with
commit graph file.  Or rename commit graph file if it exists, renaming
it back after tests finishes.  Or something like that.

> +
>  test_expect_success 'write graph with no packs' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write --object-dir . &&
> @@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
>  	test_cmp expect output
>  '
>  
> +test_expect_success 'git commit-graph verify' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit-graph verify >output
> +'

This is amost the same tests as the one added earlier, the only
difference is its place in t/t5318-commit-graph.sh.  This test is not
independent.

Though I'm not sure if that would matter much.

> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 02/12] commit-graph: verify file header information
  2018-05-11 21:15     ` [PATCH v2 02/12] commit-graph: verify file header information Derrick Stolee
  2018-05-12 13:35       ` Martin Ågren
@ 2018-05-20 20:00       ` Jakub Narebski
  1 sibling, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-05-20 20:00 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, avarab\, martin.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> During a run of 'git commit-graph verify', list the issues with the
> header information in the commit-graph file. Some of this information
> is inferred from the loaded 'struct commit_graph'. Some header
> information is checked as part of load_commit_graph_one().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index b25aaed128..d2db20e49a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -818,7 +818,37 @@ void write_commit_graph(const char *obj_dir,
>  	oids.nr = 0;
>  }
>  
> +static int verify_commit_graph_error;
> +
> +static void graph_report(const char *fmt, ...)
> +{
> +	va_list ap;
> +	struct strbuf sb = STRBUF_INIT;
> +	verify_commit_graph_error = 1;
> +
> +	va_start(ap, fmt);
> +	strbuf_vaddf(&sb, fmt, ap);
> +
> +	fprintf(stderr, "%s\n", sb.buf);
> +	strbuf_release(&sb);
> +	va_end(ap);
> +}
> +
>  int verify_commit_graph(struct commit_graph *g)
>  {
> -	return !g;
> +	if (!g) {
> +		graph_report("no commit-graph file loaded");
> +		return 1;
> +	}

I won't be repeating what Martin said, but I agree with it.  Well, that
or make it a separate patch.

> +
> +	verify_commit_graph_error = 0;
> +

A quick reminder for myself.  The load_commit_graph_one() that is used
to fill the commit_graph parameter alreaady verifies that:
 - file is not too small, i.e. smaller than GRAPH_MIN_SIZE
 - it has correct signature
 - it has correct graph version
 - it has correct hash version
 - chunks [offsets] all fit within file
 - that OID Fanout, OID Lookup, Commit Data and Large Edges chunks are
   not repeated, though not that all required chunks are present

> +	if (!g->chunk_oid_fanout)
> +		graph_report("commit-graph is missing the OID Fanout chunk");
> +	if (!g->chunk_oid_lookup)
> +		graph_report("commit-graph is missing the OID Lookup chunk");
> +	if (!g->chunk_commit_data)
> +		graph_report("commit-graph is missing the Commit Data chunk");

This one checks that all chunks that needs to be present are present.
Nice.

There are a few more things that we can check about CHUNK LOOKUP part.
For example we would detect if file was truncated because the offset of
some chunk would be pointing outside the file... unless the truncation
falls within the last chunk.  We don't check that terminating label
(chunk "\0\0\0\0" offset) is outside file, I think.

We also don't check that positions of subsequent chunks are sorted
(increasing offsets), so that each chunk length is positive.


I also wonder if we shouldn't at least _warn_ about unknown chunks.

> +
> +	return verify_commit_graph_error;
>  }

Nice trick to be able to check as much as possible without segfaulting,
while still returning correct error result.


Testing newly intruduced functionality would be hard, unless relying on
hand-crafted files, or on some helper to produce invalid commit-graph
files.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
  2018-05-11 21:15     ` [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption Derrick Stolee
  2018-05-12 13:43       ` Martin Ågren
@ 2018-05-21 18:53       ` Jakub Narebski
  2018-05-24 16:28         ` Derrick Stolee
  1 sibling, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-21 18:53 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Martin Ågren,
	Jeff King, Nguyễn Thái Ngọc Duy

Derrick Stolee <dstolee@microsoft.com> writes:

> Add test cases to t5318-commit-graph.sh that corrupt the commit-graph
> file and check that the 'git commit-graph verify' command fails. These
> tests verify the header and chunk information is checked carefully.
>
> Helped-by: Martin Ågren <martin.agren@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t5318-commit-graph.sh | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 6ca451dfd2..0cb88232fa 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -240,4 +240,57 @@ test_expect_success 'git commit-graph verify' '
>  	git commit-graph verify >output
>  '
>  
> +# usage: corrupt_data <file> <pos> [<data>]
> +corrupt_data() {
> +	file=$1
> +	pos=$2
> +	data="${3:-\0}"
> +	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
> +}

First, if we do this that way (and not by adding a test helper), the use
of this function should be, I think, protected using appropriate test
prerequisite.  Not everyone has 'dd' tool installed, for example on
MS Windows.

Second, the commit-graph file format has H-byte HASH-checksum of all of
the contents excluding checksum trailer.  It feels like any corruption
should have been caught by checksum test; thus to actually test that
contents is verified we should adjust checksum too, e.g. with sha1sum if
available or with test helper... oh, actually we have t/helper/test-sha1.
Unfortulately, it looks like it has no docs (beside commit message).

> +
> +test_expect_success 'detect bad signature' '
> +	cd "$TRASH_DIRECTORY/full" &&

This 'cd' outside subshell and withou accompanying change back feels a
bit strange to me.

> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 0 "\0" &&

So 'CGPH' signature is currupted into '\0GPH'.

> +	test_must_fail git commit-graph verify 2>err &&
> +	grep -v "^\+" err > verify-errors &&

Minor nit: redirection should be cuddled to the file, i.e.:

  +	grep -v "^\+" err >verify-errors &&

A question: why do you filter-out lines starting with "+" here?

> +	test_line_count = 1 verify-errors &&
> +	grep "graph signature" verify-errors

If messages from 'git commit-graph verify' can be localized (are
translatable), then it should be i18n_grep, isn't it?

> +'
> +
> +test_expect_success 'detect bad version number' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 4 "\02" &&

All right, so we replace commit-graph format version 1 ("\01") with
version 2 ("\02").  First, why 2 and not 0?  Second, is "\02" portable?

> +	test_must_fail git commit-graph verify 2>err &&
> +	grep -v "^\+" err > verify-errors &&
> +	test_line_count = 1 verify-errors &&

The above three lines is common across all test cases; I wonder if it
would be possible to extract it into function, to avoid code
duplication.

> +	grep "graph version" verify-errors
> +'
> +
> +test_expect_success 'detect bad hash version' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 5 "\02" &&

All right, so we change / corrupt hash version from value of 1, which
means SHA-1, to value of 2... which would soon meen NewHash.  Why not
"\777" (i.e. 0xff)?

> +	test_must_fail git commit-graph verify 2>err &&
> +	grep -v "^\+" err > verify-errors &&
> +	test_line_count = 1 verify-errors &&
> +	grep "hash version" verify-errors
> +'


Note: all of the above tests check in load_commit_graph_one(), not the
one in verify_commit_graph().  Just FYI.

> +
> +test_expect_success 'detect too small chunk-count' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	corrupt_data $objdir/info/commit-graph 6 "\01" &&
> +	test_must_fail git commit-graph verify 2>err &&
> +	grep -v "^\+" err > verify-errors &&
> +	test_line_count = 2 verify-errors &&
> +	grep "missing the OID Lookup chunk" verify-errors &&
> +	grep "missing the Commit Data chunk" verify-errors

This feels too implementation specific.  We should have at least two
chunks missing (there are 3 required chunks, and number of chunks was
changed to 1), but commit-graph format specification does not state that
OID Fanout must be first, and thus it is two remaining required chunks
that would be missing.

> +'
> +
>  test_done

One test that I would like to see that 'git commit-grph verify'
correctly detects without crashing is if commit-graph file gets
truncated at various lengths: shorter than smallest possible
commit-graph file size, in the middle of fixed header, in the middle of
chunk lookup part, in the middle of chunk, just the trailer chopped off.

Best regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
  2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
                       ` (11 preceding siblings ...)
  2018-05-11 21:15     ` [PATCH v2 12/12] commit-graph: update design document Derrick Stolee
@ 2018-05-24 16:25     ` Derrick Stolee
  2018-05-24 16:25       ` [PATCH v3 01/20] commit-graph: UNLEAK before die() Derrick Stolee
                         ` (22 more replies)
  12 siblings, 23 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

Thanks for all the feedback on v2. I've tried to make this round's
review a bit easier by splitting up the commits into smaller pieces.
Also, the test script now has less boilerplate and uses variables and
clear arithmetic to explain which bytes are being modified.

One other change worth mentioning: in "commit-graph: add '--reachable'
option" I put the ref-iteration into a new external
'write_commit_graph_reachable()' method inside commit-graph.c. This
makes the 'gc: automatically write commit-graph files' a simpler change.

Thanks,
-Stolee

Derrick Stolee (20):
  commit-graph: UNLEAK before die()
  commit-graph: fix GRAPH_MIN_SIZE
  commit-graph: parse commit from chosen graph
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: add 'verify' subcommand
  commit-graph: verify catches corrupt signature
  commit-graph: verify required chunks are present
  commit-graph: verify corrupt OID fanout and lookup
  commit-graph: verify objects exist
  commit-graph: verify root tree OIDs
  commit-graph: verify parent list
  commit-graph: verify generation number
  commit-graph: verify commit date
  commit-graph: test for corrupted octopus edge
  commit-graph: verify contents match checksum
  fsck: verify commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/config.txt                 |   6 +
 Documentation/git-commit-graph.txt       |  14 +-
 Documentation/git-fsck.txt               |   3 +
 Documentation/git-gc.txt                 |   4 +
 Documentation/technical/commit-graph.txt |  22 ---
 builtin/commit-graph.c                   |  59 +++++++-
 builtin/fsck.c                           |  21 +++
 builtin/gc.c                             |   6 +
 commit-graph.c                           | 234 +++++++++++++++++++++++++++++--
 commit-graph.h                           |   3 +
 commit.c                                 |   9 +-
 commit.h                                 |   1 +
 t/t5318-commit-graph.sh                  | 196 ++++++++++++++++++++++++++
 13 files changed, 539 insertions(+), 39 deletions(-)


base-commit: 34fdd433396ee0e3ef4de02eb2189f8226eafe4e
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 01/20] commit-graph: UNLEAK before die()
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-24 22:47         ` Stefan Beller
  2018-05-24 16:25       ` [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE Derrick Stolee
                         ` (21 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit-graph.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..f0875b8bf3 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv)
 	graph_name = get_commit_graph_filename(opts.obj_dir);
 	graph = load_commit_graph_one(graph_name);
 
-	if (!graph)
+	if (!graph) {
+		UNLEAK(graph_name);
 		die("graph file %s does not exist", graph_name);
+	}
+
 	FREE_AND_NULL(graph_name);
 
 	printf("header: %08x %d %d %d %d\n",
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  2018-05-24 16:25       ` [PATCH v3 01/20] commit-graph: UNLEAK before die() Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-26 18:46         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 03/20] commit-graph: parse commit from chosen graph Derrick Stolee
                         ` (20 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The GRAPH_MIN_SIZE macro should be the smallest size of a parsable
commit-graph file. However, the minimum number of chunks was wrong.
It is possible to write a commit-graph file with zero commits, and
that violates this macro's value.

Rewrite the macro, and use extra macros to better explain the magic
constants.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..82295f0975 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -33,10 +33,11 @@
 
 #define GRAPH_LAST_EDGE 0x80000000
 
+#define GRAPH_HEADER_SIZE 8
 #define GRAPH_FANOUT_SIZE (4 * 256)
 #define GRAPH_CHUNKLOOKUP_WIDTH 12
-#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
-			GRAPH_OID_LEN + 8)
+#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+			+ GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
 
 char *get_commit_graph_filename(const char *obj_dir)
 {
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 03/20] commit-graph: parse commit from chosen graph
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
  2018-05-24 16:25       ` [PATCH v3 01/20] commit-graph: UNLEAK before die() Derrick Stolee
  2018-05-24 16:25       ` [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-27 10:23         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 04/20] commit: force commit to parse from object database Derrick Stolee
                         ` (19 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

Before verifying a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 82295f0975..78ba0edc80 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-int parse_commit_in_graph(struct commit *item)
+static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
 	uint32_t pos;
 
@@ -318,9 +318,21 @@ int parse_commit_in_graph(struct commit *item)
 		return 0;
 	if (item->object.parsed)
 		return 1;
+
+	if (find_commit_in_graph(item, g, &pos))
+		return fill_commit_in_graph(item, g, pos);
+
+	return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
+	if (!core_commit_graph)
+		return 0;
+
 	prepare_commit_graph();
-	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
-		return fill_commit_in_graph(item, commit_graph, pos);
+	if (commit_graph)
+		return parse_commit_in_graph_one(commit_graph, item);
 	return 0;
 }
 
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 04/20] commit: force commit to parse from object database
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (2 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 03/20] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-27 18:04         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 05/20] commit-graph: load a root tree from specific graph Derrick Stolee
                         ` (18 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

In anticipation of verifying commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 +++++++--
 commit.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index 1d28677dfb..6eaed0174c 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
 	return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
 {
 	enum object_type type;
 	void *buffer;
@@ -403,7 +403,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (parse_commit_in_graph(item))
+	if (use_commit_graph && parse_commit_in_graph(item))
 		return 0;
 	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
 	if (!buffer)
@@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 	return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+	return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
 	if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 05/20] commit-graph: load a root tree from specific graph
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (3 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 04/20] commit: force commit to parse from object database Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-27 19:12         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 06/20] commit-graph: add 'verify' subcommand Derrick Stolee
                         ` (17 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 78ba0edc80..25893ec096 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -358,14 +358,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *
 	return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+						 const struct commit *c)
 {
 	if (c->maybe_tree)
 		return c->maybe_tree;
 	if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
-		BUG("get_commit_tree_in_graph called from non-commit-graph commit");
+		BUG("get_commit_tree_in_graph_one called from non-commit-graph commit");
+
+	return load_tree_for_commit(g, (struct commit *)c);
+}
 
-	return load_tree_for_commit(commit_graph, (struct commit *)c);
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+	return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 06/20] commit-graph: add 'verify' subcommand
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (4 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 05/20] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-27 22:55         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 07/20] commit-graph: verify catches corrupt signature Derrick Stolee
                         ` (16 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

If the commit-graph file becomes corrupt, we need a way to verify
that its contents match the object database. In the manner of
'git fsck' we will implement a 'git commit-graph verify' subcommand
to report all issues with the file.

Add the 'verify' subcommand to the 'commit-graph' builtin and its
documentation. The subcommand is currently a no-op except for
loading the commit-graph into memory, which may trigger run-time
errors that would be caught by normal use. Add a simple test that
ensures the command returns a zero error code.

If no commit-graph file exists, this is an acceptable state. Do
not report any errors.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  6 ++++++
 builtin/commit-graph.c             | 38 ++++++++++++++++++++++++++++++++++++++
 commit-graph.c                     | 26 ++++++++++++++++++++++++++
 commit-graph.h                     |  2 ++
 t/t5318-commit-graph.sh            | 10 ++++++++++
 5 files changed, 82 insertions(+)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 4c97b555cc..a222cfab08 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git commit-graph read' [--object-dir <dir>]
+'git commit-graph verify' [--object-dir <dir>]
 'git commit-graph write' <options> [--object-dir <dir>]
 
 
@@ -52,6 +53,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'verify'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 --------
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f0875b8bf3..0433dd6e20 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -8,10 +8,16 @@
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
+	N_("git commit-graph verify [--object-dir <objdir>]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
+static const char * const builtin_commit_graph_verify_usage[] = {
+	N_("git commit-graph verify [--object-dir <objdir>]"),
+	NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
 	int append;
 } opts;
 
+
+static int graph_verify(int argc, const char **argv)
+{
+	struct commit_graph *graph = 0;
+	char *graph_name;
+
+	static struct option builtin_commit_graph_verify_options[] = {
+		OPT_STRING(0, "object-dir", &opts.obj_dir,
+			   N_("dir"),
+			   N_("The object directory to store the graph")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_commit_graph_verify_options,
+			     builtin_commit_graph_verify_usage, 0);
+
+	if (!opts.obj_dir)
+		opts.obj_dir = get_object_directory();
+
+	graph_name = get_commit_graph_filename(opts.obj_dir);
+	graph = load_commit_graph_one(graph_name);
+	FREE_AND_NULL(graph_name);
+
+	if (!graph)
+		return 0;
+
+	return verify_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
 	struct commit_graph *graph = NULL;
@@ -163,6 +199,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (argc > 0) {
+		if (!strcmp(argv[0], "verify"))
+			return graph_verify(argc, argv);
 		if (!strcmp(argv[0], "read"))
 			return graph_read(argc, argv);
 		if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 25893ec096..55b41664ee 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -836,3 +836,29 @@ void write_commit_graph(const char *obj_dir,
 	oids.alloc = 0;
 	oids.nr = 0;
 }
+
+static int verify_commit_graph_error;
+
+static void graph_report(const char *fmt, ...)
+{
+	va_list ap;
+	struct strbuf sb = STRBUF_INIT;
+	verify_commit_graph_error = 1;
+
+	va_start(ap, fmt);
+	strbuf_vaddf(&sb, fmt, ap);
+
+	fprintf(stderr, "%s\n", sb.buf);
+	strbuf_release(&sb);
+	va_end(ap);
+}
+
+int verify_commit_graph(struct commit_graph *g)
+{
+	if (!g) {
+		graph_report("no commit-graph file loaded");
+		return 1;
+	}
+
+	return verify_commit_graph_error;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..71a39c5a57 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
 			int nr_commits,
 			int append);
 
+int verify_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..6ca451dfd2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
 	objdir=".git/objects"
 '
 
+test_expect_success 'verify graph with no graph file' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph verify
+'
+
 test_expect_success 'write graph with no packs' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git commit-graph write --object-dir . &&
@@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+test_expect_success 'git commit-graph verify' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph verify >output
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 07/20] commit-graph: verify catches corrupt signature
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (5 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 06/20] commit-graph: add 'verify' subcommand Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-28 14:05         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 08/20] commit-graph: verify required chunks are present Derrick Stolee
                         ` (15 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

This is the first of several commits that add a test to check that
'git commit-graph verify' catches corruption in the commit-graph
file. The first test checks that the command catches an error in
the file signature. This is a check that exists in the existing
commit-graph reading code.

Add a helper method 'corrupt_graph_and_verify' to the test script
t5318-commit-graph.sh. This helper corrupts the commit-graph file
at a certain location, runs 'git commit-graph verify', and reports
the output to the 'err' file. This data is filtered to remove the
lines added by 'test_must_fail' when the test is run verbosely.
Then, the output is checked to contain a specific error message.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5318-commit-graph.sh | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6ca451dfd2..bd64481c7a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+# the verify tests below expect the commit-graph to contain
+# exactly the commits reachable from the commits/8 branch.
+# If the file changes the set of commits in the list, then the
+# offsets into the binary file will result in different edits
+# and the tests will likely break.
+
 test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
+	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
 	git commit-graph verify >output
 '
 
+GRAPH_BYTE_VERSION=4
+GRAPH_BYTE_HASH=5
+
+# usage: corrupt_graph_and_verify <position> <data> <string>
+# Manipulates the commit-graph file at the position
+# by inserting the data, then runs 'git commit-graph verify'
+# and places the output in the file 'err'. Test 'err' for
+# the given string.
+corrupt_graph_and_verify() {
+	pos=$1
+	data="${2:-\0}"
+	grepstr=$3
+	cd "$TRASH_DIRECTORY/full" &&
+	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
+	cp $objdir/info/commit-graph commit-graph-backup &&
+	printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
+	test_must_fail git commit-graph verify 2>test_err &&
+	grep -v "^+" test_err >err
+	grep "$grepstr" err
+}
+
+test_expect_success 'detect bad signature' '
+	corrupt_graph_and_verify 0 "\0" \
+		"graph signature"
+'
+
+test_expect_success 'detect bad version' '
+	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
+		"graph version"
+'
+
+test_expect_success 'detect bad hash version' '
+	corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \
+		"hash version"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 08/20] commit-graph: verify required chunks are present
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (6 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 07/20] commit-graph: verify catches corrupt signature Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-28 17:11         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup Derrick Stolee
                         ` (14 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The commit-graph file requires the following three chunks:

* OID Fanout
* OID Lookup
* Commit Data

If any of these are missing, then the 'verify' subcommand should
report a failure. This includes the chunk IDs malformed or the
chunk count is truncated.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          |  9 +++++++++
 t/t5318-commit-graph.sh | 29 +++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 55b41664ee..06e3e4f9ba 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -860,5 +860,14 @@ int verify_commit_graph(struct commit_graph *g)
 		return 1;
 	}
 
+	verify_commit_graph_error = 0;
+
+	if (!g->chunk_oid_fanout)
+		graph_report("commit-graph is missing the OID Fanout chunk");
+	if (!g->chunk_oid_lookup)
+		graph_report("commit-graph is missing the OID Lookup chunk");
+	if (!g->chunk_commit_data)
+		graph_report("commit-graph is missing the Commit Data chunk");
+
 	return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index bd64481c7a..4ef3fe3dc2 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -249,6 +249,15 @@ test_expect_success 'git commit-graph verify' '
 
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
+GRAPH_BYTE_CHUNK_COUNT=6
+GRAPH_CHUNK_LOOKUP_OFFSET=8
+GRAPH_CHUNK_LOOKUP_WIDTH=12
+GRAPH_CHUNK_LOOKUP_ROWS=5
+GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
+GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
+			      1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
+GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
+				2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -283,4 +292,24 @@ test_expect_success 'detect bad hash version' '
 		"hash version"
 '
 
+test_expect_success 'detect bad chunk count' '
+	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\02" \
+		"missing the Commit Data chunk"
+'
+
+test_expect_success 'detect missing OID fanout chunk' '
+	corrupt_graph_and_verify $GRAPH_BYTE_OID_FANOUT_ID "\0" \
+		"missing the OID Fanout chunk"
+'
+
+test_expect_success 'detect missing OID lookup chunk' '
+	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ID "\0" \
+		"missing the OID Lookup chunk"
+'
+
+test_expect_success 'detect missing commit data chunk' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATA_ID "\0" \
+		"missing the Commit Data chunk"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (7 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 08/20] commit-graph: verify required chunks are present Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-30 13:34         ` Jakub Narebski
  2018-06-02  4:38         ` Duy Nguyen
  2018-05-24 16:25       ` [PATCH v3 10/20] commit-graph: verify objects exist Derrick Stolee
                         ` (13 subsequent siblings)
  22 siblings, 2 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

In the commit-graph file, the OID fanout chunk provides an index into
the OID lookup. The 'verify' subcommand should find incorrect values
in the fanout.

Similarly, the 'verify' subcommand should find out-of-order values in
the OID lookup.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 36 ++++++++++++++++++++++++++++++++++++
 t/t5318-commit-graph.sh | 22 ++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 06e3e4f9ba..cbd1aae514 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...)
 
 int verify_commit_graph(struct commit_graph *g)
 {
+	uint32_t i, cur_fanout_pos = 0;
+	struct object_id prev_oid, cur_oid;
+
 	if (!g) {
 		graph_report("no commit-graph file loaded");
 		return 1;
@@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g)
 	if (!g->chunk_commit_data)
 		graph_report("commit-graph is missing the Commit Data chunk");
 
+	if (verify_commit_graph_error)
+		return verify_commit_graph_error;
+
+	for (i = 0; i < g->num_commits; i++) {
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
+			graph_report("commit-graph has incorrect OID order: %s then %s",
+				     oid_to_hex(&prev_oid),
+				     oid_to_hex(&cur_oid));
+
+		oidcpy(&prev_oid, &cur_oid);
+
+		while (cur_oid.hash[0] > cur_fanout_pos) {
+			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+			if (i != fanout_value)
+				graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+					     cur_fanout_pos, fanout_value, i);
+
+			cur_fanout_pos++;
+		}
+	}
+
+	while (cur_fanout_pos < 256) {
+		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
+
+		if (g->num_commits != fanout_value)
+			graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
+				     cur_fanout_pos, fanout_value, i);
+
+		cur_fanout_pos++;
+	}
+
 	return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 4ef3fe3dc2..c050ef980b 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
 	git commit-graph verify >output
 '
 
+HASH_LEN=20
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
 GRAPH_BYTE_CHUNK_COUNT=6
@@ -258,6 +259,12 @@ GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
 			      1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
 GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
 				2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
+GRAPH_FANOUT_OFFSET=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
+			  $GRAPH_CHUNK_LOOKUP_WIDTH \* $GRAPH_CHUNK_LOOKUP_ROWS`
+GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
+GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
+GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
+GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -312,4 +319,19 @@ test_expect_success 'detect missing commit data chunk' '
 		"missing the Commit Data chunk"
 '
 
+test_expect_success 'detect incorrect fanout' '
+	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT1 "\01" \
+		"fanout value"
+'
+
+test_expect_success 'detect incorrect fanout' '
+	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
+		"fanout value"
+'
+
+test_expect_success 'detect incorrect OID order' '
+	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ORDER "\01" \
+		"incorrect OID order"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 10/20] commit-graph: verify objects exist
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (8 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-30 19:22         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 11/20] commit-graph: verify root tree OIDs Derrick Stolee
                         ` (12 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

In the 'verify' subcommand, load commits directly from the object
database to ensure they exist. Parse by skipping the commit-graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 20 ++++++++++++++++++++
 t/t5318-commit-graph.sh |  7 +++++++
 2 files changed, 27 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index cbd1aae514..0420ebcd87 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
 {
 	struct commit *c;
 	struct object_id oid;
+
+	if (pos >= g->num_commits)
+		die("invalid parent position %"PRIu64, pos);
+
 	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
 	c = lookup_commit(&oid);
 	if (!c)
@@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g)
 		cur_fanout_pos++;
 	}
 
+	if (verify_commit_graph_error)
+		return verify_commit_graph_error;
+
+	for (i = 0; i < g->num_commits; i++) {
+		struct commit *odb_commit;
+
+		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
+		if (parse_commit_internal(odb_commit, 0, 0)) {
+			graph_report("failed to parse %s from object database",
+				     oid_to_hex(&cur_oid));
+			continue;
+		}
+	}
+
 	return verify_commit_graph_error;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index c050ef980b..996a016239 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
 	git commit-graph verify >output
 '
 
+NUM_COMMITS=9
 HASH_LEN=20
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
@@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
 GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
 GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
 GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
+GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' '
 		"incorrect OID order"
 '
 
+test_expect_success 'detect OID not in object database' '
+	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \
+		"from object database"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 11/20] commit-graph: verify root tree OIDs
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (9 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 10/20] commit-graph: verify objects exist Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-05-30 22:24         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 12/20] commit-graph: verify parent list Derrick Stolee
                         ` (11 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The 'verify' subcommand must compare the commit content parsed from the
commit-graph and compare it against the content in the object database.
Use lookup_commit() and parse_commit_in_graph_one() to parse the commits
from the graph and compare against a commit that is loaded separately
and parsed directly from the object database.

Add checks for the root tree OID.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 17 ++++++++++++++++-
 t/t5318-commit-graph.sh |  7 +++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0420ebcd87..19ea369fc6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g)
 		return verify_commit_graph_error;
 
 	for (i = 0; i < g->num_commits; i++) {
+		struct commit *graph_commit;
+
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
 		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g)
 
 			cur_fanout_pos++;
 		}
+
+		graph_commit = lookup_commit(&cur_oid);
+		if (!parse_commit_in_graph_one(g, graph_commit))
+			graph_report("failed to parse %s from commit-graph",
+				     oid_to_hex(&cur_oid));
 	}
 
 	while (cur_fanout_pos < 256) {
@@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g)
 		return verify_commit_graph_error;
 
 	for (i = 0; i < g->num_commits; i++) {
-		struct commit *odb_commit;
+		struct commit *graph_commit, *odb_commit;
 
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
+		graph_commit = lookup_commit(&cur_oid);
 		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
 		if (parse_commit_internal(odb_commit, 0, 0)) {
 			graph_report("failed to parse %s from object database",
 				     oid_to_hex(&cur_oid));
 			continue;
 		}
+
+		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
+			   get_commit_tree_oid(odb_commit)))
+			graph_report("root tree OID for commit %s in commit-graph is %s != %s",
+				     oid_to_hex(&cur_oid),
+				     oid_to_hex(get_commit_tree_oid(graph_commit)),
+				     oid_to_hex(get_commit_tree_oid(odb_commit)));
 	}
 
 	return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 996a016239..21cc8e82f3 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
 GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
 GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
 GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
+GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS`
+GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -341,4 +343,9 @@ test_expect_success 'detect OID not in object database' '
 		"from object database"
 '
 
+test_expect_success 'detect incorrect tree OID' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_TREE "\01" \
+		"root tree OID for commit"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 12/20] commit-graph: verify parent list
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (10 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 11/20] commit-graph: verify root tree OIDs Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-06-01 23:21         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 13/20] commit-graph: verify generation number Derrick Stolee
                         ` (10 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The commit-graph file stores parents in a two-column portion of the
commit data chunk. If there is only one parent, then the second column
stores 0xFFFFFFFF to indicate no second parent.

The 'verify' subcommand checks the parent list for the commit loaded
from the commit-graph and the one parsed from the object database. Test
these checks for corrupt parents, too many parents, and wrong parents.

The octopus merge will be tested in a later commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 25 +++++++++++++++++++++++++
 t/t5318-commit-graph.sh | 18 ++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 19ea369fc6..fff22dc0c3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -921,6 +921,7 @@ int verify_commit_graph(struct commit_graph *g)
 
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
+		struct commit_list *graph_parents, *odb_parents;
 
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
@@ -938,6 +939,30 @@ int verify_commit_graph(struct commit_graph *g)
 				     oid_to_hex(&cur_oid),
 				     oid_to_hex(get_commit_tree_oid(graph_commit)),
 				     oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+		graph_parents = graph_commit->parents;
+		odb_parents = odb_commit->parents;
+
+		while (graph_parents) {
+			if (odb_parents == NULL) {
+				graph_report("commit-graph parent list for commit %s is too long",
+					     oid_to_hex(&cur_oid));
+				break;
+			}
+
+			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
+				graph_report("commit-graph parent for %s is %s != %s",
+					     oid_to_hex(&cur_oid),
+					     oid_to_hex(&graph_parents->item->object.oid),
+					     oid_to_hex(&odb_parents->item->object.oid));
+
+			graph_parents = graph_parents->next;
+			odb_parents = odb_parents->next;
+		}
+
+		if (odb_parents != NULL)
+			graph_report("commit-graph parent list for commit %s terminates early",
+				     oid_to_hex(&cur_oid));
 	}
 
 	return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 21cc8e82f3..12f0d7f54d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -269,6 +269,9 @@ GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
 GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
 GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS`
 GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
+GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
+GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
+GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -348,4 +351,19 @@ test_expect_success 'detect incorrect tree OID' '
 		"root tree OID for commit"
 '
 
+test_expect_success 'detect incorrect parent int-id' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_PARENT "\01" \
+		"invalid parent"
+'
+
+test_expect_success 'detect extra parent int-id' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_EXTRA_PARENT "\00" \
+		"is too long"
+'
+
+test_expect_success 'detect incorrect tree OID' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_WRONG_PARENT "\01" \
+		"commit-graph parent for"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 13/20] commit-graph: verify generation number
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (11 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 12/20] commit-graph: verify parent list Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-06-02 12:23         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 14/20] commit-graph: verify commit date Derrick Stolee
                         ` (9 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

While iterating through the commit parents, perform the generation
number calculation and compare against the value stored in the
commit-graph.

The tests demonstrate that having a different set of parents affects
the generation number calculation, and this value propagates to
descendants. Hence, we drop the single-line condition on the output.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 18 ++++++++++++++++++
 t/t5318-commit-graph.sh |  6 ++++++
 2 files changed, 24 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index fff22dc0c3..ead92460c1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -922,6 +922,7 @@ int verify_commit_graph(struct commit_graph *g)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
+		uint32_t max_generation = 0;
 
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
@@ -956,6 +957,9 @@ int verify_commit_graph(struct commit_graph *g)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
+			if (graph_parents->item->generation > max_generation)
+				max_generation = graph_parents->item->generation;
+
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
 		}
@@ -963,6 +967,20 @@ int verify_commit_graph(struct commit_graph *g)
 		if (odb_parents != NULL)
 			graph_report("commit-graph parent list for commit %s terminates early",
 				     oid_to_hex(&cur_oid));
+
+		/*
+		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
+		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+		 * extra logic in the following condition.
+		 */
+		if (max_generation == GENERATION_NUMBER_MAX)
+			max_generation--;
+
+		if (graph_commit->generation != max_generation + 1)
+			graph_report("commit-graph generation for commit %s is %u != %u",
+				     oid_to_hex(&cur_oid),
+				     graph_commit->generation,
+				     max_generation + 1);
 	}
 
 	return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 12f0d7f54d..673b0d37d5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -272,6 +272,7 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
 GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
 GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
+GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -366,4 +367,9 @@ test_expect_success 'detect incorrect tree OID' '
 		"commit-graph parent for"
 '
 
+test_expect_success 'detect incorrect generation number' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
+		"generation"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 14/20] commit-graph: verify commit date
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (12 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 13/20] commit-graph: verify generation number Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-06-02 12:29         ` Jakub Narebski
  2018-05-24 16:25       ` [PATCH v3 15/20] commit-graph: test for corrupted octopus edge Derrick Stolee
                         ` (8 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 6 ++++++
 t/t5318-commit-graph.sh | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index ead92460c1..d2b291aca2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -981,6 +981,12 @@ int verify_commit_graph(struct commit_graph *g)
 				     oid_to_hex(&cur_oid),
 				     graph_commit->generation,
 				     max_generation + 1);
+
+		if (graph_commit->date != odb_commit->date)
+			graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime,
+				     oid_to_hex(&cur_oid),
+				     graph_commit->date,
+				     odb_commit->date);
 	}
 
 	return verify_commit_graph_error;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 673b0d37d5..58adb8246d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -273,6 +273,7 @@ GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
 GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
 GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
 GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
+GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -372,4 +373,9 @@ test_expect_success 'detect incorrect generation number' '
 		"generation"
 '
 
+test_expect_success 'detect incorrect commit date' '
+	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATE "\01" \
+		"commit date"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 15/20] commit-graph: test for corrupted octopus edge
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (13 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 14/20] commit-graph: verify commit date Derrick Stolee
@ 2018-05-24 16:25       ` Derrick Stolee
  2018-06-02 12:39         ` Jakub Narebski
  2018-05-24 16:26       ` [PATCH v3 16/20] commit-graph: verify contents match checksum Derrick Stolee
                         ` (7 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:25 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The commit-graph file has an extra chunk to store the parent int-ids for
parents beyond the first parent for octopus merges. Our test repo has a
single octopus merge that we can manipulate to demonstrate the 'verify'
subcommand detects incorrect values in that chunk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t5318-commit-graph.sh | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 58adb8246d..240aef6add 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -248,6 +248,7 @@ test_expect_success 'git commit-graph verify' '
 '
 
 NUM_COMMITS=9
+NUM_OCTOPUS_EDGES=2
 HASH_LEN=20
 GRAPH_BYTE_VERSION=4
 GRAPH_BYTE_HASH=5
@@ -274,6 +275,10 @@ GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
 GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
 GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
 GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
+GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
+GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
+				$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
+GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -378,4 +383,9 @@ test_expect_success 'detect incorrect commit date' '
 		"commit date"
 '
 
+test_expect_success 'detect incorrect parent for octopus merge' '
+	corrupt_graph_and_verify $GRAPH_BYTE_OCTOPUS "\01" \
+		"invalid parent"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 16/20] commit-graph: verify contents match checksum
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (14 preceding siblings ...)
  2018-05-24 16:25       ` [PATCH v3 15/20] commit-graph: test for corrupted octopus edge Derrick Stolee
@ 2018-05-24 16:26       ` Derrick Stolee
  2018-05-30 12:35         ` SZEDER Gábor
  2018-06-02 15:52         ` Jakub Narebski
  2018-05-24 16:26       ` [PATCH v3 17/20] fsck: verify commit-graph Derrick Stolee
                         ` (6 subsequent siblings)
  22 siblings, 2 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:26 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The commit-graph file ends with a SHA1 hash of the previous contents. If
a commit-graph file has errors but the checksum hash is correct, then we
know that the problem is a bug in Git and not simply file corruption
after-the-fact.

Compute the checksum right away so it is the first error that appears,
and make the message translatable since this error can be "corrected" by
a user by simply deleting the file and recomputing. The rest of the
errors are useful only to developers.

Be sure to continue checking the rest of the file data if the checksum
is wrong. This is important for our tests, as we break the checksum as
we modify bytes of the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 16 ++++++++++++++--
 t/t5318-commit-graph.sh |  6 ++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index d2b291aca2..a33600c584 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -841,6 +841,7 @@ void write_commit_graph(const char *obj_dir,
 	oids.nr = 0;
 }
 
+#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
 static int verify_commit_graph_error;
 
 static void graph_report(const char *fmt, ...)
@@ -860,7 +861,9 @@ static void graph_report(const char *fmt, ...)
 int verify_commit_graph(struct commit_graph *g)
 {
 	uint32_t i, cur_fanout_pos = 0;
-	struct object_id prev_oid, cur_oid;
+	struct object_id prev_oid, cur_oid, checksum;
+	struct hashfile *f;
+	int devnull;
 
 	if (!g) {
 		graph_report("no commit-graph file loaded");
@@ -879,6 +882,15 @@ int verify_commit_graph(struct commit_graph *g)
 	if (verify_commit_graph_error)
 		return verify_commit_graph_error;
 
+	devnull = open("/dev/null", O_WRONLY);
+	f = hashfd(devnull, NULL);
+	hashwrite(f, g->data, g->data_len - g->hash_len);
+	finalize_hashfile(f, checksum.hash, CSUM_CLOSE);
+	if (hashcmp(checksum.hash, g->data + g->data_len - g->hash_len)) {
+		graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt"));
+		verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH;
+	}
+
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit;
 
@@ -916,7 +928,7 @@ int verify_commit_graph(struct commit_graph *g)
 		cur_fanout_pos++;
 	}
 
-	if (verify_commit_graph_error)
+	if (verify_commit_graph_error & ~VERIFY_COMMIT_GRAPH_ERROR_HASH)
 		return verify_commit_graph_error;
 
 	for (i = 0; i < g->num_commits; i++) {
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 240aef6add..2680a2ebff 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -279,6 +279,7 @@ GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
 GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
 				$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
 GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
+GRAPH_BYTE_FOOTER=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4 \* $NUM_OCTOPUS_EDGES`
 
 # usage: corrupt_graph_and_verify <position> <data> <string>
 # Manipulates the commit-graph file at the position
@@ -388,4 +389,9 @@ test_expect_success 'detect incorrect parent for octopus merge' '
 		"invalid parent"
 '
 
+test_expect_success 'detect invalid checksum hash' '
+	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
+		"incorrect checksum"
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 17/20] fsck: verify commit-graph
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (15 preceding siblings ...)
  2018-05-24 16:26       ` [PATCH v3 16/20] commit-graph: verify contents match checksum Derrick Stolee
@ 2018-05-24 16:26       ` Derrick Stolee
  2018-06-02 16:17         ` Jakub Narebski
  2018-05-24 16:26       ` [PATCH v3 18/20] commit-graph: add '--reachable' option Derrick Stolee
                         ` (5 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:26 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

If core.commitGraph is true, verify the contents of the commit-graph
during 'git fsck' using the 'git commit-graph verify' subcommand. Run
this check on all alternates, as well.

We use a new process for two reasons:

1. The subcommand decouples the details of loading and verifying a
   commit-graph file from the other fsck details.

2. The commit-graph verification requires the commits to be loaded
   in a specific order to guarantee we parse from the commit-graph
   file for some objects and from the object database for others.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-fsck.txt |  3 +++
 builtin/fsck.c             | 21 +++++++++++++++++++++
 t/t5318-commit-graph.sh    |  8 ++++++++
 3 files changed, 32 insertions(+)

diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
index b9f060e3b2..ab9a93fb9b 100644
--- a/Documentation/git-fsck.txt
+++ b/Documentation/git-fsck.txt
@@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives
 (i.e., you can just remove them and do an 'rsync' with some other site in
 the hopes that somebody else has the object you have corrupted).
 
+If core.commitGraph is true, the commit-graph file will also be inspected
+using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
+
 Extracted Diagnostics
 ---------------------
 
diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..a6d5045b77 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	}
 
 	check_connectivity();
+
+	if (core_commit_graph) {
+		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
+		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL };
+		commit_graph_verify.argv = verify_argv;
+		commit_graph_verify.git_cmd = 1;
+
+		if (run_command(&commit_graph_verify))
+			errors_found |= ERROR_COMMIT_GRAPH;
+
+		prepare_alt_odb();
+		for (alt = alt_odb_list; alt; alt = alt->next) {
+			verify_argv[2] = "--object-dir";
+			verify_argv[3] = alt->path;
+			if (run_command(&commit_graph_verify))
+				errors_found |= ERROR_COMMIT_GRAPH;
+		}
+	}
+
 	return errors_found;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2680a2ebff..4941937163 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -394,4 +394,12 @@ test_expect_success 'detect invalid checksum hash' '
 		"incorrect checksum"
 '
 
+test_expect_success 'git fsck (checks commit-graph)' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git fsck &&
+	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
+		"incorrect checksum" &&
+	test_must_fail git fsck
+'
+
 test_done
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 18/20] commit-graph: add '--reachable' option
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (16 preceding siblings ...)
  2018-05-24 16:26       ` [PATCH v3 17/20] fsck: verify commit-graph Derrick Stolee
@ 2018-05-24 16:26       ` Derrick Stolee
  2018-06-02 17:34         ` Jakub Narebski
  2018-05-24 16:26       ` [PATCH v3 19/20] gc: automatically write commit-graph files Derrick Stolee
                         ` (4 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:26 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
write_commit_graph_reachable() after performing cleanup of the object
database.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 ++++++--
 builtin/commit-graph.c             | 16 ++++++++++++----
 commit-graph.c                     | 32 ++++++++++++++++++++++++++++++++
 commit-graph.h                     |  1 +
 t/t5318-commit-graph.sh            | 10 ++++++++++
 5 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index a222cfab08..dececb79d7 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with `--stdin-commits` or `--reachable`.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+`--stdin-packs` or `--reachable`.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with `--stdin-commits`
+or `--stdin-packs`.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0433dd6e20..20ce6437ae 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph [--object-dir <objdir>]"),
 	N_("git commit-graph read [--object-dir <objdir>]"),
 	N_("git commit-graph verify [--object-dir <objdir>]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
@@ -24,12 +24,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
 	NULL
 };
 
 static struct opts_commit_graph {
 	const char *obj_dir;
+	int reachable;
 	int stdin_packs;
 	int stdin_commits;
 	int append;
@@ -130,6 +131,8 @@ static int graph_write(int argc, const char **argv)
 		OPT_STRING(0, "object-dir", &opts.obj_dir,
 			N_("dir"),
 			N_("The object directory to store the graph")),
+		OPT_BOOL(0, "reachable", &opts.reachable,
+			N_("start walk at all refs")),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			N_("scan pack-indexes listed by stdin for commits")),
 		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
@@ -143,11 +146,16 @@ static int graph_write(int argc, const char **argv)
 			     builtin_commit_graph_write_options,
 			     builtin_commit_graph_write_usage, 0);
 
-	if (opts.stdin_packs && opts.stdin_commits)
-		die(_("cannot use both --stdin-commits and --stdin-packs"));
+	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
 	if (!opts.obj_dir)
 		opts.obj_dir = get_object_directory();
 
+	if (opts.reachable) {
+		write_commit_graph_reachable(opts.obj_dir, opts.append);
+		return 0;
+	}
+
 	if (opts.stdin_packs || opts.stdin_commits) {
 		struct strbuf buf = STRBUF_INIT;
 		lines_nr = 0;
diff --git a/commit-graph.c b/commit-graph.c
index a33600c584..057d734926 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -6,6 +6,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "refs.h"
 #include "revision.h"
 #include "sha1-lookup.h"
 #include "commit-graph.h"
@@ -651,6 +652,37 @@ static void compute_generation_numbers(struct packed_commit_list* commits)
 	}
 }
 
+struct hex_list {
+	char **hex_strs;
+	int hex_nr;
+	int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+			   const struct object_id *oid,
+			   int flags, void *cb_data)
+{
+	struct hex_list *list = (struct hex_list*)cb_data;
+
+	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+	list->hex_nr++;
+	return 0;
+}
+
+void write_commit_graph_reachable(const char *obj_dir, int append)
+{
+	struct hex_list list;
+	list.hex_nr = 0;
+	list.hex_alloc = 128;
+	ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
+
+	for_each_ref(add_ref_to_list, &list);
+
+	write_commit_graph(obj_dir, NULL, 0, (const char **)list.hex_strs, list.hex_nr, append);
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
diff --git a/commit-graph.h b/commit-graph.h
index 71a39c5a57..9a06a5f188 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -46,6 +46,7 @@ struct commit_graph {
 
 struct commit_graph *load_commit_graph_one(const char *graph_file);
 
+void write_commit_graph_reachable(const char *obj_dir, int append);
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 4941937163..a659620332 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' '
 graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
 graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
 
+test_expect_success 'build graph using --reachable' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit-graph write --reachable &&
+	test_path_is_file $objdir/info/commit-graph &&
+	graph_read_expect "11" "large_edges"
+'
+
+graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
+graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
+
 test_expect_success 'setup bare repo' '
 	cd "$TRASH_DIRECTORY" &&
 	git clone --bare --no-local full bare &&
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 19/20] gc: automatically write commit-graph files
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (17 preceding siblings ...)
  2018-05-24 16:26       ` [PATCH v3 18/20] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-05-24 16:26       ` Derrick Stolee
  2018-06-02 18:03         ` Jakub Narebski
  2018-05-24 16:26       ` [PATCH v3 20/20] commit-graph: update design document Derrick Stolee
                         ` (3 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:26 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command. Defaults to
false while the commit-graph feature matures. We specifically do not
want to turn this on by default until the commit-graph feature is fully
integrated with history-modifying features like shallow clones.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config.txt |  6 ++++++
 Documentation/git-gc.txt |  4 ++++
 builtin/gc.c             |  6 ++++++
 t/t5318-commit-graph.sh  | 14 ++++++++++++++
 4 files changed, 30 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 11f027194e..9a3abd87e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1553,6 +1553,12 @@ gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
 	if the system supports it. Default is true.
 
+gc.commitGraph::
+	If true, then gc will rewrite the commit-graph file after any
+	change to the object database. If '--auto' is used, then the
+	commit-graph will not be updated unless the threshold is met.
+	See linkgit:git-commit-graph[1] for details.
+
 gc.logExpiry::
 	If the file gc.log exists, then `git gc --auto` won't run
 	unless that file is more than 'gc.logExpiry' old.  Default is
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..efd214a59f 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -20,6 +20,7 @@
 #include "argv-array.h"
 #include "commit.h"
 #include "packfile.h"
+#include "commit-graph.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -34,6 +35,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 0;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -121,6 +123,7 @@ static void gc_config(void)
 	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
 	git_config_get_int("gc.auto", &gc_auto_threshold);
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
+	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
@@ -480,6 +483,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (pack_garbage.nr > 0)
 		clean_pack_garbage();
 
+	if (gc_commit_graph)
+		write_commit_graph_reachable(get_object_directory(), 0);
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a659620332..d20b17586f 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -245,6 +245,20 @@ test_expect_success 'perform fast-forward merge in full repo' '
 	test_cmp expect output
 '
 
+test_expect_success 'check that gc clears commit-graph' '
+	cd "$TRASH_DIRECTORY/full" &&
+	git commit --allow-empty -m "blank" &&
+	git commit-graph write --reachable &&
+	cp $objdir/info/commit-graph commit-graph-before-gc &&
+	git reset --hard HEAD~1 &&
+	git config gc.commitGraph true &&
+	git gc &&
+	cp $objdir/info/commit-graph commit-graph-after-gc &&
+	! test_cmp commit-graph-before-gc commit-graph-after-gc &&
+	git commit-graph write --reachable &&
+	test_cmp commit-graph-after-gc $objdir/info/commit-graph
+'
+
 # the verify tests below expect the commit-graph to contain
 # exactly the commits reachable from the commits/8 branch.
 # If the file changes the set of commits in the list, then the
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* [PATCH v3 20/20] commit-graph: update design document
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (18 preceding siblings ...)
  2018-05-24 16:26       ` [PATCH v3 19/20] gc: automatically write commit-graph files Derrick Stolee
@ 2018-05-24 16:26       ` Derrick Stolee
  2018-06-02 18:27         ` Jakub Narebski
  2018-05-24 21:15       ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Ævar Arnfjörð Bjarmason
                         ` (2 subsequent siblings)
  22 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:26 UTC (permalink / raw)
  To: git; +Cc: gitster, jnareb, stolee, avarab, marten.agren, peff, Derrick Stolee

The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Also remove the section on lazy-loading trees, as that was completed
in an earlier patch series.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 22 ----------------------
 1 file changed, 22 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index e1a883eb46..c664acbd76 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -130,25 +127,6 @@ Future Work
     - 'log --topo-order'
     - 'tag --merged'
 
-- Currently, parse_commit_gently() requires filling in the root tree
-  object for a commit. This passes through lookup_tree() and consequently
-  lookup_object(). Also, it calls lookup_commit() when loading the parents.
-  These method calls check the ODB for object existence, even if the
-  consumer does not need the content. For example, we do not need the
-  tree contents when computing merge bases. Now that commit parsing is
-  removed from the computation time, these lookup operations are the
-  slowest operations keeping graph walks from being fast. Consider
-  loading these objects without verifying their existence in the ODB and
-  only loading them fully when consumers need them. Consider a method
-  such as "ensure_tree_loaded(commit)" that fully loads a tree before
-  using commit->tree.
-
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.16.2.329.gfb62395de6


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption
  2018-05-21 18:53       ` Jakub Narebski
@ 2018-05-24 16:28         ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-24 16:28 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee
  Cc: git, Ævar Arnfjörð Bjarmason, Martin Ågren,
	Jeff King, Nguyễn Thái Ngọc Duy

On 5/21/2018 2:53 PM, Jakub Narebski wrote:
>> +corrupt_data() {
>> +	file=$1
>> +	pos=$2
>> +	data="${3:-\0}"
>> +	printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
>> +}
> First, if we do this that way (and not by adding a test helper), the use
> of this function should be, I think, protected using appropriate test
> prerequisite.  Not everyone has 'dd' tool installed, for example on
> MS Windows.

Windows does not, but it is also missing many things this test suite 
needs. 'dd' is included in the Git for Windows SDK. I rebased this 
series onto Git for Windows and the tests passed when run in an SDK shell.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (19 preceding siblings ...)
  2018-05-24 16:26       ` [PATCH v3 20/20] commit-graph: update design document Derrick Stolee
@ 2018-05-24 21:15       ` Ævar Arnfjörð Bjarmason
  2018-05-25  4:11       ` Junio C Hamano
  2018-05-29  4:27       ` Junio C Hamano
  22 siblings, 0 replies; 149+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-05-24 21:15 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, jnareb\, stolee\, marten.agren\, peff\


On Thu, May 24 2018, Derrick Stolee wrote:

> Thanks for all the feedback on v2. I've tried to make this round's
> review a bit easier by splitting up the commits into smaller pieces.
> Also, the test script now has less boilerplate and uses variables and
> clear arithmetic to explain which bytes are being modified.

Thanks. it's a lot easier.

> One other change worth mentioning: in "commit-graph: add '--reachable'
> option" I put the ref-iteration into a new external
> 'write_commit_graph_reachable()' method inside commit-graph.c. This
> makes the 'gc: automatically write commit-graph files' a simpler change.

Maybe you want this, maybe not, but I came up with this to squash:

    diff --git a/Documentation/config.txt b/Documentation/config.txt
    index 9a3abd87e7..2665522385 100644
    --- a/Documentation/config.txt
    +++ b/Documentation/config.txt
    @@ -900,7 +900,8 @@ the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].

     core.commitGraph::
            Enable git commit graph feature. Allows reading from the
    -       commit-graph file.
    +       commit-graph file. See `gc.commitGraph` for automatically
    +       maintaining the file.

     core.sparseCheckout::
            Enable "sparse checkout" feature. See section "Sparse checkout" in
    @@ -1554,10 +1555,10 @@ gc.autoDetach::
            if the system supports it. Default is true.

     gc.commitGraph::
    -       If true, then gc will rewrite the commit-graph file after any
    -       change to the object database. If '--auto' is used, then the
    -       commit-graph will not be updated unless the threshold is met.
    -       See linkgit:git-commit-graph[1] for details.
    +       If true, then gc will rewrite the commit-graph file when
    +       linkgit:git-gc[1] is run. When using linkgit:git-gc[1]
    +       '--auto' the commit-graph will be updated if housekeeping is
    +       required. See linkgit:git-commit-graph[1] for details.

     gc.logExpiry::
            If the file gc.log exists, then `git gc --auto` won't run

I.e. let's mention the new gc.commitGraph in core.commitGraph, and I
think the "any change to the object database" line in gc.commitGraph is
needlessly confusing, let's just say "when git-gc is run".

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 01/20] commit-graph: UNLEAK before die()
  2018-05-24 16:25       ` [PATCH v3 01/20] commit-graph: UNLEAK before die() Derrick Stolee
@ 2018-05-24 22:47         ` Stefan Beller
  2018-05-25  0:08           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Stefan Beller @ 2018-05-24 22:47 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, jnareb, stolee, avarab, marten.agren, peff

On Thu, May 24, 2018 at 9:25 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/commit-graph.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 37420ae0fd..f0875b8bf3 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv)
>         graph_name = get_commit_graph_filename(opts.obj_dir);
>         graph = load_commit_graph_one(graph_name);
>
> -       if (!graph)
> +       if (!graph) {
> +               UNLEAK(graph_name);
>                 die("graph file %s does not exist", graph_name);

Unrelated to this patch: Is the command that ends up die()ing here
a plumbing or porcelain, or: Do we want to translate the message here?

In a lot of commands that show paths we single quote them '%s',
(speaking from experience with a lot of submodule path code)

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 01/20] commit-graph: UNLEAK before die()
  2018-05-24 22:47         ` Stefan Beller
@ 2018-05-25  0:08           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-25  0:08 UTC (permalink / raw)
  To: Stefan Beller, Derrick Stolee
  Cc: git, gitster, jnareb, avarab, marten.agren, peff

On 5/24/2018 6:47 PM, Stefan Beller wrote:
> On Thu, May 24, 2018 at 9:25 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   builtin/commit-graph.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
>> index 37420ae0fd..f0875b8bf3 100644
>> --- a/builtin/commit-graph.c
>> +++ b/builtin/commit-graph.c
>> @@ -51,8 +51,11 @@ static int graph_read(int argc, const char **argv)
>>          graph_name = get_commit_graph_filename(opts.obj_dir);
>>          graph = load_commit_graph_one(graph_name);
>>
>> -       if (!graph)
>> +       if (!graph) {
>> +               UNLEAK(graph_name);
>>                  die("graph file %s does not exist", graph_name);
> Unrelated to this patch: Is the command that ends up die()ing here
> a plumbing or porcelain, or: Do we want to translate the message here?
>
> In a lot of commands that show paths we single quote them '%s',
> (speaking from experience with a lot of submodule path code)

This is for the 'git commit-graph read' command, which is plumbing (and 
'read' is really only for testing). I don't think this message requires 
translation.

I'll keep the quotes in mind for the future.

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (20 preceding siblings ...)
  2018-05-24 21:15       ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Ævar Arnfjörð Bjarmason
@ 2018-05-25  4:11       ` Junio C Hamano
  2018-05-29  4:27       ` Junio C Hamano
  22 siblings, 0 replies; 149+ messages in thread
From: Junio C Hamano @ 2018-05-25  4:11 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, jnareb\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> One other change worth mentioning: in "commit-graph: add '--reachable'
> option" I put the ref-iteration into a new external
> 'write_commit_graph_reachable()' method inside commit-graph.c. This
> makes the 'gc: automatically write commit-graph files' a simpler change.

;-).

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE
  2018-05-24 16:25       ` [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE Derrick Stolee
@ 2018-05-26 18:46         ` Jakub Narebski
  2018-05-26 20:30           ` brian m. carlson
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-26 18:46 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The GRAPH_MIN_SIZE macro should be the smallest size of a parsable
> commit-graph file. However, the minimum number of chunks was wrong.
> It is possible to write a commit-graph file with zero commits, and
> that violates this macro's value.
>
> Rewrite the macro, and use extra macros to better explain the magic
> constants.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index a8c337dd77..82295f0975 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -33,10 +33,11 @@
>  
>  #define GRAPH_LAST_EDGE 0x80000000
>  
> +#define GRAPH_HEADER_SIZE 8

Nice.

>  #define GRAPH_FANOUT_SIZE (4 * 256)
>  #define GRAPH_CHUNKLOOKUP_WIDTH 12
> -#define GRAPH_MIN_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH + GRAPH_FANOUT_SIZE + \
> -			GRAPH_OID_LEN + 8)
> +#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
> +			+ GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)

So in this case we have file header (4-byte signature, 1-byte version
number, 1-byte oid/hash version, 1-byte number of chunks, 1-byte
reserved: 4+1+1+1+1 = 8 bytes), chunk lookup: 3 required chunks plus
terminating label = 4 entries, constant-size fanout chunks, and
checksum.  Two remaining required chunks (OID Lookup and Commit Data)
can have length of 0.


One issue: in the future when Git moves to NewHash, it could encounter
then both commit-graph files using SHA-1 and using NewHash.  What about
GRPH_OID_LEN then: for one of those it would be incorrect.  Unless it is
about minimal length of checksum, that is we assume that NewHash would
be longer than SHA-1, but ten why name it GRAPH_OID_LEN?

This may be going too much in the future; there is no need to borrow
trouble now, where we have only SHA-1 supported as OID.  Still...

>  
>  char *get_commit_graph_filename(const char *obj_dir)
>  {

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE
  2018-05-26 18:46         ` Jakub Narebski
@ 2018-05-26 20:30           ` brian m. carlson
  2018-06-02 19:43             ` Jakub Narebski
  0 siblings, 1 reply; 149+ messages in thread
From: brian m. carlson @ 2018-05-26 20:30 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, gitster, stolee, avarab, marten.agren, peff

[-- Attachment #1: Type: text/plain, Size: 1020 bytes --]

On Sat, May 26, 2018 at 08:46:09PM +0200, Jakub Narebski wrote:
> One issue: in the future when Git moves to NewHash, it could encounter
> then both commit-graph files using SHA-1 and using NewHash.  What about
> GRPH_OID_LEN then: for one of those it would be incorrect.  Unless it is
> about minimal length of checksum, that is we assume that NewHash would
> be longer than SHA-1, but ten why name it GRAPH_OID_LEN?

My proposal is that whatever we're using in the .git directory is
consistent.  If we're using SHA-1 for objects, then everything is SHA-1.
If we're using NewHash for objects, then all data is stored in NewHash
(except translation tables and such).  Any conversions between SHA-1 and
NewHash require a lookup through the standard techniques.

I agree that here it would be more helpful if it were a reference to
the_hash_algo, and I've applied a patch to my object-id-part14 series to
make that conversion.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 03/20] commit-graph: parse commit from chosen graph
  2018-05-24 16:25       ` [PATCH v3 03/20] commit-graph: parse commit from chosen graph Derrick Stolee
@ 2018-05-27 10:23         ` Jakub Narebski
  2018-05-29 12:31           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-27 10:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> Before verifying a commit-graph file against the object database, we
> need to parse all commits from the given commit-graph file. Create
> parse_commit_in_graph_one() to target a given struct commit_graph.

If I understand it properly the problem is that when verifying against
the object database we want to check one single commit-graph file, not
concatenation of data from commit-graph file for the repository and
commit-graph files from its alternates -- like prepare_commit_graph()
does; which is called by parse_commit_in_graph().

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

O.K., so you introduce here a layer of indirection; parse_commit_in_graph()
now just uses parse_commit_in_graph_one(), passing core_commit_graph
(or the_commit_graph) to it, after checking that core_commit_graph is set
(which handles the case when feature is not turned off) and loading
commit-graph file.

Nice and simple 'split function' refactoring, with new function taking
over when there is commit graph file prepared.


So, after the changes:
* parse_commit_in_graph() is responsible for checking whether to use
  commit-graph feature and ensuring that data from commit-graph is
  loaded, where it passes the control to parse_commit_in_graph_one()
* parse_commit_in_graph_one() checks whether commit-graph feature is
  turned on, whether commit we are interested in was already parsed,
  and then uses fill_commit_in_graph() to actually get the data
* fill_commit_in_graph() gets data out of commit-graph file, extracting
  it from commit data chunk (and if needed large edges chunk).

All those functions return 1 if they got data from commit-graph, and 0
if they didn't.


One minor nitpick / complaint / question is about naming of global
variables used here, namely:
* static struct commit_graph *commit_graph
  from commit-graph.c for global storage of commit-graph[s] data
* int core_commit_graph
  from environment.c for storing core.commitGraph config

But I see that at least the latter is common convention in Git source
code; I guess that the former maybe follows convention as used for "the
index" and "the repository" - additionally it is static / file-local.

> ---
>  commit-graph.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 82295f0975..78ba0edc80 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	}
>  }
>  
> -int parse_commit_in_graph(struct commit *item)
> +static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
>  {
>  	uint32_t pos;
>  
> @@ -318,9 +318,21 @@ static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
>  	if (!core_commit_graph)
>  		return 0;

All right, we check that commit-graph feature is enabled because
parse_commit_in_graph_one() will be used standalone, not by being
invoked from parse_commit_in_graph().

This check is fast.

>  	if (item->object.parsed)
>  		return 1;

Sidenote: I just wonder why object.parsed and not for example
object.graph_pos is used to checck whether the object was filled if
possible with commit-graph data...

> +
> +	if (find_commit_in_graph(item, g, &pos))
> +		return fill_commit_in_graph(item, g, pos);
> +
> +	return 0;
> +}
> +
> +int parse_commit_in_graph(struct commit *item)
> +{
> +	if (!core_commit_graph)
> +		return 0;

All right, this check is here to short-circuit and make it so git does
not even try to lod commit-graph file[s] if the feature is disabled.

> +
>  	prepare_commit_graph();
> -	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
> -		return fill_commit_in_graph(item, commit_graph, pos);
> +	if (commit_graph)
> +		return parse_commit_in_graph_one(commit_graph, item);
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 04/20] commit: force commit to parse from object database
  2018-05-24 16:25       ` [PATCH v3 04/20] commit: force commit to parse from object database Derrick Stolee
@ 2018-05-27 18:04         ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-05-27 18:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> In anticipation of verifying commit-graph file contents against the
> object database, create parse_commit_internal() to allow side-stepping
> the commit-graph file and parse directly from the object database.
>
> Due to the use of generation numbers, this method should not be called
> unless the intention is explicit in avoiding commits from the
> commit-graph file.

A straightforward addition of a parameter to parse_commit() and renaming
it to parse_commit_internal(), while changing parse_commit() to be a
simple wrapper around newly introduced parse_commit_internal(), passing
the default arguments.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 9 +++++++--
>  commit.h | 1 +
>  2 files changed, 8 insertions(+), 2 deletions(-)

Nice and simple refactoring in preparation for future changes.

>
> diff --git a/commit.c b/commit.c
> index 1d28677dfb..6eaed0174c 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -392,7 +392,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
>  	return 0;
>  }
>  
> -int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)

I guess that the "new" parse_commit_internal() function was not made
static despite the *_internal() in the name because it would need to be
used from commit-graph.c, isn't it?

I don't think we would need more similar options, so *_with_flags()
would be YAGNI overkill.

>  {
>  	enum object_type type;
>  	void *buffer;
> @@ -403,7 +403,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (parse_commit_in_graph(item))
> +	if (use_commit_graph && parse_commit_in_graph(item))
>  		return 0;
>  	buffer = read_sha1_file(item->object.oid.hash, &type, &size);
>  	if (!buffer)
> @@ -424,6 +424,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
>  	return ret;
>  }
>  
> +int parse_commit_gently(struct commit *item, int quiet_on_missing)
> +{
> +	return parse_commit_internal(item, quiet_on_missing, 1);
> +}
> +
>  void parse_commit_or_die(struct commit *item)
>  {
>  	if (parse_commit(item))
> diff --git a/commit.h b/commit.h
> index b5afde1ae9..5fde74fcd7 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
>  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
>  
>  int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
> +int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph);
>  int parse_commit_gently(struct commit *item, int quiet_on_missing);
>  static inline int parse_commit(struct commit *item)
>  {

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 05/20] commit-graph: load a root tree from specific graph
  2018-05-24 16:25       ` [PATCH v3 05/20] commit-graph: load a root tree from specific graph Derrick Stolee
@ 2018-05-27 19:12         ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-05-27 19:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> When lazy-loading a tree for a commit, it will be important to select
> the tree from a specific struct commit_graph. Create a new method that
> specifies the commit-graph file and use that in
> get_commit_tree_in_graph().

Is this for the same reason why parse_commit_in_graph_one() was created
in ptch 03/20?  Why it would be important?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)

Simple and straightforward refactoring in the same vein as
parse_commit_in_graph_one() one.

>
> diff --git a/commit-graph.c b/commit-graph.c
> index 78ba0edc80..25893ec096 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -358,14 +358,20 @@ static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *
>  	return c->maybe_tree;
>  }
>  
> -struct tree *get_commit_tree_in_graph(const struct commit *c)
> +static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
> +						 const struct commit *c)
>  {
>  	if (c->maybe_tree)
>  		return c->maybe_tree;
>  	if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
> -		BUG("get_commit_tree_in_graph called from non-commit-graph commit");
> +		BUG("get_commit_tree_in_graph_one called from non-commit-graph commit");

Sidenote: I wonder if it would be better or worse to use __func__ magic
costant variable here (part of C99 and C++11 standards).

> +
> +	return load_tree_for_commit(g, (struct commit *)c);
> +}
>  
> -	return load_tree_for_commit(commit_graph, (struct commit *)c);
> +struct tree *get_commit_tree_in_graph(const struct commit *c)
> +{
> +	return get_commit_tree_in_graph_one(commit_graph, c);
>  }
>  
>  static void write_graph_chunk_fanout(struct hashfile *f,

Looks good to me.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 06/20] commit-graph: add 'verify' subcommand
  2018-05-24 16:25       ` [PATCH v3 06/20] commit-graph: add 'verify' subcommand Derrick Stolee
@ 2018-05-27 22:55         ` Jakub Narebski
  2018-05-30 16:07           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-27 22:55 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> If the commit-graph file becomes corrupt, we need a way to verify
> that its contents match the object database. In the manner of
> 'git fsck' we will implement a 'git commit-graph verify' subcommand
> to report all issues with the file.
>
> Add the 'verify' subcommand to the 'commit-graph' builtin and its
> documentation. The subcommand is currently a no-op except for
> loading the commit-graph into memory, which may trigger run-time
> errors that would be caught by normal use.

So this commit is simply getting the boilerplate out of the way for
implementing 'git commit-graph verify' subcommand.  Good.

>                                            Add a simple test that
> ensures the command returns a zero error code.

Nice.

>
> If no commit-graph file exists, this is an acceptable state. Do
> not report any errors.

All right.  I assume that as it is explicit verification call, it does
ignore core.commitGraph setting, isn't it?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt |  6 ++++++
>  builtin/commit-graph.c             | 38 ++++++++++++++++++++++++++++++++++++++
>  commit-graph.c                     | 26 ++++++++++++++++++++++++++
>  commit-graph.h                     |  2 ++
>  t/t5318-commit-graph.sh            | 10 ++++++++++
>  5 files changed, 82 insertions(+)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 4c97b555cc..a222cfab08 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -10,6 +10,7 @@ SYNOPSIS
>  --------
>  [verse]
>  'git commit-graph read' [--object-dir <dir>]
> +'git commit-graph verify' [--object-dir <dir>]
>  'git commit-graph write' <options> [--object-dir <dir>]

In alphabetical order, good.

>  
>  
> @@ -52,6 +53,11 @@ existing commit-graph file.
>  Read a graph file given by the commit-graph file and output basic
>  details about the graph file. Used for debugging purposes.
>  
> +'verify'::
> +
> +Read the commit-graph file and verify its contents against the object
> +database. Used to check for corrupted data.
> +

All right, good enough description.

>  
>  EXAMPLES
>  --------
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index f0875b8bf3..0433dd6e20 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -8,10 +8,16 @@
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--object-dir <objdir>]"),
>  	N_("git commit-graph read [--object-dir <objdir>]"),
> +	N_("git commit-graph verify [--object-dir <objdir>]"),
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
>  	NULL
>  };

In alphabetical order, same as in the manpage for git-commit-graph.

>  
> +static const char * const builtin_commit_graph_verify_usage[] = {
> +	N_("git commit-graph verify [--object-dir <objdir>]"),
> +	NULL
> +};
> +
>  static const char * const builtin_commit_graph_read_usage[] = {
>  	N_("git commit-graph read [--object-dir <objdir>]"),
>  	NULL
> @@ -29,6 +35,36 @@ static struct opts_commit_graph {
>  	int append;
>  } opts;
>  
> +
> +static int graph_verify(int argc, const char **argv)
> +{
> +	struct commit_graph *graph = 0;
> +	char *graph_name;
> +
> +	static struct option builtin_commit_graph_verify_options[] = {
> +		OPT_STRING(0, "object-dir", &opts.obj_dir,
> +			   N_("dir"),
> +			   N_("The object directory to store the graph")),
> +		OPT_END(),
> +	};
> +
> +	argc = parse_options(argc, argv, NULL,
> +			     builtin_commit_graph_verify_options,
> +			     builtin_commit_graph_verify_usage, 0);
> +
> +	if (!opts.obj_dir)
> +		opts.obj_dir = get_object_directory();

Getting the boilerplate of implementing the command mostly out of the
way.  Good.

> +
> +	graph_name = get_commit_graph_filename(opts.obj_dir);
> +	graph = load_commit_graph_one(graph_name);

So we are verifying only the commit-graph file belonging directly to
current repository, as I have expected.  This is needed to for warnings
and error messages from the 'verify' action, to be able to tell in which
file there are problems.

This means that it is possible that there would be problems with
commit-graph files that running 'git commit-graph verify' would not
find, because they are in commit-graph file in one of the alternates.

It is very easy, though, to check all commit-graph files that would be
read and its data concatenated when using commit-graph feature
(e.g. 'git commit-graph read', IIRC):

  $ git commit-graph verify
  $ for obj_dir in $(cat .git/objects/info/alternates) do;
        git commit-graph --object-dir="$obj_dir";
    done

Note: I have not checked the above that it works.

> +	FREE_AND_NULL(graph_name);

Freeing the resources, always nice to have.

> +
> +	if (!graph)
> +		return 0;

DS> If no commit-graph file exists, this is an acceptable state. Do
DS> not report any errors.

Right, non existant commit-graph file is certainly valid ;-)

> +
> +	return verify_commit_graph(graph);
> +}

I guess that graph_verify() would not change much, if at all, in
subsequent commits in this patch series.

> +
>  static int graph_read(int argc, const char **argv)
>  {
>  	struct commit_graph *graph = NULL;
> @@ -163,6 +199,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  			     PARSE_OPT_STOP_AT_NON_OPTION);
>  
>  	if (argc > 0) {
> +		if (!strcmp(argv[0], "verify"))
> +			return graph_verify(argc, argv);
>  		if (!strcmp(argv[0], "read"))
>  			return graph_read(argc, argv);
>  		if (!strcmp(argv[0], "write"))

Not in alphabetical order... is there a reason for that?

> diff --git a/commit-graph.c b/commit-graph.c
> index 25893ec096..55b41664ee 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -836,3 +836,29 @@ void write_commit_graph(const char *obj_dir,
>  	oids.alloc = 0;
>  	oids.nr = 0;
>  }
> +
> +static int verify_commit_graph_error;
> +
> +static void graph_report(const char *fmt, ...)
> +{
> +	va_list ap;
> +	struct strbuf sb = STRBUF_INIT;
> +	verify_commit_graph_error = 1;
> +
> +	va_start(ap, fmt);
> +	strbuf_vaddf(&sb, fmt, ap);
> +
> +	fprintf(stderr, "%s\n", sb.buf);
> +	strbuf_release(&sb);
> +	va_end(ap);

Why do you use strbuf_vaddf + fprintf instead of straighforward
vfprintf (or function instead of variable-level macro)?

Is it because of [string] safety?

> +}
> +
> +int verify_commit_graph(struct commit_graph *g)
> +{
> +	if (!g) {
> +		graph_report("no commit-graph file loaded");
> +		return 1;
> +	}

All right, this is just a placeholder - we should not ever get this
message because in this case we exit with error code of 0 (EXIT_SUCCESS)
if there is no commit-graph file loaded before invoking
verify_commit_graph().

> +
> +	return verify_commit_graph_error;

All right, this is for the future.  Good.

> +}
> diff --git a/commit-graph.h b/commit-graph.h
> index 96cccb10f3..71a39c5a57 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
>  			int nr_commits,
>  			int append);
>  
> +int verify_commit_graph(struct commit_graph *g);
> +

Why does this need to be exported?  I think it is not used outside of
commit-graph.c, isn't it?

>  #endif
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 77d85aefe7..6ca451dfd2 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
>  	objdir=".git/objects"
>  '
>  
> +test_expect_success 'verify graph with no graph file' '
> +	cd "$TRASH_DIRECTORY/full" &&

Is sich bare `cd`, without corresponding `cd` back or using subshell
safe?

> +	git commit-graph verify
> +'
> +
>  test_expect_success 'write graph with no packs' '
>  	cd "$TRASH_DIRECTORY/full" &&
>  	git commit-graph write --object-dir . &&
> @@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
>  	test_cmp expect output
>  '
>  
> +test_expect_success 'git commit-graph verify' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit-graph verify >output
> +'

Those are tests with nearly the same code, but they are (by their
descriptions) testing different things.  This means that they rely on
side effects of earlier tests.

This is suboptimal, as it means that it would be impossible or very
difficult to run individual tests (e.g. with GIT_SKIP_TESTS environment
variable, or with an individual test suite --run option), unless you
know which tests setup the repository state for later tests.

It also means that running only failed tests with prove
--state=failed,save or equivalently with

  $ make DEFAULT_TEST_TARGET=prove GIT_PROVE_OPTS='--state=failed,save' test

wouldn't work correctly.

As Johannes Schindelin (alias Dscho) said in latest Git Rev News
interview: https://git.github.io/rev_news/2018/05/16/edition-39/

JS> We have a test suite where debugging a regression may mean that you
JS> have to run 98 test cases before the failing one every single time in
JS> the edit/compile/debug cycle, because the 99th test case may depend on
JS> a side effect of at least one of the preceding test cases. Git’s test
JS> suite is so not [21st century best practices][1].
JS>
JS> [1]: https://www.slideshare.net/BuckHodges/lessons-learned-doing-devops-at-scale-at-microsoft


I think can be solved quite efficiently by creating and using shell
function, or two shell functions, which would either:

 * rename commit-graph file to some other temporary name if it exists,
   and move it back after the test.
 * create commit-graph file if it does not exist.

For example (untested):

  prepare_no_commit_graph() {
  	mv .git/info/commit-graph .git/info/commit-graph.away &&
  	test_when_finished "mv .git/info/commit-graph.away .git/info/commit-graph"
  }

  prepare_commit_graph() {
  	if ! test -f ".git/info/commit-graph"
  	then
  		git commit-graph write
  	fi
  }

Or something like that.

> +
>  test_done

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 07/20] commit-graph: verify catches corrupt signature
  2018-05-24 16:25       ` [PATCH v3 07/20] commit-graph: verify catches corrupt signature Derrick Stolee
@ 2018-05-28 14:05         ` Jakub Narebski
  2018-05-29 12:43           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-28 14:05 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> This is the first of several commits that add a test to check that
> 'git commit-graph verify' catches corruption in the commit-graph
> file. The first test checks that the command catches an error in
> the file signature. This is a check that exists in the existing
> commit-graph reading code.

Good start.

This handles 3 out of 5 checks in load_commit_graph_one().  The
remaining are:
* too short file (length smaller than minimal commit-graph size)
* more than one chunk of one of 4 defined types

> Add a helper method 'corrupt_graph_and_verify' to the test script
> t5318-commit-graph.sh. This helper corrupts the commit-graph file
> at a certain location, runs 'git commit-graph verify', and reports
> the output to the 'err' file. This data is filtered to remove the
> lines added by 'test_must_fail' when the test is run verbosely.
> Then, the output is checked to contain a specific error message.

Thanks for an explanation.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t5318-commit-graph.sh | 43 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
>
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 6ca451dfd2..bd64481c7a 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full repo' '
>  	test_cmp expect output
>  '
>  
> +# the verify tests below expect the commit-graph to contain
> +# exactly the commits reachable from the commits/8 branch.
> +# If the file changes the set of commits in the list, then the
> +# offsets into the binary file will result in different edits
> +# and the tests will likely break.
> +
>  test_expect_success 'git commit-graph verify' '
>  	cd "$TRASH_DIRECTORY/full" &&
> +	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
>  	git commit-graph verify >output
>  '

I don't quite understand what the change is meant to do.

Also, as I said earlier, I'd prefer if tests were as indpendent of each
other as possible, to make running individual tests (e.g. only
previously falling tests) easier.

I especially do not like mixing running actual test with setting up the
repository for future tests, as here.

>  
> +GRAPH_BYTE_VERSION=4
> +GRAPH_BYTE_HASH=5
> +
> +# usage: corrupt_graph_and_verify <position> <data> <string>
> +# Manipulates the commit-graph file at the position
> +# by inserting the data, then runs 'git commit-graph verify'
> +# and places the output in the file 'err'. Test 'err' for
> +# the given string.

Very nice to have this description.

> +corrupt_graph_and_verify() {
> +	pos=$1
> +	data="${2:-\0}"
> +	grepstr=$3
> +	cd "$TRASH_DIRECTORY/full" &&
> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
> +	cp $objdir/info/commit-graph commit-graph-backup &&
> +	printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&

Using 'printf' with octal is much more portable than relying on 'echo'
supporting octal escape sequences (or supporting escape sequences at
all).

> +	test_must_fail git commit-graph verify 2>test_err &&
> +	grep -v "^+" test_err >err
> +	grep "$grepstr" err

Shouldn't this last 'grep' be 'test_i18ngrep' instead, to allow for
translated messages from 'git commit-graph verify' / 'git fsck'?

> +}

This function makes actual tests short and simple, without duplicated
code.  Very good.

> +
> +test_expect_success 'detect bad signature' '
> +	corrupt_graph_and_verify 0 "\0" \
> +		"graph signature"
> +'
> +
> +test_expect_success 'detect bad version' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
> +		"graph version"
> +'
> +
> +test_expect_success 'detect bad hash version' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \

When we move from SHA-1 (hash version 1) to NewHash (hash version 2),
this test would start failing... which is actually not a bad idea.

> +		"hash version"
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 08/20] commit-graph: verify required chunks are present
  2018-05-24 16:25       ` [PATCH v3 08/20] commit-graph: verify required chunks are present Derrick Stolee
@ 2018-05-28 17:11         ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-05-28 17:11 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Martin Ågren,
	Jeff King

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph file requires the following three chunks:
>
> * OID Fanout
> * OID Lookup
> * Commit Data
>
> If any of these are missing, then the 'verify' subcommand should
> report a failure. This includes the chunk IDs malformed or the
> chunk count is truncated.

Minor nit: it should IMVHO either be "or the chunk count truncated", or
"or when the chunk count is truncated".

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          |  9 +++++++++
>  t/t5318-commit-graph.sh | 29 +++++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 55b41664ee..06e3e4f9ba 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -860,5 +860,14 @@ int verify_commit_graph(struct commit_graph *g)
>  		return 1;
>  	}
>  
> +	verify_commit_graph_error = 0;
> +

By the way, if chunk count is less than 3, then by pigeonhole principle
at least one required chunk is missing.

> +	if (!g->chunk_oid_fanout)
> +		graph_report("commit-graph is missing the OID Fanout chunk");
> +	if (!g->chunk_oid_lookup)
> +		graph_report("commit-graph is missing the OID Lookup chunk");
> +	if (!g->chunk_commit_data)
> +		graph_report("commit-graph is missing the Commit Data chunk");

Nice and simple.  Good.

> +
>  	return verify_commit_graph_error;
>  }
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index bd64481c7a..4ef3fe3dc2 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -249,6 +249,15 @@ test_expect_success 'git commit-graph verify' '
>  
>  GRAPH_BYTE_VERSION=4
>  GRAPH_BYTE_HASH=5
> +GRAPH_BYTE_CHUNK_COUNT=6
> +GRAPH_CHUNK_LOOKUP_OFFSET=8
> +GRAPH_CHUNK_LOOKUP_WIDTH=12
> +GRAPH_CHUNK_LOOKUP_ROWS=5
> +GRAPH_BYTE_OID_FANOUT_ID=$GRAPH_CHUNK_LOOKUP_OFFSET
> +GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
> +			      1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
> +GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
> +				2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -283,4 +292,24 @@ test_expect_success 'detect bad hash version' '
>  		"hash version"
>  '
>  
> +test_expect_success 'detect bad chunk count' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\02" \
> +		"missing the Commit Data chunk"
> +'

As I wrote before, this test assumes that the last chunk (the one not
counted because of changed / corrupted chunk count) is the Commit Data
chunk.  This may be true for corrent implementation, but it is not
required by the format.

Better solution would be to check for "missing the .* chunk"; as I
understand you can pass the regexp to grep, not only strings.


Another thing would be to check if there are gaps in the file, or if the
whole file is being used.  Changing chunk count to a smaller number
would mean that chunks would not cover the rest of files.

By the way, would the following be detected:

  	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\05"

that is corrupting the chunk count to be larger than the number of
actual chunks?  Or is it left for later?

> +
> +test_expect_success 'detect missing OID fanout chunk' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_OID_FANOUT_ID "\0" \

We could have used "X" or " " in place of "\0", but admittedly the
latter is a better check - it also checks if there are problems with
handling of NUL character ("\0") in chunk names.

> +		"missing the OID Fanout chunk"
> +'
> +
> +test_expect_success 'detect missing OID lookup chunk' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ID "\0" \
> +		"missing the OID Lookup chunk"
> +'
> +
> +test_expect_success 'detect missing commit data chunk' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATA_ID "\0" \
> +		"missing the Commit Data chunk"
> +'

What happens if the terminating pseudo-chunk name "\0\0\0\0" gets
corrupted?  Would it be detected (or maybe it is handled by later patch
in the series)?

> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
  2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
                         ` (21 preceding siblings ...)
  2018-05-25  4:11       ` Junio C Hamano
@ 2018-05-29  4:27       ` Junio C Hamano
  2018-05-29 12:37         ` Derrick Stolee
  22 siblings, 1 reply; 149+ messages in thread
From: Junio C Hamano @ 2018-05-29  4:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, jnareb\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> Thanks for all the feedback on v2. I've tried to make this round's
> review a bit easier by splitting up the commits into smaller pieces.
> Also, the test script now has less boilerplate and uses variables and
> clear arithmetic to explain which bytes are being modified.
>
> One other change worth mentioning: in "commit-graph: add '--reachable'
> option" I put the ref-iteration into a new external
> 'write_commit_graph_reachable()' method inside commit-graph.c. This
> makes the 'gc: automatically write commit-graph files' a simpler change.

I finally managed to find time to resolve conflicts this topic has
with other topics (of your own included, if I am not mistaken).
Please double check the resolution when I push out the day's
integration result later today.

Thanks.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 03/20] commit-graph: parse commit from chosen graph
  2018-05-27 10:23         ` Jakub Narebski
@ 2018-05-29 12:31           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-29 12:31 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 5/27/2018 6:23 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Before verifying a commit-graph file against the object database, we
>> need to parse all commits from the given commit-graph file. Create
>> parse_commit_in_graph_one() to target a given struct commit_graph.
> If I understand it properly the problem is that when verifying against
> the object database we want to check one single commit-graph file, not
> concatenation of data from commit-graph file for the repository and
> commit-graph files from its alternates -- like prepare_commit_graph()
> does; which is called by parse_commit_in_graph().
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> O.K., so you introduce here a layer of indirection; parse_commit_in_graph()
> now just uses parse_commit_in_graph_one(), passing core_commit_graph
> (or the_commit_graph) to it, after checking that core_commit_graph is set
> (which handles the case when feature is not turned off) and loading
> commit-graph file.
>
> Nice and simple 'split function' refactoring, with new function taking
> over when there is commit graph file prepared.
>
>
> So, after the changes:
> * parse_commit_in_graph() is responsible for checking whether to use
>    commit-graph feature and ensuring that data from commit-graph is
>    loaded, where it passes the control to parse_commit_in_graph_one()
> * parse_commit_in_graph_one() checks whether commit-graph feature is
>    turned on, whether commit we are interested in was already parsed,
>    and then uses fill_commit_in_graph() to actually get the data
> * fill_commit_in_graph() gets data out of commit-graph file, extracting
>    it from commit data chunk (and if needed large edges chunk).
>
> All those functions return 1 if they got data from commit-graph, and 0
> if they didn't.
>
>
> One minor nitpick / complaint / question is about naming of global
> variables used here, namely:
> * static struct commit_graph *commit_graph
>    from commit-graph.c for global storage of commit-graph[s] data
> * int core_commit_graph
>    from environment.c for storing core.commitGraph config
>
> But I see that at least the latter is common convention in Git source
> code; I guess that the former maybe follows convention as used for "the
> index" and "the repository" - additionally it is static / file-local.

See also `struct packed_git *packed_git;` from cache.h.

>
>> ---
>>   commit-graph.c | 18 +++++++++++++++---
>>   1 file changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 82295f0975..78ba0edc80 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -310,7 +310,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>>   	}
>>   }
>>   
>> -int parse_commit_in_graph(struct commit *item)
>> +static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
>>   {
>>   	uint32_t pos;
>>   
>> @@ -318,9 +318,21 @@ static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
>>   	if (!core_commit_graph)
>>   		return 0;
> All right, we check that commit-graph feature is enabled because
> parse_commit_in_graph_one() will be used standalone, not by being
> invoked from parse_commit_in_graph().
>
> This check is fast.
>
>>   	if (item->object.parsed)
>>   		return 1;
> Sidenote: I just wonder why object.parsed and not for example
> object.graph_pos is used to checck whether the object was filled if
> possible with commit-graph data...

Here we are filling all data, including commit date and parents. We use 
load_commit_graph_info() when we only need the graph_pos and generation 
values.

>
>> +
>> +	if (find_commit_in_graph(item, g, &pos))
>> +		return fill_commit_in_graph(item, g, pos);
>> +
>> +	return 0;
>> +}
>> +
>> +int parse_commit_in_graph(struct commit *item)
>> +{
>> +	if (!core_commit_graph)
>> +		return 0;
> All right, this check is here to short-circuit and make it so git does
> not even try to lod commit-graph file[s] if the feature is disabled.
>
>> +
>>   	prepare_commit_graph();
>> -	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
>> -		return fill_commit_in_graph(item, commit_graph, pos);
>> +	if (commit_graph)
>> +		return parse_commit_in_graph_one(commit_graph, item);
>>   	return 0;
>>   }


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
  2018-05-29  4:27       ` Junio C Hamano
@ 2018-05-29 12:37         ` Derrick Stolee
  2018-05-29 13:41           ` Junio C Hamano
  0 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-29 12:37 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee; +Cc: git, jnareb, avarab, marten.agren, peff

On 5/29/2018 12:27 AM, Junio C Hamano wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> Thanks for all the feedback on v2. I've tried to make this round's
>> review a bit easier by splitting up the commits into smaller pieces.
>> Also, the test script now has less boilerplate and uses variables and
>> clear arithmetic to explain which bytes are being modified.
>>
>> One other change worth mentioning: in "commit-graph: add '--reachable'
>> option" I put the ref-iteration into a new external
>> 'write_commit_graph_reachable()' method inside commit-graph.c. This
>> makes the 'gc: automatically write commit-graph files' a simpler change.
> I finally managed to find time to resolve conflicts this topic has
> with other topics (of your own included, if I am not mistaken).
> Please double check the resolution when I push out the day's
> integration result later today.

Sorry about the confusion. I was operating on my copy of 
ds/generation-numbers (34fdd43339) but fetching just now I see you 
updated that branch to 1472978ec6. I reset my local branch to 
ds/commit-graph-fsk (53dd1e6600). The only diff I see between my v3 
branch and that commit are the changes from ds/commit-graph-lockfile-fix 
(33286dcd6d).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 07/20] commit-graph: verify catches corrupt signature
  2018-05-28 14:05         ` Jakub Narebski
@ 2018-05-29 12:43           ` Derrick Stolee
  2018-06-02 22:30             ` Jakub Narebski
  0 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-29 12:43 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 5/28/2018 10:05 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> This is the first of several commits that add a test to check that
>> 'git commit-graph verify' catches corruption in the commit-graph
>> file. The first test checks that the command catches an error in
>> the file signature. This is a check that exists in the existing
>> commit-graph reading code.
> Good start.
>
> This handles 3 out of 5 checks in load_commit_graph_one().  The
> remaining are:
> * too short file (length smaller than minimal commit-graph size)
> * more than one chunk of one of 4 defined types
>
>> Add a helper method 'corrupt_graph_and_verify' to the test script
>> t5318-commit-graph.sh. This helper corrupts the commit-graph file
>> at a certain location, runs 'git commit-graph verify', and reports
>> the output to the 'err' file. This data is filtered to remove the
>> lines added by 'test_must_fail' when the test is run verbosely.
>> Then, the output is checked to contain a specific error message.
> Thanks for an explanation.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   t/t5318-commit-graph.sh | 43 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 43 insertions(+)
>>
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 6ca451dfd2..bd64481c7a 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full repo' '
>>   	test_cmp expect output
>>   '
>>   
>> +# the verify tests below expect the commit-graph to contain
>> +# exactly the commits reachable from the commits/8 branch.
>> +# If the file changes the set of commits in the list, then the
>> +# offsets into the binary file will result in different edits
>> +# and the tests will likely break.
>> +
>>   test_expect_success 'git commit-graph verify' '
>>   	cd "$TRASH_DIRECTORY/full" &&
>> +	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
>>   	git commit-graph verify >output
>>   '
> I don't quite understand what the change is meant to do.

This gives us a constant commit-graph file to work with in the later tests.

To get the "independent test" structure you want for the tests that are 
coming, we need to do one of the following:

1. Write a new commit-graph file for every test (slows things down).
2. Do all corruption/verify checks in a single test (reduces the 
information from a failed test, as it only reports the first failure).

I don't like either of these options, so I went with this "prepare" step.

> Also, as I said earlier, I'd prefer if tests were as indpendent of each
> other as possible, to make running individual tests (e.g. only
> previously falling tests) easier.
>
> I especially do not like mixing running actual test with setting up the
> repository for future tests, as here.
>
>>   
>> +GRAPH_BYTE_VERSION=4
>> +GRAPH_BYTE_HASH=5
>> +
>> +# usage: corrupt_graph_and_verify <position> <data> <string>
>> +# Manipulates the commit-graph file at the position
>> +# by inserting the data, then runs 'git commit-graph verify'
>> +# and places the output in the file 'err'. Test 'err' for
>> +# the given string.
> Very nice to have this description.
>
>> +corrupt_graph_and_verify() {
>> +	pos=$1
>> +	data="${2:-\0}"
>> +	grepstr=$3
>> +	cd "$TRASH_DIRECTORY/full" &&
>> +	test_when_finished mv commit-graph-backup $objdir/info/commit-graph &&
>> +	cp $objdir/info/commit-graph commit-graph-backup &&
>> +	printf "$data" | dd of="$objdir/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
> Using 'printf' with octal is much more portable than relying on 'echo'
> supporting octal escape sequences (or supporting escape sequences at
> all).
>
>> +	test_must_fail git commit-graph verify 2>test_err &&
>> +	grep -v "^+" test_err >err
>> +	grep "$grepstr" err
> Shouldn't this last 'grep' be 'test_i18ngrep' instead, to allow for
> translated messages from 'git commit-graph verify' / 'git fsck'?
>
>> +}
> This function makes actual tests short and simple, without duplicated
> code.  Very good.
>
>> +
>> +test_expect_success 'detect bad signature' '
>> +	corrupt_graph_and_verify 0 "\0" \
>> +		"graph signature"
>> +'
>> +
>> +test_expect_success 'detect bad version' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
>> +		"graph version"
>> +'
>> +
>> +test_expect_success 'detect bad hash version' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \
> When we move from SHA-1 (hash version 1) to NewHash (hash version 2),
> this test would start failing... which is actually not a bad idea.
>
>> +		"hash version"
>> +'
>> +
>>   test_done


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc'
  2018-05-29 12:37         ` Derrick Stolee
@ 2018-05-29 13:41           ` Junio C Hamano
  0 siblings, 0 replies; 149+ messages in thread
From: Junio C Hamano @ 2018-05-29 13:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git\, jnareb\, avarab\, marten.agren\, peff\

Derrick Stolee <stolee@gmail.com> writes:

> On 5/29/2018 12:27 AM, Junio C Hamano wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> Thanks for all the feedback on v2. I've tried to make this round's
>>> review a bit easier by splitting up the commits into smaller pieces.
>>> Also, the test script now has less boilerplate and uses variables and
>>> clear arithmetic to explain which bytes are being modified.
>>>
>>> One other change worth mentioning: in "commit-graph: add '--reachable'
>>> option" I put the ref-iteration into a new external
>>> 'write_commit_graph_reachable()' method inside commit-graph.c. This
>>> makes the 'gc: automatically write commit-graph files' a simpler change.
>> I finally managed to find time to resolve conflicts this topic has
>> with other topics (of your own included, if I am not mistaken).
>> Please double check the resolution when I push out the day's
>> integration result later today.
>
> Sorry about the confusion. I was operating on my copy of
> ds/generation-numbers (34fdd43339) but fetching just now I see you
> updated that branch to 1472978ec6. I reset my local branch to
> ds/commit-graph-fsk (53dd1e6600). The only diff I see between my v3
> branch and that commit are the changes from
> ds/commit-graph-lockfile-fix (33286dcd6d).

Sorry for a confusing/confused comment.  The topic I had in mind
that I saw interactions with this one was the per-in-core-repo
allocation (sb/object-store-alloc), which was not yours.

In any case, what I wanted to see sanity checked was the result of
the merge into 'pu' (which needed some semantic adjustment), not
individual patches on the topic branch.  Relative to what we see at
the tip of 'pu' right now, what I'll be pushing out in 3 or 4 hours
will gain yet another semantic adjustment, so you may want to wait
a bit more to avoid duplicated work.

Thanks.

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 16/20] commit-graph: verify contents match checksum
  2018-05-24 16:26       ` [PATCH v3 16/20] commit-graph: verify contents match checksum Derrick Stolee
@ 2018-05-30 12:35         ` SZEDER Gábor
  2018-06-02 15:52         ` Jakub Narebski
  1 sibling, 0 replies; 149+ messages in thread
From: SZEDER Gábor @ 2018-05-30 12:35 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, gitster, jnareb, stolee, avarab,
	marten.agren, peff


> diff --git a/commit-graph.c b/commit-graph.c
> index d2b291aca2..a33600c584 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -841,6 +841,7 @@ void write_commit_graph(const char *obj_dir,
>  	oids.nr =3D 0;
>  }
> =20
> +#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
>  static int verify_commit_graph_error;
> =20
>  static void graph_report(const char *fmt, ...)
> @@ -860,7 +861,9 @@ static void graph_report(const char *fmt, ...)
>  int verify_commit_graph(struct commit_graph *g)
>  {
>  	uint32_t i, cur_fanout_pos =3D 0;
> -	struct object_id prev_oid, cur_oid;
> +	struct object_id prev_oid, cur_oid, checksum;
> +	struct hashfile *f;
> +	int devnull;
> =20
>  	if (!g) {
>  		graph_report("no commit-graph file loaded");
> @@ -879,6 +882,15 @@ int verify_commit_graph(struct commit_graph *g)
>  	if (verify_commit_graph_error)
>  		return verify_commit_graph_error;
> =20
> +	devnull =3D open("/dev/null", O_WRONLY);
> +	f =3D hashfd(devnull, NULL);
> +	hashwrite(f, g->data, g->data_len - g->hash_len);
> +	finalize_hashfile(f, checksum.hash, CSUM_CLOSE);
> +	if (hashcmp(checksum.hash, g->data + g->data_len - g->hash_len)) {
> +		graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt"));

This error message is translated ...

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 240aef6add..2680a2ebff 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh

> @@ -388,4 +389,9 @@ test_expect_success 'detect incorrect parent for octopu=
> s merge' '
>  		"invalid parent"
>  '
> =20
> +test_expect_success 'detect invalid checksum hash' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
> +		"incorrect checksum"

... but here in 'corrupt_graph_and_verify' you look for "incorrect
checksum" with plain 'grep' (as opposed to 'test_i18ngrep', which
won't find that string in a GETTEXT_POISON build, and ultimately
causes the test to fail.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
  2018-05-24 16:25       ` [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup Derrick Stolee
@ 2018-05-30 13:34         ` Jakub Narebski
  2018-05-30 16:18           ` Derrick Stolee
  2018-06-02  4:38         ` Duy Nguyen
  1 sibling, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-30 13:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> In the commit-graph file, the OID fanout chunk provides an index into
> the OID lookup. The 'verify' subcommand should find incorrect values
> in the fanout.
>
> Similarly, the 'verify' subcommand should find out-of-order values in
> the OID lookup.

O.K., the OID Lookup chunk is checked together with OID Fanout chunk
because those two chunks are tightly connected: OID Fanout is fanout
into OID Lookup.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 36 ++++++++++++++++++++++++++++++++++++
>  t/t5318-commit-graph.sh | 22 ++++++++++++++++++++++
>  2 files changed, 58 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 06e3e4f9ba..cbd1aae514 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...)
>  
>  int verify_commit_graph(struct commit_graph *g)
>  {
> +	uint32_t i, cur_fanout_pos = 0;
> +	struct object_id prev_oid, cur_oid;

Minor nitpick about the naming convention: why cur_*, and not curr_*?

> +
>  	if (!g) {
>  		graph_report("no commit-graph file loaded");
>  		return 1;
> @@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g)
>  	if (!g->chunk_commit_data)
>  		graph_report("commit-graph is missing the Commit Data chunk");
>  
> +	if (verify_commit_graph_error)
> +		return verify_commit_graph_error;
> +

Before checking that commits in OID Lookup are sorted, and that OID
Fanout correctly points into OID Lookup (thus also checking that OID
Lookup is sorted), it would be good to verify that sizes of those chunks
are correct.

Out of 4 standrd chunks, 1 of them (OIDF) has constant size, and 2 of
them have size given by number of commits and hash size
 - OID Fanout is (256 * 4 bytes)
 - OID Lookup is (N * H bytes),
   where N is number of commits, and H is hash size

The one that is more significant is if number of commits gets corrupted
upwards, and walking through OID Lookup would lead us out of bounds,
outside the file size.

IIRC we have checked that all chunks fit within file size, isn't it?

> +	for (i = 0; i < g->num_commits; i++) {
> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);

Why do you use hashcpy() here, but oidcpy() below?

> +
> +		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)

All right, OIDs needs to be sorted in ascending lexicographical order,
and the above condition checks that this constraint is fullfilled. 

> +			graph_report("commit-graph has incorrect OID order: %s then %s",
> +				     oid_to_hex(&prev_oid),
> +				     oid_to_hex(&cur_oid));

Nice informative error message.

> +
> +		oidcpy(&prev_oid, &cur_oid);
> +
> +		while (cur_oid.hash[0] > cur_fanout_pos) {
> +			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
> +			if (i != fanout_value)
> +				graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
> +					     cur_fanout_pos, fanout_value, i);
> +
> +			cur_fanout_pos++;
> +		}

The k-th entry of fanout, F[k], stores the number of OIDs with first
byte at most k.

Let's examine it in detail on a simple example:

   fanout                     lookup
   ------                     ------
   0 : 0  ---------------> 0: 1xxxx....
   1 : 2  -----\           1: 1yyyy....
   2 : 2  ------\--------> 2: 3xxxx....
   3 : 3  ---------------> 3: 4xxxx....

We are checking that after most significant byte (first byte) changes,
then all fanout position up to the current first byte value (exclusive)
point to current position in OID Lookup chunk.

Seems all right; it would be nice to come up with rigorous proof that it
is all right.

> +	}
> +
> +	while (cur_fanout_pos < 256) {
> +		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
> +
> +		if (g->num_commits != fanout_value)
> +			graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
> +				     cur_fanout_pos, fanout_value, i);
> +
> +		cur_fanout_pos++;
> +	}

All right, this checks that all remaining fanout entries, up and
including the 255-ith entry store the total number of commits, N.

> +
>  	return verify_commit_graph_error;
>  }
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 4ef3fe3dc2..c050ef980b 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
>  	git commit-graph verify >output
>  '
>  
> +HASH_LEN=20
>  GRAPH_BYTE_VERSION=4
>  GRAPH_BYTE_HASH=5
>  GRAPH_BYTE_CHUNK_COUNT=6
> @@ -258,6 +259,12 @@ GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
>  			      1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
>  GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
>  				2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
> +GRAPH_FANOUT_OFFSET=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
> +			  $GRAPH_CHUNK_LOOKUP_WIDTH \* $GRAPH_CHUNK_LOOKUP_ROWS`
> +GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
> +GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
> +GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
> +GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`

Something that I forgot to write about in previous patch.

POSIX shell includes $(( ... )) for arithmetic expansion, there is
nowadays no need to use $(expr ...), and even more no need to use
pre-POSIX `expr ...`.

TLDR: use $(( ... )), not `expr ... `.

>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -312,4 +319,19 @@ test_expect_success 'detect missing commit data chunk' '
>  		"missing the Commit Data chunk"
>  '
>  
> +test_expect_success 'detect incorrect fanout' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT1 "\01" \

How can you be sure that this value is incorrect?

> +		"fanout value"
> +'
> +
> +test_expect_success 'detect incorrect fanout' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
> +		"fanout value"
> +'

I would prefer if different tests had different description.  Those two
are both 'detect incorrect layout'.  What is the difference between them?

> +
> +test_expect_success 'detect incorrect OID order' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ORDER "\01" \
> +		"incorrect OID order"
> +'

How can you be sure that this value is incorrect, or rather that it
would be always incorrect?

> +
>  test_done

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 06/20] commit-graph: add 'verify' subcommand
  2018-05-27 22:55         ` Jakub Narebski
@ 2018-05-30 16:07           ` Derrick Stolee
  2018-06-02 21:19             ` Jakub Narebski
  0 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-30 16:07 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 5/27/2018 6:55 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> If the commit-graph file becomes corrupt, we need a way to verify
>> that its contents match the object database. In the manner of
>> 'git fsck' we will implement a 'git commit-graph verify' subcommand
>> to report all issues with the file.
>>
>> Add the 'verify' subcommand to the 'commit-graph' builtin and its
>> documentation. The subcommand is currently a no-op except for
>> loading the commit-graph into memory, which may trigger run-time
>> errors that would be caught by normal use.
> So this commit is simply getting the boilerplate out of the way for
> implementing 'git commit-graph verify' subcommand.  Good.
>
>>                                             Add a simple test that
>> ensures the command returns a zero error code.
> Nice.
>
>> If no commit-graph file exists, this is an acceptable state. Do
>> not report any errors.
> All right.  I assume that as it is explicit verification call, it does
> ignore core.commitGraph setting, isn't it?
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/git-commit-graph.txt |  6 ++++++
>>   builtin/commit-graph.c             | 38 ++++++++++++++++++++++++++++++++++++++
>>   commit-graph.c                     | 26 ++++++++++++++++++++++++++
>>   commit-graph.h                     |  2 ++
>>   t/t5318-commit-graph.sh            | 10 ++++++++++
>>   5 files changed, 82 insertions(+)
>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index 4c97b555cc..a222cfab08 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -10,6 +10,7 @@ SYNOPSIS
>>   --------
>>   [verse]
>>   'git commit-graph read' [--object-dir <dir>]
>> +'git commit-graph verify' [--object-dir <dir>]
>>   'git commit-graph write' <options> [--object-dir <dir>]
> In alphabetical order, good.
>
>>   
>>   
>> @@ -52,6 +53,11 @@ existing commit-graph file.
>>   Read a graph file given by the commit-graph file and output basic
>>   details about the graph file. Used for debugging purposes.
>>   
>> +'verify'::
>> +
>> +Read the commit-graph file and verify its contents against the object
>> +database. Used to check for corrupted data.
>> +
> All right, good enough description.
>
>>   
>>   EXAMPLES
>>   --------
>> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
>> index f0875b8bf3..0433dd6e20 100644
>> --- a/builtin/commit-graph.c
>> +++ b/builtin/commit-graph.c
>> @@ -8,10 +8,16 @@
>>   static char const * const builtin_commit_graph_usage[] = {
>>   	N_("git commit-graph [--object-dir <objdir>]"),
>>   	N_("git commit-graph read [--object-dir <objdir>]"),
>> +	N_("git commit-graph verify [--object-dir <objdir>]"),
>>   	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
>>   	NULL
>>   };
> In alphabetical order, same as in the manpage for git-commit-graph.
>
>>   
>> +static const char * const builtin_commit_graph_verify_usage[] = {
>> +	N_("git commit-graph verify [--object-dir <objdir>]"),
>> +	NULL
>> +};
>> +
>>   static const char * const builtin_commit_graph_read_usage[] = {
>>   	N_("git commit-graph read [--object-dir <objdir>]"),
>>   	NULL
>> @@ -29,6 +35,36 @@ static struct opts_commit_graph {
>>   	int append;
>>   } opts;
>>   
>> +
>> +static int graph_verify(int argc, const char **argv)
>> +{
>> +	struct commit_graph *graph = 0;
>> +	char *graph_name;
>> +
>> +	static struct option builtin_commit_graph_verify_options[] = {
>> +		OPT_STRING(0, "object-dir", &opts.obj_dir,
>> +			   N_("dir"),
>> +			   N_("The object directory to store the graph")),
>> +		OPT_END(),
>> +	};
>> +
>> +	argc = parse_options(argc, argv, NULL,
>> +			     builtin_commit_graph_verify_options,
>> +			     builtin_commit_graph_verify_usage, 0);
>> +
>> +	if (!opts.obj_dir)
>> +		opts.obj_dir = get_object_directory();
> Getting the boilerplate of implementing the command mostly out of the
> way.  Good.
>
>> +
>> +	graph_name = get_commit_graph_filename(opts.obj_dir);
>> +	graph = load_commit_graph_one(graph_name);
> So we are verifying only the commit-graph file belonging directly to
> current repository, as I have expected.  This is needed to for warnings
> and error messages from the 'verify' action, to be able to tell in which
> file there are problems.
>
> This means that it is possible that there would be problems with
> commit-graph files that running 'git commit-graph verify' would not
> find, because they are in commit-graph file in one of the alternates.
>
> It is very easy, though, to check all commit-graph files that would be
> read and its data concatenated when using commit-graph feature
> (e.g. 'git commit-graph read', IIRC):
>
>    $ git commit-graph verify
>    $ for obj_dir in $(cat .git/objects/info/alternates) do;
>          git commit-graph --object-dir="$obj_dir";
>      done
>
> Note: I have not checked the above that it works.
>
>> +	FREE_AND_NULL(graph_name);
> Freeing the resources, always nice to have.
>
>> +
>> +	if (!graph)
>> +		return 0;
> DS> If no commit-graph file exists, this is an acceptable state. Do
> DS> not report any errors.
>
> Right, non existant commit-graph file is certainly valid ;-)
>
>> +
>> +	return verify_commit_graph(graph);
>> +}
> I guess that graph_verify() would not change much, if at all, in
> subsequent commits in this patch series.
>
>> +
>>   static int graph_read(int argc, const char **argv)
>>   {
>>   	struct commit_graph *graph = NULL;
>> @@ -163,6 +199,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>>   			     PARSE_OPT_STOP_AT_NON_OPTION);
>>   
>>   	if (argc > 0) {
>> +		if (!strcmp(argv[0], "verify"))
>> +			return graph_verify(argc, argv);
>>   		if (!strcmp(argv[0], "read"))
>>   			return graph_read(argc, argv);
>>   		if (!strcmp(argv[0], "write"))
> Not in alphabetical order... is there a reason for that?
This will be fixed in v4.
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 25893ec096..55b41664ee 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -836,3 +836,29 @@ void write_commit_graph(const char *obj_dir,
>>   	oids.alloc = 0;
>>   	oids.nr = 0;
>>   }
>> +
>> +static int verify_commit_graph_error;
>> +
>> +static void graph_report(const char *fmt, ...)
>> +{
>> +	va_list ap;
>> +	struct strbuf sb = STRBUF_INIT;
>> +	verify_commit_graph_error = 1;
>> +
>> +	va_start(ap, fmt);
>> +	strbuf_vaddf(&sb, fmt, ap);
>> +
>> +	fprintf(stderr, "%s\n", sb.buf);
>> +	strbuf_release(&sb);
>> +	va_end(ap);
> Why do you use strbuf_vaddf + fprintf instead of straighforward
> vfprintf (or function instead of variable-level macro)?
>
> Is it because of [string] safety?

It's because I've never used this variable-parameter thing before and 
found a different example.

I'll use vfprintf() in v4, as it is simpler.


>> +}
>> +
>> +int verify_commit_graph(struct commit_graph *g)
>> +{
>> +	if (!g) {
>> +		graph_report("no commit-graph file loaded");
>> +		return 1;
>> +	}
> All right, this is just a placeholder - we should not ever get this
> message because in this case we exit with error code of 0 (EXIT_SUCCESS)
> if there is no commit-graph file loaded before invoking
> verify_commit_graph().
>
>> +
>> +	return verify_commit_graph_error;
> All right, this is for the future.  Good.
>
>> +}
>> diff --git a/commit-graph.h b/commit-graph.h
>> index 96cccb10f3..71a39c5a57 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
>>   			int nr_commits,
>>   			int append);
>>   
>> +int verify_commit_graph(struct commit_graph *g);
>> +
> Why does this need to be exported?  I think it is not used outside of
> commit-graph.c, isn't it?

Used by builtin/commit-graph.c


>
>>   #endif
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 77d85aefe7..6ca451dfd2 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
>>   	objdir=".git/objects"
>>   '
>>   
>> +test_expect_success 'verify graph with no graph file' '
>> +	cd "$TRASH_DIRECTORY/full" &&
> Is sich bare `cd`, without corresponding `cd` back or using subshell
> safe?
>
>> +	git commit-graph verify
>> +'
>> +
>>   test_expect_success 'write graph with no packs' '
>>   	cd "$TRASH_DIRECTORY/full" &&
>>   	git commit-graph write --object-dir . &&
>> @@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
>>   	test_cmp expect output
>>   '
>>   
>> +test_expect_success 'git commit-graph verify' '
>> +	cd "$TRASH_DIRECTORY/full" &&
>> +	git commit-graph verify >output
>> +'
> Those are tests with nearly the same code, but they are (by their
> descriptions) testing different things.  This means that they rely on
> side effects of earlier tests.
>
> This is suboptimal, as it means that it would be impossible or very
> difficult to run individual tests (e.g. with GIT_SKIP_TESTS environment
> variable, or with an individual test suite --run option), unless you
> know which tests setup the repository state for later tests.
>
> It also means that running only failed tests with prove
> --state=failed,save or equivalently with
>
>    $ make DEFAULT_TEST_TARGET=prove GIT_PROVE_OPTS='--state=failed,save' test
>
> wouldn't work correctly.
>
> As Johannes Schindelin (alias Dscho) said in latest Git Rev News
> interview: https://git.github.io/rev_news/2018/05/16/edition-39/
>
> JS> We have a test suite where debugging a regression may mean that you
> JS> have to run 98 test cases before the failing one every single time in
> JS> the edit/compile/debug cycle, because the 99th test case may depend on
> JS> a side effect of at least one of the preceding test cases. Git’s test
> JS> suite is so not [21st century best practices][1].
> JS>
> JS> [1]: https://www.slideshare.net/BuckHodges/lessons-learned-doing-devops-at-scale-at-microsoft
>
>
> I think can be solved quite efficiently by creating and using shell
> function, or two shell functions, which would either:
>
>   * rename commit-graph file to some other temporary name if it exists,
>     and move it back after the test.
>   * create commit-graph file if it does not exist.
>
> For example (untested):
>
>    prepare_no_commit_graph() {
>    	mv .git/info/commit-graph .git/info/commit-graph.away &&
>    	test_when_finished "mv .git/info/commit-graph.away .git/info/commit-graph"
>    }
>
>    prepare_commit_graph() {
>    	if ! test -f ".git/info/commit-graph"
>    	then
>    		git commit-graph write
>    	fi
>    }
>
> Or something like that.

Do we have a way to run individual steps of the test suite? I am unfamiliar with that process.

Adding the complexity of storing a copy of the commit-graph file for re-use in a later test is wasted energy right now, because we need to run the steps of the test that create the repo shape with the commits laid out as set earlier in the test. This shape changes as we test different states of the commit-graph (exists and contains all commits, exists and doesn't contain all commits, etc.)

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
  2018-05-30 13:34         ` Jakub Narebski
@ 2018-05-30 16:18           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-30 16:18 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff


On 5/30/2018 9:34 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> In the commit-graph file, the OID fanout chunk provides an index into
>> the OID lookup. The 'verify' subcommand should find incorrect values
>> in the fanout.
>>
>> Similarly, the 'verify' subcommand should find out-of-order values in
>> the OID lookup.
> O.K., the OID Lookup chunk is checked together with OID Fanout chunk
> because those two chunks are tightly connected: OID Fanout is fanout
> into OID Lookup.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c          | 36 ++++++++++++++++++++++++++++++++++++
>>   t/t5318-commit-graph.sh | 22 ++++++++++++++++++++++
>>   2 files changed, 58 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 06e3e4f9ba..cbd1aae514 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -855,6 +855,9 @@ static void graph_report(const char *fmt, ...)
>>   
>>   int verify_commit_graph(struct commit_graph *g)
>>   {
>> +	uint32_t i, cur_fanout_pos = 0;
>> +	struct object_id prev_oid, cur_oid;
> Minor nitpick about the naming convention: why cur_*, and not curr_*?
>
>> +
>>   	if (!g) {
>>   		graph_report("no commit-graph file loaded");
>>   		return 1;
>> @@ -869,5 +872,38 @@ int verify_commit_graph(struct commit_graph *g)
>>   	if (!g->chunk_commit_data)
>>   		graph_report("commit-graph is missing the Commit Data chunk");
>>   
>> +	if (verify_commit_graph_error)
>> +		return verify_commit_graph_error;
>> +
> Before checking that commits in OID Lookup are sorted, and that OID
> Fanout correctly points into OID Lookup (thus also checking that OID
> Lookup is sorted), it would be good to verify that sizes of those chunks
> are correct.
>
> Out of 4 standrd chunks, 1 of them (OIDF) has constant size, and 2 of
> them have size given by number of commits and hash size
>   - OID Fanout is (256 * 4 bytes)
>   - OID Lookup is (N * H bytes),
>     where N is number of commits, and H is hash size
>
> The one that is more significant is if number of commits gets corrupted
> upwards, and walking through OID Lookup would lead us out of bounds,
> outside the file size.
>
> IIRC we have checked that all chunks fit within file size, isn't it?
>
>> +	for (i = 0; i < g->num_commits; i++) {
>> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> Why do you use hashcpy() here, but oidcpy() below?

We are copying from a section of binary data, not from a struct object_id *.

>
>> +
>> +		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
> All right, OIDs needs to be sorted in ascending lexicographical order,
> and the above condition checks that this constraint is fullfilled.
>
>> +			graph_report("commit-graph has incorrect OID order: %s then %s",
>> +				     oid_to_hex(&prev_oid),
>> +				     oid_to_hex(&cur_oid));
> Nice informative error message.
>
>> +
>> +		oidcpy(&prev_oid, &cur_oid);
>> +
>> +		while (cur_oid.hash[0] > cur_fanout_pos) {
>> +			uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
>> +			if (i != fanout_value)
>> +				graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
>> +					     cur_fanout_pos, fanout_value, i);
>> +
>> +			cur_fanout_pos++;
>> +		}
> The k-th entry of fanout, F[k], stores the number of OIDs with first
> byte at most k.
>
> Let's examine it in detail on a simple example:
>
>     fanout                     lookup
>     ------                     ------
>     0 : 0  ---------------> 0: 1xxxx....
>     1 : 2  -----\           1: 1yyyy....
>     2 : 2  ------\--------> 2: 3xxxx....
>     3 : 3  ---------------> 3: 4xxxx....
>
> We are checking that after most significant byte (first byte) changes,
> then all fanout position up to the current first byte value (exclusive)
> point to current position in OID Lookup chunk.
>
> Seems all right; it would be nice to come up with rigorous proof that it
> is all right.
>
>> +	}
>> +
>> +	while (cur_fanout_pos < 256) {
>> +		uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
>> +
>> +		if (g->num_commits != fanout_value)
>> +			graph_report("commit-graph has incorrect fanout value: fanout[%d] = %u != %u",
>> +				     cur_fanout_pos, fanout_value, i);
>> +
>> +		cur_fanout_pos++;
>> +	}
> All right, this checks that all remaining fanout entries, up and
> including the 255-ith entry store the total number of commits, N.
>
>> +
>>   	return verify_commit_graph_error;
>>   }
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 4ef3fe3dc2..c050ef980b 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
>>   	git commit-graph verify >output
>>   '
>>   
>> +HASH_LEN=20
>>   GRAPH_BYTE_VERSION=4
>>   GRAPH_BYTE_HASH=5
>>   GRAPH_BYTE_CHUNK_COUNT=6
>> @@ -258,6 +259,12 @@ GRAPH_BYTE_OID_LOOKUP_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
>>   			      1 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
>>   GRAPH_BYTE_COMMIT_DATA_ID=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
>>   				2 \* $GRAPH_CHUNK_LOOKUP_WIDTH`
>> +GRAPH_FANOUT_OFFSET=`expr $GRAPH_CHUNK_LOOKUP_OFFSET + \
>> +			  $GRAPH_CHUNK_LOOKUP_WIDTH \* $GRAPH_CHUNK_LOOKUP_ROWS`
>> +GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
>> +GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
>> +GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
>> +GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
> Something that I forgot to write about in previous patch.
>
> POSIX shell includes $(( ... )) for arithmetic expansion, there is
> nowadays no need to use $(expr ...), and even more no need to use
> pre-POSIX `expr ...`.
>
> TLDR: use $(( ... )), not `expr ... `.

I'll use this convention in v4. Thanks!

>
>>   
>>   # usage: corrupt_graph_and_verify <position> <data> <string>
>>   # Manipulates the commit-graph file at the position
>> @@ -312,4 +319,19 @@ test_expect_success 'detect missing commit data chunk' '
>>   		"missing the Commit Data chunk"
>>   '
>>   
>> +test_expect_success 'detect incorrect fanout' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT1 "\01" \
> How can you be sure that this value is incorrect?

The repo is created using constant information (the commit/author dates 
are set by environment variables in the test environment). The 
commit-graph file that is generated by the test script is identical each 
time.


>
>> +		"fanout value"
>> +'
>> +
>> +test_expect_success 'detect incorrect fanout' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_FANOUT2 "\01" \
>> +		"fanout value"
>> +'
> I would prefer if different tests had different description.  Those two
> are both 'detect incorrect layout'.  What is the difference between them?
I'll specify this one is for the final value.
>
>> +
>> +test_expect_success 'detect incorrect OID order' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_ORDER "\01" \
>> +		"incorrect OID order"
>> +'
> How can you be sure that this value is incorrect, or rather that it
> would be always incorrect?
>
>> +
>>   test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 10/20] commit-graph: verify objects exist
  2018-05-24 16:25       ` [PATCH v3 10/20] commit-graph: verify objects exist Derrick Stolee
@ 2018-05-30 19:22         ` Jakub Narebski
  2018-05-31 12:53           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-30 19:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> In the 'verify' subcommand, load commits directly from the object
> database to ensure they exist. Parse by skipping the commit-graph.

All right, before we check that the commit data matches, we need to
check that all the commits in cache (in the serialized commit graph) are
present in real data (in the object database of the repository).

What's nice of this series is that the operation that actually removes
unreachable commits from the object database, that is `git gc`, would
also update commit-gaph file.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 20 ++++++++++++++++++++
>  t/t5318-commit-graph.sh |  7 +++++++
>  2 files changed, 27 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cbd1aae514..0420ebcd87 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>  {
>  	struct commit *c;
>  	struct object_id oid;
> +
> +	if (pos >= g->num_commits)
> +		die("invalid parent position %"PRIu64, pos);
> +

This change is not described in the commit message.

>  	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
>  	c = lookup_commit(&oid);
>  	if (!c)
> @@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g)
>  		cur_fanout_pos++;
>  	}
>  
> +	if (verify_commit_graph_error)
> +		return verify_commit_graph_error;

All right, so we by default short-circuit so that errors found earlier
wouldn't cause crash when checking other things.

Is it needed, though, in this case?  Though I guess it is better to be
conservative; lso by terminating after a batch of one type of errors we
don't get many different error messages from the same error (i.e. error
propagation).

> +
> +	for (i = 0; i < g->num_commits; i++) {
> +		struct commit *odb_commit;
> +
> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> +
> +		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());

Do we really need to keep all those commits from the object database in
memory (in the object::obj_hash hash table)?  Perhaps using something
like Flywheel / Recycler pattern would be a better solution (if
possible)?

Though perhaps this doesn't matter much with respect to memory use.

> +		if (parse_commit_internal(odb_commit, 0, 0)) {

Just a reminder to myself: the signature is

  int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)


Hmmm... I wonder if with two boolean paramaters wouldn't it be better to
use flags parameter, i.e.

  int parse_commit_internal(struct commit *item, int flags)

  ...

  parse_commit_internal(commit, QUIET_ON_MISSING | USE_COMMIT_GRAPH)

But I guess that it is not worth it (especially for internal-ish
function).

> +			graph_report("failed to parse %s from object database",
> +				     oid_to_hex(&cur_oid));

Wouldn't parse_commit_internal() show it's own error message, in
addition to the one above?

> +			continue;
> +		}
> +	}
> +
>  	return verify_commit_graph_error;
>  }
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index c050ef980b..996a016239 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
>  	git commit-graph verify >output
>  '
>  
> +NUM_COMMITS=9
>  HASH_LEN=20
>  GRAPH_BYTE_VERSION=4
>  GRAPH_BYTE_HASH=5
> @@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
>  GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
>  GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
>  GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
> +GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`

All right, so we modify 10-the byte of OID of 5-th commit out of 9,
am I correct here?

>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' '
>  		"incorrect OID order"
>  '
>  
> +test_expect_success 'detect OID not in object database' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \
> +		"from object database"
> +'

I guess that if we ensure that OIDs are constant, you have chosen the
change to actually corrupt the OID in OID Lookup chunk to point to OID
that is not in the object database (while still not violating the
constraint that OID in OID Lookup chunk needs to be sorted).

> +
>  test_done

All right (well, except for `expr ... ` --> $(( ... )) change).

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 11/20] commit-graph: verify root tree OIDs
  2018-05-24 16:25       ` [PATCH v3 11/20] commit-graph: verify root tree OIDs Derrick Stolee
@ 2018-05-30 22:24         ` Jakub Narebski
  2018-05-31 13:16           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-05-30 22:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The 'verify' subcommand must compare the commit content parsed from the
> commit-graph and compare it against the content in the object database.

You have "compare" twice in the above sentence.

> Use lookup_commit() and parse_commit_in_graph_one() to parse the commits
> from the graph and compare against a commit that is loaded separately
> and parsed directly from the object database.

All right, that looks like a nice extension of what was done in previous
patch.  We want to check that cache (serialized commit graph) matches
reality (object database).

>
> Add checks for the root tree OID.

All right; isn't it that now we check almost all information from
commit-graph that hs match in object database (with exception of commit
parents, I think).

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 17 ++++++++++++++++-
>  t/t5318-commit-graph.sh |  7 +++++++
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 0420ebcd87..19ea369fc6 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g)
>  		return verify_commit_graph_error;

NOTE: we will be checking Commit Data chunk; I think it would be good
idea to verify that size of Commit Data chunk matches (N * (H + 16) bytes)
that format gives us, so that we don't accidentally red outside of
memory if something got screwed up (like number of commits being wrong,
or file truncated).

>  
>  	for (i = 0; i < g->num_commits; i++) {
> +		struct commit *graph_commit;
> +
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>  
>  		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
> @@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g)
>  
>  			cur_fanout_pos++;
>  		}
> +
> +		graph_commit = lookup_commit(&cur_oid);

So now I see why we add all commits to memory (to hash structure).

> +		if (!parse_commit_in_graph_one(g, graph_commit))
> +			graph_report("failed to parse %s from commit-graph",
> +				     oid_to_hex(&cur_oid));

All right, this verifies that commit in OID Lookup chunk has parse-able
data in Commit Data chunk, isn't it?

>  	}
>  
>  	while (cur_fanout_pos < 256) {
> @@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g)
>  		return verify_commit_graph_error;
>  
>  	for (i = 0; i < g->num_commits; i++) {
> -		struct commit *odb_commit;
> +		struct commit *graph_commit, *odb_commit;
>  
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>  
> +		graph_commit = lookup_commit(&cur_oid);
>  		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());

All right, so we have commit data from graph, and commit data from the
object database.

>  		if (parse_commit_internal(odb_commit, 0, 0)) {
>  			graph_report("failed to parse %s from object database",
>  				     oid_to_hex(&cur_oid));
>  			continue;
>  		}
> +
> +		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
> +			   get_commit_tree_oid(odb_commit)))
> +			graph_report("root tree OID for commit %s in commit-graph is %s != %s",
> +				     oid_to_hex(&cur_oid),
> +				     oid_to_hex(get_commit_tree_oid(graph_commit)),
> +				     oid_to_hex(get_commit_tree_oid(odb_commit)));

Maybe explicitly say that it doesn't match the value from the object
database; OTOH this may be too verbose.

>  	}
>  
>  	return verify_commit_graph_error;
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 996a016239..21cc8e82f3 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
>  GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
>  GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
>  GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
> +GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS`
> +GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET

All right, so the first is entry into record in Commit Data chunk, and
the latter points into tree entry in this record -- which entry is first
field:

  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
    * The first H bytes are for the OID of the root tree.

>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -341,4 +343,9 @@ test_expect_success 'detect OID not in object database' '
>  		"from object database"
>  '
>  
> +test_expect_success 'detect incorrect tree OID' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_TREE "\01" \
> +		"root tree OID for commit"
> +'

All right.

I wonder if we can create a test for first check added (that Commit
Graph data parses correctly), that is the one with the following error
message:

  "failed to parse <OID> from commit-graph file".

> +
>  test_done

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 10/20] commit-graph: verify objects exist
  2018-05-30 19:22         ` Jakub Narebski
@ 2018-05-31 12:53           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-05-31 12:53 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 5/30/2018 3:22 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> In the 'verify' subcommand, load commits directly from the object
>> database to ensure they exist. Parse by skipping the commit-graph.
> All right, before we check that the commit data matches, we need to
> check that all the commits in cache (in the serialized commit graph) are
> present in real data (in the object database of the repository).
>
> What's nice of this series is that the operation that actually removes
> unreachable commits from the object database, that is `git gc`, would
> also update commit-gaph file.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c          | 20 ++++++++++++++++++++
>>   t/t5318-commit-graph.sh |  7 +++++++
>>   2 files changed, 27 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index cbd1aae514..0420ebcd87 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -238,6 +238,10 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
>>   {
>>   	struct commit *c;
>>   	struct object_id oid;
>> +
>> +	if (pos >= g->num_commits)
>> +		die("invalid parent position %"PRIu64, pos);
>> +
> This change is not described in the commit message.
This change should go in "commit-graph: verify parent list" which adds a 
test that fails without it. Thanks.

>>   	hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
>>   	c = lookup_commit(&oid);
>>   	if (!c)
>> @@ -905,5 +909,21 @@ int verify_commit_graph(struct commit_graph *g)
>>   		cur_fanout_pos++;
>>   	}
>>   
>> +	if (verify_commit_graph_error)
>> +		return verify_commit_graph_error;
> All right, so we by default short-circuit so that errors found earlier
> wouldn't cause crash when checking other things.
>
> Is it needed, though, in this case?  Though I guess it is better to be
> conservative; lso by terminating after a batch of one type of errors we
> don't get many different error messages from the same error (i.e. error
> propagation).
>
>> +
>> +	for (i = 0; i < g->num_commits; i++) {
>> +		struct commit *odb_commit;
>> +
>> +		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>> +
>> +		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
> Do we really need to keep all those commits from the object database in
> memory (in the object::obj_hash hash table)?  Perhaps using something
> like Flywheel / Recycler pattern would be a better solution (if
> possible)?
>
> Though perhaps this doesn't matter much with respect to memory use.
>
>> +		if (parse_commit_internal(odb_commit, 0, 0)) {
> Just a reminder to myself: the signature is
>
>    int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_commit_graph)
>
>
> Hmmm... I wonder if with two boolean paramaters wouldn't it be better to
> use flags parameter, i.e.
>
>    int parse_commit_internal(struct commit *item, int flags)
>
>    ...
>
>    parse_commit_internal(commit, QUIET_ON_MISSING | USE_COMMIT_GRAPH)
>
> But I guess that it is not worth it (especially for internal-ish
> function).
>
>> +			graph_report("failed to parse %s from object database",
>> +				     oid_to_hex(&cur_oid));
> Wouldn't parse_commit_internal() show it's own error message, in
> addition to the one above?
>
>> +			continue;
>> +		}
>> +	}
>> +
>>   	return verify_commit_graph_error;
>>   }
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index c050ef980b..996a016239 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -247,6 +247,7 @@ test_expect_success 'git commit-graph verify' '
>>   	git commit-graph verify >output
>>   '
>>   
>> +NUM_COMMITS=9
>>   HASH_LEN=20
>>   GRAPH_BYTE_VERSION=4
>>   GRAPH_BYTE_HASH=5
>> @@ -265,6 +266,7 @@ GRAPH_BYTE_FANOUT1=`expr $GRAPH_FANOUT_OFFSET + 4 \* 4`
>>   GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
>>   GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
>>   GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
>> +GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
> All right, so we modify 10-the byte of OID of 5-th commit out of 9,
> am I correct here?
>
>>   
>>   # usage: corrupt_graph_and_verify <position> <data> <string>
>>   # Manipulates the commit-graph file at the position
>> @@ -334,4 +336,9 @@ test_expect_success 'detect incorrect OID order' '
>>   		"incorrect OID order"
>>   '
>>   
>> +test_expect_success 'detect OID not in object database' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_OID_LOOKUP_MISSING "\01" \
>> +		"from object database"
>> +'
> I guess that if we ensure that OIDs are constant, you have chosen the
> change to actually corrupt the OID in OID Lookup chunk to point to OID
> that is not in the object database (while still not violating the
> constraint that OID in OID Lookup chunk needs to be sorted).
>
>> +
>>   test_done
> All right (well, except for `expr ... ` --> $(( ... )) change).
>


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 11/20] commit-graph: verify root tree OIDs
  2018-05-30 22:24         ` Jakub Narebski
@ 2018-05-31 13:16           ` Derrick Stolee
  2018-06-02 22:50             ` Jakub Narebski
  0 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-05-31 13:16 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 5/30/2018 6:24 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The 'verify' subcommand must compare the commit content parsed from the
>> commit-graph and compare it against the content in the object database.
> You have "compare" twice in the above sentence.
>
>> Use lookup_commit() and parse_commit_in_graph_one() to parse the commits
>> from the graph and compare against a commit that is loaded separately
>> and parsed directly from the object database.
> All right, that looks like a nice extension of what was done in previous
> patch.  We want to check that cache (serialized commit graph) matches
> reality (object database).
>
>> Add checks for the root tree OID.
> All right; isn't it that now we check almost all information from
> commit-graph that hs match in object database (with exception of commit
> parents, I think).
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c          | 17 ++++++++++++++++-
>>   t/t5318-commit-graph.sh |  7 +++++++
>>   2 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 0420ebcd87..19ea369fc6 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -880,6 +880,8 @@ int verify_commit_graph(struct commit_graph *g)
>>   		return verify_commit_graph_error;
> NOTE: we will be checking Commit Data chunk; I think it would be good
> idea to verify that size of Commit Data chunk matches (N * (H + 16) bytes)
> that format gives us, so that we don't accidentally red outside of
> memory if something got screwed up (like number of commits being wrong,
> or file truncated).

This is actually how we calculate 'num_commits' during 
load_commit_graph_one():

     if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
     {
         graph->num_commits = (chunk_offset - last_chunk_offset)
                              / graph->hash_len;
     }

So, if the chunk doesn't match N*(H+16), we detect this because 
FANOUT[255] != N.

(There is one caveat here: (chunk_offset - last_chunk_offset) may not be 
a multiple of hash_len, and "accidentally" truncate to N in the 
division. I'll add more careful checks for this.)

We also check out-of-bounds offsets in that method.

>
>>   
>>   	for (i = 0; i < g->num_commits; i++) {
>> +		struct commit *graph_commit;
>> +
>>   		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>>   
>>   		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
>> @@ -897,6 +899,11 @@ int verify_commit_graph(struct commit_graph *g)
>>   
>>   			cur_fanout_pos++;
>>   		}
>> +
>> +		graph_commit = lookup_commit(&cur_oid);
> So now I see why we add all commits to memory (to hash structure).
>
>> +		if (!parse_commit_in_graph_one(g, graph_commit))
>> +			graph_report("failed to parse %s from commit-graph",
>> +				     oid_to_hex(&cur_oid));
> All right, this verifies that commit in OID Lookup chunk has parse-able
> data in Commit Data chunk, isn't it?
>
>>   	}
>>   
>>   	while (cur_fanout_pos < 256) {
>> @@ -913,16 +920,24 @@ int verify_commit_graph(struct commit_graph *g)
>>   		return verify_commit_graph_error;
>>   
>>   	for (i = 0; i < g->num_commits; i++) {
>> -		struct commit *odb_commit;
>> +		struct commit *graph_commit, *odb_commit;
>>   
>>   		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>>   
>> +		graph_commit = lookup_commit(&cur_oid);
>>   		odb_commit = (struct commit *)create_object(cur_oid.hash, alloc_commit_node());
> All right, so we have commit data from graph, and commit data from the
> object database.
>
>>   		if (parse_commit_internal(odb_commit, 0, 0)) {
>>   			graph_report("failed to parse %s from object database",
>>   				     oid_to_hex(&cur_oid));
>>   			continue;
>>   		}
>> +
>> +		if (oidcmp(&get_commit_tree_in_graph_one(g, graph_commit)->object.oid,
>> +			   get_commit_tree_oid(odb_commit)))
>> +			graph_report("root tree OID for commit %s in commit-graph is %s != %s",
>> +				     oid_to_hex(&cur_oid),
>> +				     oid_to_hex(get_commit_tree_oid(graph_commit)),
>> +				     oid_to_hex(get_commit_tree_oid(odb_commit)));
> Maybe explicitly say that it doesn't match the value from the object
> database; OTOH this may be too verbose.
>
>>   	}
>>   
>>   	return verify_commit_graph_error;
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 996a016239..21cc8e82f3 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -267,6 +267,8 @@ GRAPH_BYTE_FANOUT2=`expr $GRAPH_FANOUT_OFFSET + 4 \* 255`
>>   GRAPH_OID_LOOKUP_OFFSET=`expr $GRAPH_FANOUT_OFFSET + 4 \* 256`
>>   GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
>>   GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
>> +GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS`
>> +GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
> All right, so the first is entry into record in Commit Data chunk, and
> the latter points into tree entry in this record -- which entry is first
> field:
>
>    Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
>      * The first H bytes are for the OID of the root tree.
>
>>   
>>   # usage: corrupt_graph_and_verify <position> <data> <string>
>>   # Manipulates the commit-graph file at the position
>> @@ -341,4 +343,9 @@ test_expect_success 'detect OID not in object database' '
>>   		"from object database"
>>   '
>>   
>> +test_expect_success 'detect incorrect tree OID' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_TREE "\01" \
>> +		"root tree OID for commit"
>> +'
> All right.
>
> I wonder if we can create a test for first check added (that Commit
> Graph data parses correctly), that is the one with the following error
> message:
>
>    "failed to parse <OID> from commit-graph file".
>
>> +
>>   test_done
> Regards,


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 12/20] commit-graph: verify parent list
  2018-05-24 16:25       ` [PATCH v3 12/20] commit-graph: verify parent list Derrick Stolee
@ 2018-06-01 23:21         ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-06-01 23:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph file stores parents in a two-column portion of the
> commit data chunk. If there is only one parent, then the second column
> stores 0xFFFFFFFF to indicate no second parent.

All right, it is certainly nice to have this information, but isn't it
something that one caan find in Documentation/technical/commit-graph-format.txt?

>
> The 'verify' subcommand checks the parent list for the commit loaded
> from the commit-graph and the one parsed from the object database. Test
> these checks for corrupt parents, too many parents, and wrong parents.
>
> The octopus merge will be tested in a later commit.

Does this mean that after this commit but before the next one the
'verify' subcommand would have false negatives for octopus merges
(falsely indicating that commit-graph is invalid)?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 25 +++++++++++++++++++++++++
>  t/t5318-commit-graph.sh | 18 ++++++++++++++++++
>  2 files changed, 43 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 19ea369fc6..fff22dc0c3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -921,6 +921,7 @@ int verify_commit_graph(struct commit_graph *g)
>  
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
> +		struct commit_list *graph_parents, *odb_parents;
>  
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>  
> @@ -938,6 +939,30 @@ int verify_commit_graph(struct commit_graph *g)
>  				     oid_to_hex(&cur_oid),
>  				     oid_to_hex(get_commit_tree_oid(graph_commit)),
>  				     oid_to_hex(get_commit_tree_oid(odb_commit)));
> +
> +		graph_parents = graph_commit->parents;
> +		odb_parents = odb_commit->parents;
> +
> +		while (graph_parents) {
> +			if (odb_parents == NULL) {
> +				graph_report("commit-graph parent list for commit %s is too long",
> +					     oid_to_hex(&cur_oid));
> +				break;
> +			}

All right, so this would catch the situation where there are more
parents for a commit in commit-graph than they are in the object
database version.

> +
> +			if (oidcmp(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
> +				graph_report("commit-graph parent for %s is %s != %s",
> +					     oid_to_hex(&cur_oid),
> +					     oid_to_hex(&graph_parents->item->object.oid),
> +					     oid_to_hex(&odb_parents->item->object.oid));

All right, so this would catch the situation where parents do not match
between commit-graph and the object database.

> +
> +			graph_parents = graph_parents->next;
> +			odb_parents = odb_parents->next;
> +		}
> +
> +		if (odb_parents != NULL)
> +			graph_report("commit-graph parent list for commit %s terminates early",
> +				     oid_to_hex(&cur_oid));

So this would catch the situation where there are more parents for a
commit in the object database than they are in the commit-graph.  Does
this handle octopus merges automatically, or is it left for the future
work/commit?

>  	}
>  
>  	return verify_commit_graph_error;
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 21cc8e82f3..12f0d7f54d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -269,6 +269,9 @@ GRAPH_BYTE_OID_LOOKUP_ORDER=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 8`
>  GRAPH_BYTE_OID_LOOKUP_MISSING=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* 4 + 10`
>  GRAPH_COMMIT_DATA_OFFSET=`expr $GRAPH_OID_LOOKUP_OFFSET + $HASH_LEN \* $NUM_COMMITS`
>  GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
> +GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
> +GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
> +GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -348,4 +351,19 @@ test_expect_success 'detect incorrect tree OID' '
>  		"root tree OID for commit"
>  '
>  
> +test_expect_success 'detect incorrect parent int-id' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_PARENT "\01" \
> +		"invalid parent"
> +'

So this would actually be caught by code introduced earlier, and not by
the one introduced in this commit -- but logically this test belongs
here, ian't it?

> +
> +test_expect_success 'detect extra parent int-id' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_EXTRA_PARENT "\00" \
> +		"is too long"
> +'

O.K., so the commit has one parent and we have corrupted it to read as
if it had more than one (and commit-graph would then have more parents
than reality, that is the object database).

Sidenote: I think we can use regexp for checking if the error message
matches, isn't it?

> +
> +test_expect_success 'detect incorrect tree OID' '

Errr... what?  The name of this test seems very wrong...

> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_WRONG_PARENT "\01" \
> +		"commit-graph parent for"
> +'

So here you modify the prent list in commit graph so that the commit
number points fits within commit-graph; it would of course make the
commit-graph and the object database version of parents do not match.
Good.

> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
  2018-05-24 16:25       ` [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup Derrick Stolee
  2018-05-30 13:34         ` Jakub Narebski
@ 2018-06-02  4:38         ` Duy Nguyen
  2018-06-04 11:32           ` Derrick Stolee
  1 sibling, 1 reply; 149+ messages in thread
From: Duy Nguyen @ 2018-06-02  4:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, jnareb, stolee, avarab, marten.agren, peff

On Thu, May 24, 2018 at 6:25 PM, Derrick Stolee <dstolee@microsoft.com> wrote:
> +               if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
> +                       graph_report("commit-graph has incorrect OID order: %s then %s",
> +                                    oid_to_hex(&prev_oid),
> +                                    oid_to_hex(&cur_oid));

Should these strings be marked for translation with _()?
-- 
Duy

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 13/20] commit-graph: verify generation number
  2018-05-24 16:25       ` [PATCH v3 13/20] commit-graph: verify generation number Derrick Stolee
@ 2018-06-02 12:23         ` Jakub Narebski
  2018-06-04 11:47           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 12:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> While iterating through the commit parents, perform the generation
> number calculation and compare against the value stored in the
> commit-graph.

All right, that's good.

What about commit-graph files that have GENERATION_NUMBER_ZERO for all
its commits (because we verify single commit-graph file only, there
wouldn't be GENERATION_NUMBER_ZERO mixed with non-zero generation
numbers)?

Unless we can assume that no commit-graph file in the wild would have
GENERATION_NUMBER_ZERO.

>
> The tests demonstrate that having a different set of parents affects
> the generation number calculation, and this value propagates to
> descendants. Hence, we drop the single-line condition on the output.

I don't understand what part of changes this paragraph of the commit
message refers to.

Anyway, changing parents may not lead to changed generation numbers;
take for example commit with single parent, which we change to other
commit with the same generation number.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 18 ++++++++++++++++++
>  t/t5318-commit-graph.sh |  6 ++++++

Sidenote: I have just realized that it may be better to put
validation-related tests into different test file.

>  2 files changed, 24 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index fff22dc0c3..ead92460c1 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -922,6 +922,7 @@ int verify_commit_graph(struct commit_graph *g)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> +		uint32_t max_generation = 0;
>  
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>  
> @@ -956,6 +957,9 @@ int verify_commit_graph(struct commit_graph *g)
>  					     oid_to_hex(&graph_parents->item->object.oid),
>  					     oid_to_hex(&odb_parents->item->object.oid));
>  
> +			if (graph_parents->item->generation > max_generation)
> +				max_generation = graph_parents->item->generation;
> +

All right, that calculates the maximum of generation number of commit
parents.

>  			graph_parents = graph_parents->next;
>  			odb_parents = odb_parents->next;
>  		}
> @@ -963,6 +967,20 @@ int verify_commit_graph(struct commit_graph *g)
>  		if (odb_parents != NULL)
>  			graph_report("commit-graph parent list for commit %s terminates early",
>  				     oid_to_hex(&cur_oid));
> +
> +		/*
> +		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> +		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> +		 * extra logic in the following condition.
> +		 */

Nice trick.

> +		if (max_generation == GENERATION_NUMBER_MAX)
> +			max_generation--;

What about GENERATION_NUMBER_ZERO?

> +
> +		if (graph_commit->generation != max_generation + 1)
> +			graph_report("commit-graph generation for commit %s is %u != %u",
> +				     oid_to_hex(&cur_oid),
> +				     graph_commit->generation,
> +				     max_generation + 1);

I think we should also check that generation numbers do not exceed
GENERATION_NUMBER_MAX... unless it is already taken care of?

>  	}
>  
>  	return verify_commit_graph_error;
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 12f0d7f54d..673b0d37d5 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -272,6 +272,7 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
>  GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
>  GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
>  GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
> +GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -366,4 +367,9 @@ test_expect_success 'detect incorrect tree OID' '
>  		"commit-graph parent for"
>  '
>  
> +test_expect_success 'detect incorrect generation number' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \

I assume that you have checked that it actually corrupts generation
number (without affecting commit date).

> +		"generation"

A very minor nitpick: Not "generation for commit"?

> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 14/20] commit-graph: verify commit date
  2018-05-24 16:25       ` [PATCH v3 14/20] commit-graph: verify commit date Derrick Stolee
@ 2018-06-02 12:29         ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 12:29 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

Nice and simple.  The only possible question may be the ordering of
patches in the series, namely whether this change should be before or
after test checking generation numbers.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 6 ++++++
>  t/t5318-commit-graph.sh | 6 ++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index ead92460c1..d2b291aca2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -981,6 +981,12 @@ int verify_commit_graph(struct commit_graph *g)
>  				     oid_to_hex(&cur_oid),
>  				     graph_commit->generation,
>  				     max_generation + 1);
> +
> +		if (graph_commit->date != odb_commit->date)
> +			graph_report("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime,
> +				     oid_to_hex(&cur_oid),
> +				     graph_commit->date,
> +				     odb_commit->date);

All right.  Very straightforward, good.

>  	}
>  
>  	return verify_commit_graph_error;
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 673b0d37d5..58adb8246d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -273,6 +273,7 @@ GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
>  GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
>  GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
>  GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
> +GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -372,4 +373,9 @@ test_expect_success 'detect incorrect generation number' '
>  		"generation"
>  '
>  
> +test_expect_success 'detect incorrect commit date' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_DATE "\01" \
> +		"commit date"
> +'

All right.

> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 15/20] commit-graph: test for corrupted octopus edge
  2018-05-24 16:25       ` [PATCH v3 15/20] commit-graph: test for corrupted octopus edge Derrick Stolee
@ 2018-06-02 12:39         ` Jakub Narebski
  2018-06-04 13:08           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 12:39 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph file has an extra chunk to store the parent int-ids for
> parents beyond the first parent for octopus merges. Our test repo has a
> single octopus merge that we can manipulate to demonstrate the 'verify'
> subcommand detects incorrect values in that chunk.

If I understand it correctly the above means that our _reading_ code
checks for validity (which then 'git commit-graph verify' uses), just
there were not any tests for that.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t5318-commit-graph.sh | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 58adb8246d..240aef6add 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -248,6 +248,7 @@ test_expect_success 'git commit-graph verify' '
>  '
>  
>  NUM_COMMITS=9
> +NUM_OCTOPUS_EDGES=2
>  HASH_LEN=20
>  GRAPH_BYTE_VERSION=4
>  GRAPH_BYTE_HASH=5
> @@ -274,6 +275,10 @@ GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
>  GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
>  GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
>  GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
> +GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
> +GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
> +				$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
> +GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -378,4 +383,9 @@ test_expect_success 'detect incorrect commit date' '
>  		"commit date"
>  '
>  
> +test_expect_success 'detect incorrect parent for octopus merge' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_OCTOPUS "\01" \
> +		"invalid parent"
> +'

So we change the int-id to non-existing commit, and check that
commit-graph code checks for that.

What about the case when there are octopus merges, but no EDGE chunk
(which I think we can emulate by changing / corrupting number of
chunks)?

What about the case where int-id of edge in EDGE chunk is correct, that
is points to a valid commit, but does not agree with what is in the
object database (what parents octopus merge has in reality)?

Do we detect the situation where the second parent value in the commit
data stores an array position within a Large Edge chunk, but we do not
reach a value with the most-significant bit on when reaching the end of
Large Edge chunk?

> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 16/20] commit-graph: verify contents match checksum
  2018-05-24 16:26       ` [PATCH v3 16/20] commit-graph: verify contents match checksum Derrick Stolee
  2018-05-30 12:35         ` SZEDER Gábor
@ 2018-06-02 15:52         ` Jakub Narebski
  2018-06-04 11:55           ` Derrick Stolee
  1 sibling, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 15:52 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph file ends with a SHA1 hash of the previous contents. If
> a commit-graph file has errors but the checksum hash is correct, then we
> know that the problem is a bug in Git and not simply file corruption
> after-the-fact.
>
> Compute the checksum right away so it is the first error that appears,
> and make the message translatable since this error can be "corrected" by
> a user by simply deleting the file and recomputing. The rest of the
> errors are useful only to developers.

Should we then provide --quiet / --verbose options, so that ordinary
user is not flooded with error messages meant for power users and Git
developers, then?

>
> Be sure to continue checking the rest of the file data if the checksum
> is wrong. This is important for our tests, as we break the checksum as
> we modify bytes of the commit-graph file.

Well, we could have used sha1sum program, or test-sha1 helper to fix the
checksum after corrupting the commit-graph file...

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c          | 16 ++++++++++++++--
>  t/t5318-commit-graph.sh |  6 ++++++
>  2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index d2b291aca2..a33600c584 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -841,6 +841,7 @@ void write_commit_graph(const char *obj_dir,
>  	oids.nr = 0;
>  }
>  
> +#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
>  static int verify_commit_graph_error;
>  
>  static void graph_report(const char *fmt, ...)
> @@ -860,7 +861,9 @@ static void graph_report(const char *fmt, ...)
>  int verify_commit_graph(struct commit_graph *g)
>  {
>  	uint32_t i, cur_fanout_pos = 0;
> -	struct object_id prev_oid, cur_oid;
> +	struct object_id prev_oid, cur_oid, checksum;
> +	struct hashfile *f;
> +	int devnull;
>  
>  	if (!g) {
>  		graph_report("no commit-graph file loaded");
> @@ -879,6 +882,15 @@ int verify_commit_graph(struct commit_graph *g)
>  	if (verify_commit_graph_error)
>  		return verify_commit_graph_error;
>  
> +	devnull = open("/dev/null", O_WRONLY);
> +	f = hashfd(devnull, NULL);
> +	hashwrite(f, g->data, g->data_len - g->hash_len);
> +	finalize_hashfile(f, checksum.hash, CSUM_CLOSE);
> +	if (hashcmp(checksum.hash, g->data + g->data_len - g->hash_len)) {
> +		graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt"));
> +		verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH;
> +	}

Is it the best way of calculating the SHA-1 checksum that out internal
APIs provide?  Is it how SHA-1 checksum is calculated and checked for
packfiles?

> +
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit;
>  
> @@ -916,7 +928,7 @@ int verify_commit_graph(struct commit_graph *g)
>  		cur_fanout_pos++;
>  	}
>  
> -	if (verify_commit_graph_error)
> +	if (verify_commit_graph_error & ~VERIFY_COMMIT_GRAPH_ERROR_HASH)
>  		return verify_commit_graph_error;

So if we detected that checksum do not match, but we have not found an
error, we say that it is all right?

>  
>  	for (i = 0; i < g->num_commits; i++) {
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 240aef6add..2680a2ebff 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -279,6 +279,7 @@ GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
>  GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
>  				$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
>  GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
> +GRAPH_BYTE_FOOTER=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4 \* $NUM_OCTOPUS_EDGES`
>  
>  # usage: corrupt_graph_and_verify <position> <data> <string>
>  # Manipulates the commit-graph file at the position
> @@ -388,4 +389,9 @@ test_expect_success 'detect incorrect parent for octopus merge' '
>  		"invalid parent"
>  '
>  
> +test_expect_success 'detect invalid checksum hash' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
> +		"incorrect checksum"

This would not work under GETTEXT_POISON, as the message is marked as
translatable, but corrupt_graph_and_verify uses 'grep' and not
'test_i18grep' from t/test-lib-functions.sh

> +'

If it is pure checksum corruption, wouldn't this fail because it is not
a failure (exit code is 0)?

> +
>  test_done

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 17/20] fsck: verify commit-graph
  2018-05-24 16:26       ` [PATCH v3 17/20] fsck: verify commit-graph Derrick Stolee
@ 2018-06-02 16:17         ` Jakub Narebski
  2018-06-04 11:59           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 16:17 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> If core.commitGraph is true, verify the contents of the commit-graph
> during 'git fsck' using the 'git commit-graph verify' subcommand. Run
> this check on all alternates, as well.

All right, so we have one config variable to control the use of
serialized commit-graph feaature.  Nice.

>
> We use a new process for two reasons:
>
> 1. The subcommand decouples the details of loading and verifying a
>    commit-graph file from the other fsck details.

All right, I can agree with that.

On the other hand using subcommand makes debugging harder, though not in
this case (well separated functionality that can be easily called with a
standalone command to be debugged).

>
> 2. The commit-graph verification requires the commits to be loaded
>    in a specific order to guarantee we parse from the commit-graph
>    file for some objects and from the object database for others.

I don't quite understand this.  Could you explain it in more detail?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-fsck.txt |  3 +++
>  builtin/fsck.c             | 21 +++++++++++++++++++++
>  t/t5318-commit-graph.sh    |  8 ++++++++
>  3 files changed, 32 insertions(+)
>
> diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
> index b9f060e3b2..ab9a93fb9b 100644
> --- a/Documentation/git-fsck.txt
> +++ b/Documentation/git-fsck.txt
> @@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives
>  (i.e., you can just remove them and do an 'rsync' with some other site in
>  the hopes that somebody else has the object you have corrupted).
>  
> +If core.commitGraph is true, the commit-graph file will also be inspected

Shouldn't we use `core.commitGraph` here?

> +using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
> +
>  Extracted Diagnostics
>  ---------------------
>  
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index ef78c6c00c..a6d5045b77 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -16,6 +16,7 @@
>  #include "streaming.h"
>  #include "decorate.h"
>  #include "packfile.h"
> +#include "run-command.h"
>  
>  #define REACHABLE 0x0001
>  #define SEEN      0x0002
> @@ -45,6 +46,7 @@ static int name_objects;
>  #define ERROR_REACHABLE 02
>  #define ERROR_PACK 04
>  #define ERROR_REFS 010
> +#define ERROR_COMMIT_GRAPH 020

Minor nitpick and a sidenote: I wonder if it wouldn't be better to
either use hexadecimal constants, or use (1 << n) for all ERROR_*
preprocesor constants.

>  
>  static const char *describe_object(struct object *obj)
>  {
> @@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>  	}
>  
>  	check_connectivity();
> +
> +	if (core_commit_graph) {
> +		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
> +		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL };

I see that NULL at index 2 and 3 (at 3rd and 4th place) are here for
"--object-dir" and <alternates-object-dir-path>, the last one is
terminator for that case, but what is next to last NULL (at 5th place)
for?

> +		commit_graph_verify.argv = verify_argv;
> +		commit_graph_verify.git_cmd = 1;
> +
> +		if (run_command(&commit_graph_verify))
> +			errors_found |= ERROR_COMMIT_GRAPH;
> +
> +		prepare_alt_odb();
> +		for (alt = alt_odb_list; alt; alt = alt->next) {
> +			verify_argv[2] = "--object-dir";
> +			verify_argv[3] = alt->path;
> +			if (run_command(&commit_graph_verify))
> +				errors_found |= ERROR_COMMIT_GRAPH;
> +		}
> +	}

For performance reasons it may be better to start those 'git
commit-graph verify' commands asynchronously earlier, so that they can
run in parallel / concurrently wth other checks, and wait for them and
get their error code at the end of git-fsck run.

But that is probably better left for a separate commit.

> +
>  	return errors_found;
>  }
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 2680a2ebff..4941937163 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -394,4 +394,12 @@ test_expect_success 'detect invalid checksum hash' '
>  		"incorrect checksum"
>  '
>  
> +test_expect_success 'git fsck (checks commit-graph)' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git fsck &&
> +	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
> +		"incorrect checksum" &&
> +	test_must_fail git fsck
> +'

All right; though the same caveats apply as with previous commit in
series.  Perhaps it would be better to truncate commit-graph file, or
corrupt it in some 'random' place.

> +
>  test_done

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 18/20] commit-graph: add '--reachable' option
  2018-05-24 16:26       ` [PATCH v3 18/20] commit-graph: add '--reachable' option Derrick Stolee
@ 2018-06-02 17:34         ` Jakub Narebski
  2018-06-04 12:44           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 17:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> When writing commit-graph files, it can be convenient to ask for all
> reachable commits (starting at the ref set) in the resulting file. This
> is particularly helpful when writing to stdin is complicated, such as a
> future integration with 'git gc' which will call
> write_commit_graph_reachable() after performing cleanup of the object
> database.

Nice.

The last sentence of the commit message is a bit long, though, in my
opinion.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt |  8 ++++++--
>  builtin/commit-graph.c             | 16 ++++++++++++----
>  commit-graph.c                     | 32 ++++++++++++++++++++++++++++++++
>  commit-graph.h                     |  1 +
>  t/t5318-commit-graph.sh            | 10 ++++++++++
>  5 files changed, 61 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index a222cfab08..dececb79d7 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
>  +
>  With the `--stdin-packs` option, generate the new commit graph by
>  walking objects only in the specified pack-indexes. (Cannot be combined
> -with --stdin-commits.)
> +with `--stdin-commits` or `--reachable`.)
>  +
>  With the `--stdin-commits` option, generate the new commit graph by
>  walking commits starting at the commits specified in stdin as a list
>  of OIDs in hex, one OID per line. (Cannot be combined with
> ---stdin-packs.)
> +`--stdin-packs` or `--reachable`.)
> ++
> +With the `--reachable` option, generate the new commit graph by walking
> +commits starting at all refs. (Cannot be combined with `--stdin-commits`
> +or `--stdin-packs`.)

All right (though I wonder a bit about the restriction).

I think it might be a good idea to describe all of this in the usage
string for the 'git commit-graph write', instead of using '<options>'
placeholder, that is instead of current:

  'git commit-graph write' <options> [--object-dir <dir>]

use

  'git commit-graph write' [--stdin-commits | --stdin-packs | --reachable]
                           [--append] [--object-dir <dir>]

or something like that.

>  +
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 0433dd6e20..20ce6437ae 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph [--object-dir <objdir>]"),
>  	N_("git commit-graph read [--object-dir <objdir>]"),
>  	N_("git commit-graph verify [--object-dir <objdir>]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),

All right, very straightforward.  I guess they are put in [almost]
alphabetical order, or is there some other reasoning behind the ordering
used (which is different from the one in the manpage)?

>  	NULL
>  };
>  
> @@ -24,12 +24,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),

The same.

>  	NULL
>  };
>  
>  static struct opts_commit_graph {
>  	const char *obj_dir;
> +	int reachable;
>  	int stdin_packs;
>  	int stdin_commits;
>  	int append;
> @@ -130,6 +131,8 @@ static int graph_write(int argc, const char **argv)
>  		OPT_STRING(0, "object-dir", &opts.obj_dir,
>  			N_("dir"),
>  			N_("The object directory to store the graph")),
> +		OPT_BOOL(0, "reachable", &opts.reachable,
> +			N_("start walk at all refs")),

Errr... does '--no-reachable' makes sense?  Because if I am right
currently it is supported, isn't it.

>  		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
>  			N_("scan pack-indexes listed by stdin for commits")),
>  		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
> @@ -143,11 +146,16 @@ static int graph_write(int argc, const char **argv)
>  			     builtin_commit_graph_write_options,
>  			     builtin_commit_graph_write_usage, 0);
>  
> -	if (opts.stdin_packs && opts.stdin_commits)
> -		die(_("cannot use both --stdin-commits and --stdin-packs"));
> +	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)

Nice trick.

> +		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));

It is a pity that parseopt API does not have direct support for mutually
exclusive groups of boolean options, like ArgumentParser.add_mutually_exclusive_group()
in Python's argparse.

Still, you need to use what it is there.

>  	if (!opts.obj_dir)
>  		opts.obj_dir = get_object_directory();
>  
> +	if (opts.reachable) {
> +		write_commit_graph_reachable(opts.obj_dir, opts.append);
> +		return 0;
> +	}

Just using the option.

> +
>  	if (opts.stdin_packs || opts.stdin_commits) {
>  		struct strbuf buf = STRBUF_INIT;
>  		lines_nr = 0;
> diff --git a/commit-graph.c b/commit-graph.c
> index a33600c584..057d734926 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -6,6 +6,7 @@
>  #include "packfile.h"
>  #include "commit.h"
>  #include "object.h"
> +#include "refs.h"
>  #include "revision.h"
>  #include "sha1-lookup.h"
>  #include "commit-graph.h"
> @@ -651,6 +652,37 @@ static void compute_generation_numbers(struct packed_commit_list* commits)
>  	}
>  }
>  
> +struct hex_list {
> +	char **hex_strs;
> +	int hex_nr;
> +	int hex_alloc;
> +};

Is this what git-for-each-ref / git-branch / git-tag uses?

Would it be possible to use for example string-list API (documented in
string-list.h) instead?  Anyway, it looks like the use of allocation
growing API is simple enough... though perhaps it could be made simpler
by noticing that all strings have the same width.

> +
> +static int add_ref_to_list(const char *refname,
> +			   const struct object_id *oid,
> +			   int flags, void *cb_data)
> +{
> +	struct hex_list *list = (struct hex_list*)cb_data;
> +
> +	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
> +	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
> +	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));

Wouldn't it be better to use strdup or xstrdup instead of
xcalloc+strcpy?

> +	list->hex_nr++;
> +	return 0;
> +}
> +
> +void write_commit_graph_reachable(const char *obj_dir, int append)
> +{
> +	struct hex_list list;
> +	list.hex_nr = 0;
> +	list.hex_alloc = 128;
> +	ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
> +
> +	for_each_ref(add_ref_to_list, &list);
> +
> +	write_commit_graph(obj_dir, NULL, 0, (const char **)list.hex_strs, list.hex_nr, append);

Where do we free the allocated data and allocated strings?  If they are
cleaned by process exit, perhaps they need to be UNLEAK-ed?

> +}
> +
>  void write_commit_graph(const char *obj_dir,
>  			const char **pack_indexes,
>  			int nr_packs,
> diff --git a/commit-graph.h b/commit-graph.h
> index 71a39c5a57..9a06a5f188 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -46,6 +46,7 @@ struct commit_graph {
>  
>  struct commit_graph *load_commit_graph_one(const char *graph_file);
>  
> +void write_commit_graph_reachable(const char *obj_dir, int append);
>  void write_commit_graph(const char *obj_dir,
>  			const char **pack_indexes,
>  			int nr_packs,
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 4941937163..a659620332 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' '
>  graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
>  graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
>  
> +test_expect_success 'build graph using --reachable' '
> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit-graph write --reachable &&
> +	test_path_is_file $objdir/info/commit-graph &&
> +	graph_read_expect "11" "large_edges"
> +'

All right, here we check that commit-graph has expected features (11
commits, and large_edges optional chunk).

Perhaps we could also check that different equivalent ways of creating
serialized commit graph file produce byte-for-byte identical file?

> +
> +graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> +graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2

All right, this supposedly tests that behavior does not change whether
we are using or we are not using the commit-graph feature... but I have
just noticed that raph_git_two_modes() uses `git -c core.graph=true`;
shouldn't it be `git -c core.commitGraph=true`?

> +
>  test_expect_success 'setup bare repo' '
>  	cd "$TRASH_DIRECTORY" &&
>  	git clone --bare --no-local full bare &&

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 19/20] gc: automatically write commit-graph files
  2018-05-24 16:26       ` [PATCH v3 19/20] gc: automatically write commit-graph files Derrick Stolee
@ 2018-06-02 18:03         ` Jakub Narebski
  2018-06-04 12:51           ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 18:03 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph file is a very helpful feature for speeding up git
> operations. In order to make it more useful, write the commit-graph file
> by default during standard garbage collection operations.

I think you meant here "make it possible to write the commit-graph file
during standard garbage collection operations." (i.e. add "make it
possible" because it hides behind new config option, and remove "by
default" because currently it is not turned on by default).

>
> Add a 'gc.commitGraph' config setting that triggers writing a
> commit-graph file after any non-trivial 'git gc' command. Defaults to
> false while the commit-graph feature matures. We specifically do not
> want to turn this on by default until the commit-graph feature is fully

s/turn this on/have this on/  I think.

> integrated with history-modifying features like shallow clones.

Two things.

First, shallow clones, replacement mechanims (git-replace) and grafts
are not "history-modifying" features; this name is in my opinion
reserved for history-rewriting features such as interactive rebase, the
`git filter-branch` feature or out-of-tree BFG Repo Cleaner or
reposurgeon tools.  They alter the _view_ of history; they should be
IMVHO named "history-view-altering" features -- though I agree this is
mouthful.

Second, shouldn't we, as Martin Ågren said, warn about the issue in the
manpage for git-commit-graph?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/config.txt |  6 ++++++
>  Documentation/git-gc.txt |  4 ++++
>  builtin/gc.c             |  6 ++++++
>  t/t5318-commit-graph.sh  | 14 ++++++++++++++
>  4 files changed, 30 insertions(+)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 11f027194e..9a3abd87e7 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1553,6 +1553,12 @@ gc.autoDetach::
>  	Make `git gc --auto` return immediately and run in background
>  	if the system supports it. Default is true.
>  
> +gc.commitGraph::
> +	If true, then gc will rewrite the commit-graph file after any
> +	change to the object database. If '--auto' is used, then the
> +	commit-graph will not be updated unless the threshold is met.

What threshold?  Ah, thresholds defined for `git gc --auto` (gc.auto,
gc.autoPackLimit, gc.logExpiry,...).

> +	See linkgit:git-commit-graph[1] for details.

You missed declaring the default value for this config option.

> +
>  gc.logExpiry::
>  	If the file gc.log exists, then `git gc --auto` won't run
>  	unless that file is more than 'gc.logExpiry' old.  Default is
> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> index 571b5a7e3c..17dd654a59 100644
> --- a/Documentation/git-gc.txt
> +++ b/Documentation/git-gc.txt
> @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
>  it within all non-bare repos or it can be set to a boolean value.
>  This defaults to true.
>  
> +The optional configuration variable 'gc.commitGraph' determines if
> +'git gc' runs 'git commit-graph write'. This can be set to a boolean

Should it be "runs" or "should run"?

> +value. This defaults to false.

Should it be '...' or `...`?  Below we have `gc.aggresiveWindow`, above
we have 'gc.commitGraph', for example.

> +
>  The optional configuration variable `gc.aggressiveWindow` controls how
>  much time is spent optimizing the delta compression of the objects in
>  the repository when the --aggressive option is specified.  The larger
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 77fa720bd0..efd214a59f 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -20,6 +20,7 @@
>  #include "argv-array.h"
>  #include "commit.h"
>  #include "packfile.h"
> +#include "commit-graph.h"
>  
>  #define FAILED_RUN "failed to run %s"
>  
> @@ -34,6 +35,7 @@ static int aggressive_depth = 50;
>  static int aggressive_window = 250;
>  static int gc_auto_threshold = 6700;
>  static int gc_auto_pack_limit = 50;
> +static int gc_commit_graph = 0;
>  static int detach_auto = 1;
>  static timestamp_t gc_log_expire_time;
>  static const char *gc_log_expire = "1.day.ago";
> @@ -121,6 +123,7 @@ static void gc_config(void)
>  	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
>  	git_config_get_int("gc.auto", &gc_auto_threshold);
>  	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
> +	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
>  	git_config_get_bool("gc.autodetach", &detach_auto);
>  	git_config_get_expiry("gc.pruneexpire", &prune_expire);
>  	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
> @@ -480,6 +483,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	if (pack_garbage.nr > 0)
>  		clean_pack_garbage();
>  
> +	if (gc_commit_graph)
> +		write_commit_graph_reachable(get_object_directory(), 0);
> +

Nice.

Though now I wonder when appending should be used...

>  	if (auto_gc && too_many_loose_objects())
>  		warning(_("There are too many unreachable loose objects; "
>  			"run 'git prune' to remove them."));
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index a659620332..d20b17586f 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -245,6 +245,20 @@ test_expect_success 'perform fast-forward merge in full repo' '
>  	test_cmp expect output
>  '
>  
> +test_expect_success 'check that gc clears commit-graph' '

I wouldn't use the word "clears" here...

> +	cd "$TRASH_DIRECTORY/full" &&
> +	git commit --allow-empty -m "blank" &&
> +	git commit-graph write --reachable &&
> +	cp $objdir/info/commit-graph commit-graph-before-gc &&
> +	git reset --hard HEAD~1 &&
> +	git config gc.commitGraph true &&
> +	git gc &&
> +	cp $objdir/info/commit-graph commit-graph-after-gc &&
> +	! test_cmp commit-graph-before-gc commit-graph-after-gc &&
> +	git commit-graph write --reachable &&
> +	test_cmp commit-graph-after-gc $objdir/info/commit-graph
> +'

...but otherwise, nice test: it checks that git-gc after rewriting
history changes commit-graph file, and that the changed file is what we
expect it to be (note: here we compare commit-graph files directly, and
not just check the features via 'git commit-graph read').

> +
>  # the verify tests below expect the commit-graph to contain
>  # exactly the commits reachable from the commits/8 branch.
>  # If the file changes the set of commits in the list, then the

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 20/20] commit-graph: update design document
  2018-05-24 16:26       ` [PATCH v3 20/20] commit-graph: update design document Derrick Stolee
@ 2018-06-02 18:27         ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 18:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git\, gitster\, stolee\, avarab\, marten.agren\, peff\

Derrick Stolee <dstolee@microsoft.com> writes:

> The commit-graph feature is now integrated with 'fsck' and 'gc',
> so remove those items from the "Future Work" section of the
> commit-graph design document.

It is always nice to have such commit as a summary what was done in the
series, and to have up to date roadmap.

>
> Also remove the section on lazy-loading trees, as that was completed
> in an earlier patch series.

Admittedly, this part could have been sent in a separate patch at the
start of the series, but it doesn't matter at all; no need for extra
work.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 22 ----------------------
>  1 file changed, 22 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index e1a883eb46..c664acbd76 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -118,9 +118,6 @@ Future Work
>  - The commit graph feature currently does not honor commit grafts. This can
>    be remedied by duplicating or refactoring the current graft logic.
>  
> -- The 'commit-graph' subcommand does not have a "verify" mode that is
> -  necessary for integration with fsck.
> -

All right.

>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
> @@ -130,25 +127,6 @@ Future Work
>      - 'log --topo-order'
>      - 'tag --merged'
>  
> -- Currently, parse_commit_gently() requires filling in the root tree
> -  object for a commit. This passes through lookup_tree() and consequently
> -  lookup_object(). Also, it calls lookup_commit() when loading the parents.
> -  These method calls check the ODB for object existence, even if the
> -  consumer does not need the content. For example, we do not need the
> -  tree contents when computing merge bases. Now that commit parsing is
> -  removed from the computation time, these lookup operations are the
> -  slowest operations keeping graph walks from being fast. Consider
> -  loading these objects without verifying their existence in the ODB and
> -  only loading them fully when consumers need them. Consider a method
> -  such as "ensure_tree_loaded(commit)" that fully loads a tree before
> -  using commit->tree.

All right, this is about the change done in previous series.

> -
> -- The current design uses the 'commit-graph' subcommand to generate the graph.
> -  When this feature stabilizes enough to recommend to most users, we should
> -  add automatic graph writes to common operations that create many commits.
> -  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
> -  commands.

All right; actually it was done by augmenting 'gc' instead.

> -
>  - A server could provide a commit graph file as part of the network protocol
>    to avoid extra calculations by clients. This feature is only of benefit if
>    the user is willing to trust the file, because verifying the file is correct

Good work,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE
  2018-05-26 20:30           ` brian m. carlson
@ 2018-06-02 19:43             ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 19:43 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Derrick Stolee, git, gitster\, stolee\, avarab\, marten.agren\, peff\

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On Sat, May 26, 2018 at 08:46:09PM +0200, Jakub Narebski wrote:
>> One issue: in the future when Git moves to NewHash, it could encounter
>> then both commit-graph files using SHA-1 and using NewHash.  What about
>> GRPH_OID_LEN then: for one of those it would be incorrect.  Unless it is
>> about minimal length of checksum, that is we assume that NewHash would
>> be longer than SHA-1, but ten why name it GRAPH_OID_LEN?
>
> My proposal is that whatever we're using in the .git directory is
> consistent.  If we're using SHA-1 for objects, then everything is SHA-1.
> If we're using NewHash for objects, then all data is stored in NewHash
> (except translation tables and such).  Any conversions between SHA-1 and
> NewHash require a lookup through the standard techniques.
>
> I agree that here it would be more helpful if it were a reference to
> the_hash_algo, and I've applied a patch to my object-id-part14 series to
> make that conversion.

All right, I can agree that it would make most sense to always use SHA-1
for OID, or always use NewHash for objects.  This would make
commit-graph file with SHA-1 hash invalid for NewHash-using Git version.

It would be nice, however, to avoid having to redo all the hard work,
like calculating generation numbers (from old commit-graph file, or from
server that does not support NewHash yet -- the latter is not
implemented, but IIUC planned feature).  But we can do it with explicit
conversion step, e.g. 'git commit-graph convert' or 'upgrade'.

But all that is in the future.
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 06/20] commit-graph: add 'verify' subcommand
  2018-05-30 16:07           ` Derrick Stolee
@ 2018-06-02 21:19             ` Jakub Narebski
  2018-06-04 11:30               ` Derrick Stolee
  0 siblings, 1 reply; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 21:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git\, gitster\, avarab\, marten.agren\, peff\

Derrick Stolee <stolee@gmail.com> writes:
> On 5/27/2018 6:55 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
[...]
>>> +static int verify_commit_graph_error;
>>> +
>>> +static void graph_report(const char *fmt, ...)
>>> +{
>>> +	va_list ap;
>>> +	struct strbuf sb = STRBUF_INIT;
>>> +	verify_commit_graph_error = 1;
>>> +
>>> +	va_start(ap, fmt);
>>> +	strbuf_vaddf(&sb, fmt, ap);
>>> +
>>> +	fprintf(stderr, "%s\n", sb.buf);
>>> +	strbuf_release(&sb);
>>> +	va_end(ap);
>>
>> Why do you use strbuf_vaddf + fprintf instead of straighforward
>> vfprintf (or function instead of variable-level macro)?
>>
>> Is it because of [string] safety?
>
> It's because I've never used this variable-parameter thing before and
> found a different example.
>
> I'll use vfprintf() in v4, as it is simpler.

All right, if it is not dangerous, then simpler is better.

Sidenote: such error messaging is often handled by variadic macros,
e.g.:

  #define eprintf(...) fprintf(stderr, __VA_ARGS__)

[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index 96cccb10f3..71a39c5a57 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
>>>   			int nr_commits,
>>>   			int append);
>>>   +int verify_commit_graph(struct commit_graph *g);
>>> +
>> Why does this need to be exported?  I think it is not used outside of
>> commit-graph.c, isn't it?
>
> Used by builtin/commit-graph.c

Ah, true.

[...]
>>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>>> index 77d85aefe7..6ca451dfd2 100755
>>> --- a/t/t5318-commit-graph.sh
>>> +++ b/t/t5318-commit-graph.sh
>>> @@ -11,6 +11,11 @@ test_expect_success 'setup full repo' '
>>>   	objdir=".git/objects"
>>>   '
>>> +test_expect_success 'verify graph with no graph file' '
>>> +	cd "$TRASH_DIRECTORY/full" &&
>>> +	git commit-graph verify
>>> +'
>>> +
>>>   test_expect_success 'write graph with no packs' '
>>>   	cd "$TRASH_DIRECTORY/full" &&
>>>   	git commit-graph write --object-dir . &&
>>> @@ -230,4 +235,9 @@ test_expect_success 'perform fast-forward merge in full repo' '
>>>   	test_cmp expect output
>>>   '
>>>   +test_expect_success 'git commit-graph verify' '
>>> +	cd "$TRASH_DIRECTORY/full" &&
>>> +	git commit-graph verify >output
>>> +'
>> Those are tests with nearly the same code, but they are (by their
>> descriptions) testing different things.  This means that they rely on
>> side effects of earlier tests.
>>
>> This is suboptimal, as it means that it would be impossible or very
>> difficult to run individual tests (e.g. with GIT_SKIP_TESTS environment
>> variable, or with an individual test suite --run option), unless you
>> know which tests setup the repository state for later tests.
>>
>> It also means that running only failed tests with prove
>> --state=failed,save or equivalently with
>>
>>    $ make DEFAULT_TEST_TARGET=prove GIT_PROVE_OPTS='--state=failed,save' test
>>
>> wouldn't work correctly.
>>
>> As Johannes Schindelin (alias Dscho) said in latest Git Rev News
>> interview: https://git.github.io/rev_news/2018/05/16/edition-39/
>>
>> JS> We have a test suite where debugging a regression may mean that you
>> JS> have to run 98 test cases before the failing one every single time in
>> JS> the edit/compile/debug cycle, because the 99th test case may depend on
>> JS> a side effect of at least one of the preceding test cases. Git’s test
>> JS> suite is so not [21st century best practices][1].
>> JS>
>> JS> [1]: https://www.slideshare.net/BuckHodges/lessons-learned-doing-devops-at-scale-at-microsoft
>>
>>
>> I think can be solved quite efficiently by creating and using shell
>> function, or two shell functions, which would either:
>>
>>   * rename commit-graph file to some other temporary name if it exists,
>>     and move it back after the test.
>>   * create commit-graph file if it does not exist.
>>
>> For example (untested):
>>
>>    prepare_no_commit_graph() {
>>    	mv .git/info/commit-graph .git/info/commit-graph.away &&
>>    	test_when_finished "mv .git/info/commit-graph.away .git/info/commit-graph"
>>    }
>>
>>    prepare_commit_graph() {
>>    	if ! test -f ".git/info/commit-graph"
>>    	then
>>    		git commit-graph write
>>    	fi
>>    }
>>
>> Or something like that.
>
> Do we have a way to run individual steps of the test suite? I am
> unfamiliar with that process.

The t/README describes three such ways in "Skipping Tests" section:

- GIT_SKIP_TESTS environment variable, which can either can match the
  "t[0-9]{4}" part to skip the whole test, or t[0-9]{4} followed by
  ".$number" to say which particular test to skip

- For an individual test suite --run could be used to specify that
  only some tests should be run or that some tests should be
  excluded from a run (the latter with '!' prefix).

- 'prove' harness can also run individual tests; one of more useful
  options is --state, which for example would allow to run only failed
  tests with --state=failed,save ... if the tests were independent.

>
> Adding the complexity of storing a copy of the commit-graph file for
> re-use in a later test is wasted energy right now, because we need to
> run the steps of the test that create the repo shape with the commits
> laid out as set earlier in the test. This shape changes as we test
> different states of the commit-graph (exists and contains all commits,
> exists and doesn't contain all commits, etc.)

I think we can solve most of the problem by separating validation tests
(which all or almost all use the same commit-graph file) and other test;
putting them in different test scripts.  This means that the more
complicated case would be limited to the subset of tests.

Anyway, if the setup stages are clearly separated and clearly marked as
such, we would be able to at least manually skip tests, or manually run
only a subset of tests.

Test independence is certainly something nice to have, but as the git
testsuite is not in best shape wrt this, it is not a requirement.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 07/20] commit-graph: verify catches corrupt signature
  2018-05-29 12:43           ` Derrick Stolee
@ 2018-06-02 22:30             ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 22:30 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git\, gitster\, avarab\, marten.agren\, peff\

Derrick Stolee <stolee@gmail.com> writes:
> On 5/28/2018 10:05 AM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:

[...]
>>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>>> index 6ca451dfd2..bd64481c7a 100755
>>> --- a/t/t5318-commit-graph.sh
>>> +++ b/t/t5318-commit-graph.sh
>>> @@ -235,9 +235,52 @@ test_expect_success 'perform fast-forward merge in full repo' '
>>>   	test_cmp expect output
>>>   '
>>>   +# the verify tests below expect the commit-graph to contain
>>> +# exactly the commits reachable from the commits/8 branch.
>>> +# If the file changes the set of commits in the list, then the
>>> +# offsets into the binary file will result in different edits
>>> +# and the tests will likely break.
>>> +
>>>   test_expect_success 'git commit-graph verify' '
>>>   	cd "$TRASH_DIRECTORY/full" &&
>>> +	git rev-parse commits/8 | git commit-graph write --stdin-commits &&
>>>   	git commit-graph verify >output
>>>   '
>> I don't quite understand what the change is meant to do.
>
> This gives us a constant commit-graph file to work with in the later tests.
>
> To get the "independent test" structure you want for the tests that
> are coming, we need to do one of the following:
>
> 1. Write a new commit-graph file for every test (slows things down).

Or check if correct graph-file exists, and if it doesn't only then write
a new commit-graph file (like I have proposed elsewhere in this thread).

Barring this, I think it would be better if the preparation step was
separated into a 'setup <something>' step, so that one can easier select
which tests to run, at least by hand.

> 2. Do all corruption/verify checks in a single test (reduces the
> information from a failed test, as it only reports the first failure).
>
> I don't like either of these options, so I went with this "prepare" step.

These are not the only possible options.

>> Also, as I said earlier, I'd prefer if tests were as indpendent of each
>> other as possible, to make running individual tests (e.g. only
>> previously falling tests) easier.
>>
>> I especially do not like mixing running actual test with setting up the
>> repository for future tests, as here.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 11/20] commit-graph: verify root tree OIDs
  2018-05-31 13:16           ` Derrick Stolee
@ 2018-06-02 22:50             ` Jakub Narebski
  0 siblings, 0 replies; 149+ messages in thread
From: Jakub Narebski @ 2018-06-02 22:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git\, gitster\, avarab\, marten.agren\, peff\

Derrick Stolee <stolee@gmail.com> writes:
> On 5/30/2018 6:24 PM, Jakub Narebski wrote:

[...]
>> NOTE: we will be checking Commit Data chunk; I think it would be good
>> idea to verify that size of Commit Data chunk matches (N * (H + 16) bytes)
>> that format gives us, so that we don't accidentally red outside of
>> memory if something got screwed up (like number of commits being wrong,
>> or file truncated).
>
> This is actually how we calculate 'num_commits' during
> load_commit_graph_one():
>
>     if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
>     {
>         graph->num_commits = (chunk_offset - last_chunk_offset)
>                              / graph->hash_len;
>     }
>
> So, if the chunk doesn't match N*(H+16), we detect this because
> FANOUT[255] != N.
>
> (There is one caveat here: (chunk_offset - last_chunk_offset) may not
> be a multiple of hash_len, and "accidentally" truncate to N in the
> division. I'll add more careful checks for this.)

I have thought for some reason that number of commits N was stored
somewhere directly in the commit-graph header.

Anyway we have three places that we can calculate (or simply read in
case of OID Fanour chunk) the number of commits:
 - FANOUT[255] == N
 - OID Lookup size = (N * H bytes)
   - N = (OID Lookup size) / hash_len
   - (OID Lookup size) % hash_len == 0
 - Commit Data size = (N * (H + 16) bytes)
   - N = (Commir Data size) / (hash_len + 16)
   - (Commir Data size) % (hash_len + 16) == 0

>
> We also check out-of-bounds offsets in that method.

Good.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 06/20] commit-graph: add 'verify' subcommand
  2018-06-02 21:19             ` Jakub Narebski
@ 2018-06-04 11:30               ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 11:30 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Derrick Stolee, git, gitster, avarab, marten.agren, peff

On 6/2/2018 5:19 PM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> Do we have a way to run individual steps of the test suite? I am
>> unfamiliar with that process.
> The t/README describes three such ways in "Skipping Tests" section:
>
> - GIT_SKIP_TESTS environment variable, which can either can match the
>    "t[0-9]{4}" part to skip the whole test, or t[0-9]{4} followed by
>    ".$number" to say which particular test to skip
>
> - For an individual test suite --run could be used to specify that
>    only some tests should be run or that some tests should be
>    excluded from a run (the latter with '!' prefix).
>
> - 'prove' harness can also run individual tests; one of more useful
>    options is --state, which for example would allow to run only failed
>    tests with --state=failed,save ... if the tests were independent.
>
>> Adding the complexity of storing a copy of the commit-graph file for
>> re-use in a later test is wasted energy right now, because we need to
>> run the steps of the test that create the repo shape with the commits
>> laid out as set earlier in the test. This shape changes as we test
>> different states of the commit-graph (exists and contains all commits,
>> exists and doesn't contain all commits, etc.)
> I think we can solve most of the problem by separating validation tests
> (which all or almost all use the same commit-graph file) and other test;
> putting them in different test scripts.  This means that the more
> complicated case would be limited to the subset of tests.
>
> Anyway, if the setup stages are clearly separated and clearly marked as
> such, we would be able to at least manually skip tests, or manually run
> only a subset of tests.
>
> Test independence is certainly something nice to have, but as the git
> testsuite is not in best shape wrt this, it is not a requirement.

I'm all for making the test suite better. In this case, I will hold that 
for a later series (that is entirely focused on that feature) as I 
expect we will want to discuss the correct pattern in detail.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
  2018-06-02  4:38         ` Duy Nguyen
@ 2018-06-04 11:32           ` Derrick Stolee
  2018-06-04 14:42             ` Duy Nguyen
  0 siblings, 1 reply; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 11:32 UTC (permalink / raw)
  To: Duy Nguyen, Derrick Stolee
  Cc: git, gitster, jnareb, avarab, marten.agren, peff

On 6/2/2018 12:38 AM, Duy Nguyen wrote:
> On Thu, May 24, 2018 at 6:25 PM, Derrick Stolee <dstolee@microsoft.com> wrote:
>> +               if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
>> +                       graph_report("commit-graph has incorrect OID order: %s then %s",
>> +                                    oid_to_hex(&prev_oid),
>> +                                    oid_to_hex(&cur_oid));
> Should these strings be marked for translation with _()?
I've been asking myself "Is this message helpful to anyone other than a 
Git developer?" and for this series the only one that is helpful to an 
end-user is the message about the final hash. If the hash is correct, 
but these other messages appear, then there is a bug in the code that 
wrote the file. Otherwise, file corruption is more likely and the 
correct course of action is to delete and rebuild.

Thanks for being diligent in checking.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 13/20] commit-graph: verify generation number
  2018-06-02 12:23         ` Jakub Narebski
@ 2018-06-04 11:47           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 11:47 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 6/2/2018 8:23 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> While iterating through the commit parents, perform the generation
>> number calculation and compare against the value stored in the
>> commit-graph.
> All right, that's good.
>
> What about commit-graph files that have GENERATION_NUMBER_ZERO for all
> its commits (because we verify single commit-graph file only, there
> wouldn't be GENERATION_NUMBER_ZERO mixed with non-zero generation
> numbers)?
>
> Unless we can assume that no commit-graph file in the wild would have
> GENERATION_NUMBER_ZERO.

I was expecting that we would not have files in the wild with 
GENERATION_NUMBER_ZERO, but it looks like 2.18 will create files like 
that. I'll put in logic to verify "all are GENERATION_NUMBER_ZERO or all 
have 'correct' generation number".

>
>> The tests demonstrate that having a different set of parents affects
>> the generation number calculation, and this value propagates to
>> descendants. Hence, we drop the single-line condition on the output.
> I don't understand what part of changes this paragraph of the commit
> message refers to.
>
> Anyway, changing parents may not lead to changed generation numbers;
> take for example commit with single parent, which we change to other
> commit with the same generation number.

The tests introduced in the previous commit change the parent list, 
which then changes the generation number in some cases (the stored 
generation number doesn't match the generation number computed based on 
the loaded parents). Since we report as many errors as possible (instead 
of failing on first error) those tests would fail if we say "the _only_ 
error should be the parent list". (This comes up again when we report 
the hash is incorrect, which would appear in every test.)

The test introduced in this commit only changes the generation number, 
so that test will have the one error.

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c          | 18 ++++++++++++++++++
>>   t/t5318-commit-graph.sh |  6 ++++++
> Sidenote: I have just realized that it may be better to put
> validation-related tests into different test file.
>
>>   2 files changed, 24 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index fff22dc0c3..ead92460c1 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -922,6 +922,7 @@ int verify_commit_graph(struct commit_graph *g)
>>   	for (i = 0; i < g->num_commits; i++) {
>>   		struct commit *graph_commit, *odb_commit;
>>   		struct commit_list *graph_parents, *odb_parents;
>> +		uint32_t max_generation = 0;
>>   
>>   		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
>>   
>> @@ -956,6 +957,9 @@ int verify_commit_graph(struct commit_graph *g)
>>   					     oid_to_hex(&graph_parents->item->object.oid),
>>   					     oid_to_hex(&odb_parents->item->object.oid));
>>   
>> +			if (graph_parents->item->generation > max_generation)
>> +				max_generation = graph_parents->item->generation;
>> +
> All right, that calculates the maximum of generation number of commit
> parents.
>
>>   			graph_parents = graph_parents->next;
>>   			odb_parents = odb_parents->next;
>>   		}
>> @@ -963,6 +967,20 @@ int verify_commit_graph(struct commit_graph *g)
>>   		if (odb_parents != NULL)
>>   			graph_report("commit-graph parent list for commit %s terminates early",
>>   				     oid_to_hex(&cur_oid));
>> +
>> +		/*
>> +		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
>> +		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
>> +		 * extra logic in the following condition.
>> +		 */
> Nice trick.
>
>> +		if (max_generation == GENERATION_NUMBER_MAX)
>> +			max_generation--;
> What about GENERATION_NUMBER_ZERO?
>
>> +
>> +		if (graph_commit->generation != max_generation + 1)
>> +			graph_report("commit-graph generation for commit %s is %u != %u",
>> +				     oid_to_hex(&cur_oid),
>> +				     graph_commit->generation,
>> +				     max_generation + 1);
> I think we should also check that generation numbers do not exceed
> GENERATION_NUMBER_MAX... unless it is already taken care of?

We get that for free. First, the condition above would fail for at least 
one commit. Second, we literally cannot store a value larger than 
GENERATION_NUMBER_MAX in the commit-graph as there are only 30 bits 
dedicated to the generation number.

>
>>   	}
>>   
>>   	return verify_commit_graph_error;
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 12f0d7f54d..673b0d37d5 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -272,6 +272,7 @@ GRAPH_BYTE_COMMIT_TREE=$GRAPH_COMMIT_DATA_OFFSET
>>   GRAPH_BYTE_COMMIT_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN`
>>   GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
>>   GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
>> +GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
>>   
>>   # usage: corrupt_graph_and_verify <position> <data> <string>
>>   # Manipulates the commit-graph file at the position
>> @@ -366,4 +367,9 @@ test_expect_success 'detect incorrect tree OID' '
>>   		"commit-graph parent for"
>>   '
>>   
>> +test_expect_success 'detect incorrect generation number' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_COMMIT_GENERATION "\01" \
> I assume that you have checked that it actually corrupts generation
> number (without affecting commit date).
>
>> +		"generation"
> A very minor nitpick: Not "generation for commit"?
>
>> +'
>> +
>>   test_done


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 16/20] commit-graph: verify contents match checksum
  2018-06-02 15:52         ` Jakub Narebski
@ 2018-06-04 11:55           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 11:55 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 6/2/2018 11:52 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The commit-graph file ends with a SHA1 hash of the previous contents. If
>> a commit-graph file has errors but the checksum hash is correct, then we
>> know that the problem is a bug in Git and not simply file corruption
>> after-the-fact.
>>
>> Compute the checksum right away so it is the first error that appears,
>> and make the message translatable since this error can be "corrected" by
>> a user by simply deleting the file and recomputing. The rest of the
>> errors are useful only to developers.
> Should we then provide --quiet / --verbose options, so that ordinary
> user is not flooded with error messages meant for power users and Git
> developers, then?
>
>> Be sure to continue checking the rest of the file data if the checksum
>> is wrong. This is important for our tests, as we break the checksum as
>> we modify bytes of the commit-graph file.
> Well, we could have used sha1sum program, or test-sha1 helper to fix the
> checksum after corrupting the commit-graph file...
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   commit-graph.c          | 16 ++++++++++++++--
>>   t/t5318-commit-graph.sh |  6 ++++++
>>   2 files changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index d2b291aca2..a33600c584 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -841,6 +841,7 @@ void write_commit_graph(const char *obj_dir,
>>   	oids.nr = 0;
>>   }
>>   
>> +#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
>>   static int verify_commit_graph_error;
>>   
>>   static void graph_report(const char *fmt, ...)
>> @@ -860,7 +861,9 @@ static void graph_report(const char *fmt, ...)
>>   int verify_commit_graph(struct commit_graph *g)
>>   {
>>   	uint32_t i, cur_fanout_pos = 0;
>> -	struct object_id prev_oid, cur_oid;
>> +	struct object_id prev_oid, cur_oid, checksum;
>> +	struct hashfile *f;
>> +	int devnull;
>>   
>>   	if (!g) {
>>   		graph_report("no commit-graph file loaded");
>> @@ -879,6 +882,15 @@ int verify_commit_graph(struct commit_graph *g)
>>   	if (verify_commit_graph_error)
>>   		return verify_commit_graph_error;
>>   
>> +	devnull = open("/dev/null", O_WRONLY);
>> +	f = hashfd(devnull, NULL);
>> +	hashwrite(f, g->data, g->data_len - g->hash_len);
>> +	finalize_hashfile(f, checksum.hash, CSUM_CLOSE);
>> +	if (hashcmp(checksum.hash, g->data + g->data_len - g->hash_len)) {
>> +		graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt"));
>> +		verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH;
>> +	}
> Is it the best way of calculating the SHA-1 checksum that out internal
> APIs provide?  Is it how SHA-1 checksum is calculated and checked for
> packfiles?
This pattern is similar to hashfd_check() in csum-file.c, except we are 
hashing the file data directly instead of re-creating it from scratch 
(as is done for 'git index-pack --verify').

>
>> +
>>   	for (i = 0; i < g->num_commits; i++) {
>>   		struct commit *graph_commit;
>>   
>> @@ -916,7 +928,7 @@ int verify_commit_graph(struct commit_graph *g)
>>   		cur_fanout_pos++;
>>   	}
>>   
>> -	if (verify_commit_graph_error)
>> +	if (verify_commit_graph_error & ~VERIFY_COMMIT_GRAPH_ERROR_HASH)
>>   		return verify_commit_graph_error;
> So if we detected that checksum do not match, but we have not found an
> error, we say that it is all right?

This only prevents us from stopping early. We will still report an error.

>
>>   
>>   	for (i = 0; i < g->num_commits; i++) {
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 240aef6add..2680a2ebff 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -279,6 +279,7 @@ GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
>>   GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
>>   				$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
>>   GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
>> +GRAPH_BYTE_FOOTER=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4 \* $NUM_OCTOPUS_EDGES`
>>   
>>   # usage: corrupt_graph_and_verify <position> <data> <string>
>>   # Manipulates the commit-graph file at the position
>> @@ -388,4 +389,9 @@ test_expect_success 'detect incorrect parent for octopus merge' '
>>   		"invalid parent"
>>   '
>>   
>> +test_expect_success 'detect invalid checksum hash' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
>> +		"incorrect checksum"
> This would not work under GETTEXT_POISON, as the message is marked as
> translatable, but corrupt_graph_and_verify uses 'grep' and not
> 'test_i18grep' from t/test-lib-functions.sh

I fixed this locally based on Szeder's comment.

>
>> +'
> If it is pure checksum corruption, wouldn't this fail because it is not
> a failure (exit code is 0)?

It is not zero, so the test passes.


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 17/20] fsck: verify commit-graph
  2018-06-02 16:17         ` Jakub Narebski
@ 2018-06-04 11:59           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 11:59 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 6/2/2018 12:17 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> If core.commitGraph is true, verify the contents of the commit-graph
>> during 'git fsck' using the 'git commit-graph verify' subcommand. Run
>> this check on all alternates, as well.
> All right, so we have one config variable to control the use of
> serialized commit-graph feaature.  Nice.
>
>> We use a new process for two reasons:
>>
>> 1. The subcommand decouples the details of loading and verifying a
>>     commit-graph file from the other fsck details.
> All right, I can agree with that.
>
> On the other hand using subcommand makes debugging harder, though not in
> this case (well separated functionality that can be easily called with a
> standalone command to be debugged).
>
>> 2. The commit-graph verification requires the commits to be loaded
>>     in a specific order to guarantee we parse from the commit-graph
>>     file for some objects and from the object database for others.
> I don't quite understand this.  Could you explain it in more detail?

We use `lookup_commit()` when verifying the commit-graph. If these 
commits were loaded earlier in the process and parsed directly from the 
object database, then we aren't comparing the commit-graph file contents 
against the ODB.

>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/git-fsck.txt |  3 +++
>>   builtin/fsck.c             | 21 +++++++++++++++++++++
>>   t/t5318-commit-graph.sh    |  8 ++++++++
>>   3 files changed, 32 insertions(+)
>>
>> diff --git a/Documentation/git-fsck.txt b/Documentation/git-fsck.txt
>> index b9f060e3b2..ab9a93fb9b 100644
>> --- a/Documentation/git-fsck.txt
>> +++ b/Documentation/git-fsck.txt
>> @@ -110,6 +110,9 @@ Any corrupt objects you will have to find in backups or other archives
>>   (i.e., you can just remove them and do an 'rsync' with some other site in
>>   the hopes that somebody else has the object you have corrupted).
>>   
>> +If core.commitGraph is true, the commit-graph file will also be inspected
> Shouldn't we use `core.commitGraph` here?
>
>> +using 'git commit-graph verify'. See linkgit:git-commit-graph[1].
>> +
>>   Extracted Diagnostics
>>   ---------------------
>>   
>> diff --git a/builtin/fsck.c b/builtin/fsck.c
>> index ef78c6c00c..a6d5045b77 100644
>> --- a/builtin/fsck.c
>> +++ b/builtin/fsck.c
>> @@ -16,6 +16,7 @@
>>   #include "streaming.h"
>>   #include "decorate.h"
>>   #include "packfile.h"
>> +#include "run-command.h"
>>   
>>   #define REACHABLE 0x0001
>>   #define SEEN      0x0002
>> @@ -45,6 +46,7 @@ static int name_objects;
>>   #define ERROR_REACHABLE 02
>>   #define ERROR_PACK 04
>>   #define ERROR_REFS 010
>> +#define ERROR_COMMIT_GRAPH 020
> Minor nitpick and a sidenote: I wonder if it wouldn't be better to
> either use hexadecimal constants, or use (1 << n) for all ERROR_*
> preprocesor constants.
>
>>   
>>   static const char *describe_object(struct object *obj)
>>   {
>> @@ -815,5 +817,24 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>>   	}
>>   
>>   	check_connectivity();
>> +
>> +	if (core_commit_graph) {
>> +		struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
>> +		const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL, NULL };
> I see that NULL at index 2 and 3 (at 3rd and 4th place) are here for
> "--object-dir" and <alternates-object-dir-path>, the last one is
> terminator for that case, but what is next to last NULL (at 5th place)
> for?
>
>> +		commit_graph_verify.argv = verify_argv;
>> +		commit_graph_verify.git_cmd = 1;
>> +
>> +		if (run_command(&commit_graph_verify))
>> +			errors_found |= ERROR_COMMIT_GRAPH;
>> +
>> +		prepare_alt_odb();
>> +		for (alt = alt_odb_list; alt; alt = alt->next) {
>> +			verify_argv[2] = "--object-dir";
>> +			verify_argv[3] = alt->path;
>> +			if (run_command(&commit_graph_verify))
>> +				errors_found |= ERROR_COMMIT_GRAPH;
>> +		}
>> +	}
> For performance reasons it may be better to start those 'git
> commit-graph verify' commands asynchronously earlier, so that they can
> run in parallel / concurrently wth other checks, and wait for them and
> get their error code at the end of git-fsck run.
>
> But that is probably better left for a separate commit.
>
>> +
>>   	return errors_found;
>>   }
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 2680a2ebff..4941937163 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -394,4 +394,12 @@ test_expect_success 'detect invalid checksum hash' '
>>   		"incorrect checksum"
>>   '
>>   
>> +test_expect_success 'git fsck (checks commit-graph)' '
>> +	cd "$TRASH_DIRECTORY/full" &&
>> +	git fsck &&
>> +	corrupt_graph_and_verify $GRAPH_BYTE_FOOTER "\00" \
>> +		"incorrect checksum" &&
>> +	test_must_fail git fsck
>> +'
> All right; though the same caveats apply as with previous commit in
> series.  Perhaps it would be better to truncate commit-graph file, or
> corrupt it in some 'random' place.
>
>> +
>>   test_done
> Best,


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 18/20] commit-graph: add '--reachable' option
  2018-06-02 17:34         ` Jakub Narebski
@ 2018-06-04 12:44           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 12:44 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 6/2/2018 1:34 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> When writing commit-graph files, it can be convenient to ask for all
>> reachable commits (starting at the ref set) in the resulting file. This
>> is particularly helpful when writing to stdin is complicated, such as a
>> future integration with 'git gc' which will call
>> write_commit_graph_reachable() after performing cleanup of the object
>> database.
> Nice.
>
> The last sentence of the commit message is a bit long, though, in my
> opinion.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/git-commit-graph.txt |  8 ++++++--
>>   builtin/commit-graph.c             | 16 ++++++++++++----
>>   commit-graph.c                     | 32 ++++++++++++++++++++++++++++++++
>>   commit-graph.h                     |  1 +
>>   t/t5318-commit-graph.sh            | 10 ++++++++++
>>   5 files changed, 61 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index a222cfab08..dececb79d7 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -38,12 +38,16 @@ Write a commit graph file based on the commits found in packfiles.
>>   +
>>   With the `--stdin-packs` option, generate the new commit graph by
>>   walking objects only in the specified pack-indexes. (Cannot be combined
>> -with --stdin-commits.)
>> +with `--stdin-commits` or `--reachable`.)
>>   +
>>   With the `--stdin-commits` option, generate the new commit graph by
>>   walking commits starting at the commits specified in stdin as a list
>>   of OIDs in hex, one OID per line. (Cannot be combined with
>> ---stdin-packs.)
>> +`--stdin-packs` or `--reachable`.)
>> ++
>> +With the `--reachable` option, generate the new commit graph by walking
>> +commits starting at all refs. (Cannot be combined with `--stdin-commits`
>> +or `--stdin-packs`.)
> All right (though I wonder a bit about the restriction).
>
> I think it might be a good idea to describe all of this in the usage
> string for the 'git commit-graph write', instead of using '<options>'
> placeholder, that is instead of current:
>
>    'git commit-graph write' <options> [--object-dir <dir>]
>
> use
>
>    'git commit-graph write' [--stdin-commits | --stdin-packs | --reachable]
>                             [--append] [--object-dir <dir>]
>
> or something like that.
>
>>   +
>>   With the `--append` option, include all commits that are present in the
>>   existing commit-graph file.
>> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
>> index 0433dd6e20..20ce6437ae 100644
>> --- a/builtin/commit-graph.c
>> +++ b/builtin/commit-graph.c
>> @@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
>>   	N_("git commit-graph [--object-dir <objdir>]"),
>>   	N_("git commit-graph read [--object-dir <objdir>]"),
>>   	N_("git commit-graph verify [--object-dir <objdir>]"),
>> -	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
>> +	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
> All right, very straightforward.  I guess they are put in [almost]
> alphabetical order, or is there some other reasoning behind the ordering
> used (which is different from the one in the manpage)?
>
>>   	NULL
>>   };
>>   
>> @@ -24,12 +24,13 @@ static const char * const builtin_commit_graph_read_usage[] = {
>>   };
>>   
>>   static const char * const builtin_commit_graph_write_usage[] = {
>> -	N_("git commit-graph write [--object-dir <objdir>] [--append] [--stdin-packs|--stdin-commits]"),
>> +	N_("git commit-graph write [--object-dir <objdir>] [--append] [--reachable|--stdin-packs|--stdin-commits]"),
> The same.
>
>>   	NULL
>>   };
>>   
>>   static struct opts_commit_graph {
>>   	const char *obj_dir;
>> +	int reachable;
>>   	int stdin_packs;
>>   	int stdin_commits;
>>   	int append;
>> @@ -130,6 +131,8 @@ static int graph_write(int argc, const char **argv)
>>   		OPT_STRING(0, "object-dir", &opts.obj_dir,
>>   			N_("dir"),
>>   			N_("The object directory to store the graph")),
>> +		OPT_BOOL(0, "reachable", &opts.reachable,
>> +			N_("start walk at all refs")),
> Errr... does '--no-reachable' makes sense?  Because if I am right
> currently it is supported, isn't it.

True, but the same holds for many arguments in the codebase. Just 
looking at 'git apply' as a sample, it uses OPT_BOOL for "numstat", 
"summary", "check", "index", "cached", "apply", and "no-add" (so 
"--no-no-add" works?). I think accepting "--no-reachable" is fine, since 
it will set opts.reachable = 0 and we will not use the reachable option.

>
>>   		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
>>   			N_("scan pack-indexes listed by stdin for commits")),
>>   		OPT_BOOL(0, "stdin-commits", &opts.stdin_commits,
>> @@ -143,11 +146,16 @@ static int graph_write(int argc, const char **argv)
>>   			     builtin_commit_graph_write_options,
>>   			     builtin_commit_graph_write_usage, 0);
>>   
>> -	if (opts.stdin_packs && opts.stdin_commits)
>> -		die(_("cannot use both --stdin-commits and --stdin-packs"));
>> +	if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
> Nice trick.
>
>> +		die(_("use at most one of --reachable, --stdin-commits, or --stdin-packs"));
> It is a pity that parseopt API does not have direct support for mutually
> exclusive groups of boolean options, like ArgumentParser.add_mutually_exclusive_group()
> in Python's argparse.
>
> Still, you need to use what it is there.
>
>>   	if (!opts.obj_dir)
>>   		opts.obj_dir = get_object_directory();
>>   
>> +	if (opts.reachable) {
>> +		write_commit_graph_reachable(opts.obj_dir, opts.append);
>> +		return 0;
>> +	}
> Just using the option.
>
>> +
>>   	if (opts.stdin_packs || opts.stdin_commits) {
>>   		struct strbuf buf = STRBUF_INIT;
>>   		lines_nr = 0;
>> diff --git a/commit-graph.c b/commit-graph.c
>> index a33600c584..057d734926 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -6,6 +6,7 @@
>>   #include "packfile.h"
>>   #include "commit.h"
>>   #include "object.h"
>> +#include "refs.h"
>>   #include "revision.h"
>>   #include "sha1-lookup.h"
>>   #include "commit-graph.h"
>> @@ -651,6 +652,37 @@ static void compute_generation_numbers(struct packed_commit_list* commits)
>>   	}
>>   }
>>   
>> +struct hex_list {
>> +	char **hex_strs;
>> +	int hex_nr;
>> +	int hex_alloc;
>> +};
> Is this what git-for-each-ref / git-branch / git-tag uses?
>
> Would it be possible to use for example string-list API (documented in
> string-list.h) instead?  Anyway, it looks like the use of allocation
> growing API is simple enough... though perhaps it could be made simpler
> by noticing that all strings have the same width.

I should use the string-list API for more than just this list of 
strings. I'll add a new commit that replaces the `const char **` 
parameters from write_commit_graph(), then this commit becomes much simpler.

>
>> +
>> +static int add_ref_to_list(const char *refname,
>> +			   const struct object_id *oid,
>> +			   int flags, void *cb_data)
>> +{
>> +	struct hex_list *list = (struct hex_list*)cb_data;
>> +
>> +	ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
>> +	list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
>> +	strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
> Wouldn't it be better to use strdup or xstrdup instead of
> xcalloc+strcpy?
>
>> +	list->hex_nr++;
>> +	return 0;
>> +}
>> +
>> +void write_commit_graph_reachable(const char *obj_dir, int append)
>> +{
>> +	struct hex_list list;
>> +	list.hex_nr = 0;
>> +	list.hex_alloc = 128;
>> +	ALLOC_ARRAY(list.hex_strs, list.hex_alloc);
>> +
>> +	for_each_ref(add_ref_to_list, &list);
>> +
>> +	write_commit_graph(obj_dir, NULL, 0, (const char **)list.hex_strs, list.hex_nr, append);
> Where do we free the allocated data and allocated strings?  If they are
> cleaned by process exit, perhaps they need to be UNLEAK-ed?
>
>> +}
>> +
>>   void write_commit_graph(const char *obj_dir,
>>   			const char **pack_indexes,
>>   			int nr_packs,
>> diff --git a/commit-graph.h b/commit-graph.h
>> index 71a39c5a57..9a06a5f188 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -46,6 +46,7 @@ struct commit_graph {
>>   
>>   struct commit_graph *load_commit_graph_one(const char *graph_file);
>>   
>> +void write_commit_graph_reachable(const char *obj_dir, int append);
>>   void write_commit_graph(const char *obj_dir,
>>   			const char **pack_indexes,
>>   			int nr_packs,
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 4941937163..a659620332 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -205,6 +205,16 @@ test_expect_success 'build graph from commits with append' '
>>   graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
>>   graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
>>   
>> +test_expect_success 'build graph using --reachable' '
>> +	cd "$TRASH_DIRECTORY/full" &&
>> +	git commit-graph write --reachable &&
>> +	test_path_is_file $objdir/info/commit-graph &&
>> +	graph_read_expect "11" "large_edges"
>> +'
> All right, here we check that commit-graph has expected features (11
> commits, and large_edges optional chunk).
>
> Perhaps we could also check that different equivalent ways of creating
> serialized commit graph file produce byte-for-byte identical file?
>
>> +
>> +graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
>> +graph_git_behavior 'append graph, commit 8 vs merge 2' full commits/8 merge/2
> All right, this supposedly tests that behavior does not change whether
> we are using or we are not using the commit-graph feature... but I have
> just noticed that raph_git_two_modes() uses `git -c core.graph=true`;
> shouldn't it be `git -c core.commitGraph=true`?

WOW good catch. I've sent a patch to fix that immediately, as these 
tests are not doing what they should (verify the core.commitGraph 
setting does not change the output of the walk).

>
>> +
>>   test_expect_success 'setup bare repo' '
>>   	cd "$TRASH_DIRECTORY" &&
>>   	git clone --bare --no-local full bare &&


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 19/20] gc: automatically write commit-graph files
  2018-06-02 18:03         ` Jakub Narebski
@ 2018-06-04 12:51           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 12:51 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 6/2/2018 2:03 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The commit-graph file is a very helpful feature for speeding up git
>> operations. In order to make it more useful, write the commit-graph file
>> by default during standard garbage collection operations.
> I think you meant here "make it possible to write the commit-graph file
> during standard garbage collection operations." (i.e. add "make it
> possible" because it hides behind new config option, and remove "by
> default" because currently it is not turned on by default).
>
>> Add a 'gc.commitGraph' config setting that triggers writing a
>> commit-graph file after any non-trivial 'git gc' command. Defaults to
>> false while the commit-graph feature matures. We specifically do not
>> want to turn this on by default until the commit-graph feature is fully
> s/turn this on/have this on/  I think.
>
>> integrated with history-modifying features like shallow clones.
> Two things.
>
> First, shallow clones, replacement mechanims (git-replace) and grafts
> are not "history-modifying" features; this name is in my opinion
> reserved for history-rewriting features such as interactive rebase, the
> `git filter-branch` feature or out-of-tree BFG Repo Cleaner or
> reposurgeon tools.  They alter the _view_ of history; they should be
> IMVHO named "history-view-altering" features -- though I agree this is
> mouthful.
>
> Second, shouldn't we, as Martin Ågren said, warn about the issue in the
> manpage for git-commit-graph?
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   Documentation/config.txt |  6 ++++++
>>   Documentation/git-gc.txt |  4 ++++
>>   builtin/gc.c             |  6 ++++++
>>   t/t5318-commit-graph.sh  | 14 ++++++++++++++
>>   4 files changed, 30 insertions(+)
>>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index 11f027194e..9a3abd87e7 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -1553,6 +1553,12 @@ gc.autoDetach::
>>   	Make `git gc --auto` return immediately and run in background
>>   	if the system supports it. Default is true.
>>   
>> +gc.commitGraph::
>> +	If true, then gc will rewrite the commit-graph file after any
>> +	change to the object database. If '--auto' is used, then the
>> +	commit-graph will not be updated unless the threshold is met.
> What threshold?  Ah, thresholds defined for `git gc --auto` (gc.auto,
> gc.autoPackLimit, gc.logExpiry,...).
>
>> +	See linkgit:git-commit-graph[1] for details.
> You missed declaring the default value for this config option.
>
>> +
>>   gc.logExpiry::
>>   	If the file gc.log exists, then `git gc --auto` won't run
>>   	unless that file is more than 'gc.logExpiry' old.  Default is
>> diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
>> index 571b5a7e3c..17dd654a59 100644
>> --- a/Documentation/git-gc.txt
>> +++ b/Documentation/git-gc.txt
>> @@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` determines if
>>   it within all non-bare repos or it can be set to a boolean value.
>>   This defaults to true.
>>   
>> +The optional configuration variable 'gc.commitGraph' determines if
>> +'git gc' runs 'git commit-graph write'. This can be set to a boolean
> Should it be "runs" or "should run"?
>
>> +value. This defaults to false.
> Should it be '...' or `...`?  Below we have `gc.aggresiveWindow`, above
> we have 'gc.commitGraph', for example.
>
>> +
>>   The optional configuration variable `gc.aggressiveWindow` controls how
>>   much time is spent optimizing the delta compression of the objects in
>>   the repository when the --aggressive option is specified.  The larger
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index 77fa720bd0..efd214a59f 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -20,6 +20,7 @@
>>   #include "argv-array.h"
>>   #include "commit.h"
>>   #include "packfile.h"
>> +#include "commit-graph.h"
>>   
>>   #define FAILED_RUN "failed to run %s"
>>   
>> @@ -34,6 +35,7 @@ static int aggressive_depth = 50;
>>   static int aggressive_window = 250;
>>   static int gc_auto_threshold = 6700;
>>   static int gc_auto_pack_limit = 50;
>> +static int gc_commit_graph = 0;
>>   static int detach_auto = 1;
>>   static timestamp_t gc_log_expire_time;
>>   static const char *gc_log_expire = "1.day.ago";
>> @@ -121,6 +123,7 @@ static void gc_config(void)
>>   	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
>>   	git_config_get_int("gc.auto", &gc_auto_threshold);
>>   	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
>> +	git_config_get_bool("gc.commitgraph", &gc_commit_graph);
>>   	git_config_get_bool("gc.autodetach", &detach_auto);
>>   	git_config_get_expiry("gc.pruneexpire", &prune_expire);
>>   	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
>> @@ -480,6 +483,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>>   	if (pack_garbage.nr > 0)
>>   		clean_pack_garbage();
>>   
>> +	if (gc_commit_graph)
>> +		write_commit_graph_reachable(get_object_directory(), 0);
>> +
> Nice.
>
> Though now I wonder when appending should be used...

Appending is probably useless in the 'reachable' case, but is valuable 
in the '--stdin-packs' case (which is what we use in GVFS to maintain 
the commit-graph).

>
>>   	if (auto_gc && too_many_loose_objects())
>>   		warning(_("There are too many unreachable loose objects; "
>>   			"run 'git prune' to remove them."));
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index a659620332..d20b17586f 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -245,6 +245,20 @@ test_expect_success 'perform fast-forward merge in full repo' '
>>   	test_cmp expect output
>>   '
>>   
>> +test_expect_success 'check that gc clears commit-graph' '
> I wouldn't use the word "clears" here...
>
>> +	cd "$TRASH_DIRECTORY/full" &&
>> +	git commit --allow-empty -m "blank" &&
>> +	git commit-graph write --reachable &&
>> +	cp $objdir/info/commit-graph commit-graph-before-gc &&
>> +	git reset --hard HEAD~1 &&
>> +	git config gc.commitGraph true &&
>> +	git gc &&
>> +	cp $objdir/info/commit-graph commit-graph-after-gc &&
>> +	! test_cmp commit-graph-before-gc commit-graph-after-gc &&
>> +	git commit-graph write --reachable &&
>> +	test_cmp commit-graph-after-gc $objdir/info/commit-graph
>> +'
> ...but otherwise, nice test: it checks that git-gc after rewriting
> history changes commit-graph file, and that the changed file is what we
> expect it to be (note: here we compare commit-graph files directly, and
> not just check the features via 'git commit-graph read').
>
>> +
>>   # the verify tests below expect the commit-graph to contain
>>   # exactly the commits reachable from the commits/8 branch.
>>   # If the file changes the set of commits in the list, then the


^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 15/20] commit-graph: test for corrupted octopus edge
  2018-06-02 12:39         ` Jakub Narebski
@ 2018-06-04 13:08           ` Derrick Stolee
  0 siblings, 0 replies; 149+ messages in thread
From: Derrick Stolee @ 2018-06-04 13:08 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, avarab, marten.agren, peff

On 6/2/2018 8:39 AM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The commit-graph file has an extra chunk to store the parent int-ids for
>> parents beyond the first parent for octopus merges. Our test repo has a
>> single octopus merge that we can manipulate to demonstrate the 'verify'
>> subcommand detects incorrect values in that chunk.
> If I understand it correctly the above means that our _reading_ code
> checks for validity (which then 'git commit-graph verify' uses), just
> there were not any tests for that.
>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   t/t5318-commit-graph.sh | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
>> index 58adb8246d..240aef6add 100755
>> --- a/t/t5318-commit-graph.sh
>> +++ b/t/t5318-commit-graph.sh
>> @@ -248,6 +248,7 @@ test_expect_success 'git commit-graph verify' '
>>   '
>>   
>>   NUM_COMMITS=9
>> +NUM_OCTOPUS_EDGES=2
>>   HASH_LEN=20
>>   GRAPH_BYTE_VERSION=4
>>   GRAPH_BYTE_HASH=5
>> @@ -274,6 +275,10 @@ GRAPH_BYTE_COMMIT_EXTRA_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 4`
>>   GRAPH_BYTE_COMMIT_WRONG_PARENT=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 3`
>>   GRAPH_BYTE_COMMIT_GENERATION=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 8`
>>   GRAPH_BYTE_COMMIT_DATE=`expr $GRAPH_COMMIT_DATA_OFFSET + $HASH_LEN + 12`
>> +GRAPH_COMMIT_DATA_WIDTH=`expr $HASH_LEN + 16`
>> +GRAPH_OCTOPUS_DATA_OFFSET=`expr $GRAPH_COMMIT_DATA_OFFSET + \
>> +				$GRAPH_COMMIT_DATA_WIDTH \* $NUM_COMMITS`
>> +GRAPH_BYTE_OCTOPUS=`expr $GRAPH_OCTOPUS_DATA_OFFSET + 4`
>>   
>>   # usage: corrupt_graph_and_verify <position> <data> <string>
>>   # Manipulates the commit-graph file at the position
>> @@ -378,4 +383,9 @@ test_expect_success 'detect incorrect commit date' '
>>   		"commit date"
>>   '
>>   
>> +test_expect_success 'detect incorrect parent for octopus merge' '
>> +	corrupt_graph_and_verify $GRAPH_BYTE_OCTOPUS "\01" \
>> +		"invalid parent"
>> +'
> So we change the int-id to non-existing commit, and check that
> commit-graph code checks for that.
>
> What about the case when there are octopus merges, but no EDGE chunk
> (which I think we can emulate by changing / corrupting number of
> chunks)?
>
> What about the case where int-id of edge in EDGE chunk is correct, that
> is points to a valid commit, but does not agree with what is in the
> object database (what parents octopus merge has in reality)?
>
> Do we detect the situation where the second parent value in the commit
> data stores an array position within a Large Edge chunk, but we do not
> reach a value with the most-significant bit on when reaching the end of
> Large Edge chunk?

There are a few holes like this, but I think they are better suited to a 
follow-up series, as this series is already quite large.

-Stolee

^ permalink raw reply	[flat|nested] 149+ messages in thread

* Re: [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup
  2018-06-04 11:32           ` Derrick Stolee
@ 2018-06-04 14:42             ` Duy Nguyen
  0 siblings, 0 replies; 149+ messages in thread
From: Duy Nguyen @ 2018-06-04 14:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee, git, gitster, jnareb, avarab, marten.agren, peff

On Mon, Jun 4, 2018 at 1:32 PM, Derrick Stolee <stolee@gmail.com> wrote:
> On 6/2/2018 12:38 AM, Duy Nguyen wrote:
>>
>> On Thu, May 24, 2018 at 6:25 PM, Derrick Stolee <dstolee@microsoft.com>
>> wrote:
>>>
>>> +               if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
>>> +                       graph_report("commit-graph has incorrect OID
>>> order: %s then %s",
>>> +                                    oid_to_hex(&prev_oid),
>>> +                                    oid_to_hex(&cur_oid));
>>
>> Should these strings be marked for translation with _()?
>
> I've been asking myself "Is this message helpful to anyone other than a Git
> developer?" and for this series the only one that is helpful to an end-user
> is the message about the final hash. If the hash is correct, but these other
> messages appear, then there is a bug in the code that wrote the file.
> Otherwise, file corruption is more likely and the correct course of action
> is to delete and rebuild.

Dev-only strings like this are typically prefixed with "BUG:" or
"internal error:" (unless BUG() is a better choice). Git is
unfortunately not fully i18n-ized and devs from time to time still
forget to mark string for translations when appropriate, including me.
Because of this, we still have to slowly scan through the code base
and mark more strings for translation. Something to say clearly "not
translatable on purpose" would help a lot. If "BUG:" and friends are
too much noise, a /* no translate */ comment or some other form could
also help.

But your explanation to me still sounds like corrupted file in some
form, which should be translated unless it's too cryptic. commit-graph
format may be available in non-English languages and people can still
try to figure out the problem without relying entirely on git
developers.
-- 
Duy

^ permalink raw reply	[flat|nested] 149+ messages in thread

end of thread, back to index

Thread overview: 149+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-17 18:10 [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
2018-04-17 18:10 ` [RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information Derrick Stolee
2018-04-17 18:10 ` [RFC PATCH 02/12] commit-graph: add 'check' subcommand Derrick Stolee
2018-04-19 13:24   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 03/12] commit-graph: check file header information Derrick Stolee
2018-04-19 15:58   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
2018-04-19 17:21   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 05/12] commit-graph: check fanout and lookup table Derrick Stolee
2018-04-20  7:27   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 06/12] commit: force commit to parse from object database Derrick Stolee
2018-04-20 12:13   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
2018-04-20 12:18   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 08/12] commit-graph: verify commit contents against odb Derrick Stolee
2018-04-20 16:47   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 10/12] commit-graph: add '--reachable' option Derrick Stolee
2018-04-20 17:17   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 09/12] fsck: check commit-graph Derrick Stolee
2018-04-20 16:59   ` Jakub Narebski
2018-04-17 18:10 ` [RFC PATCH 11/12] gc: automatically write commit-graph files Derrick Stolee
2018-04-20 17:34   ` Jakub Narebski
2018-04-20 18:33     ` Ævar Arnfjörð Bjarmason
2018-04-17 18:10 ` [RFC PATCH 12/12] commit-graph: update design document Derrick Stolee
2018-04-20 19:10   ` Jakub Narebski
2018-04-17 18:50 ` [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
2018-05-10 17:34 ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Derrick Stolee
2018-05-10 17:34   ` [PATCH 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
2018-05-10 18:15     ` Martin Ågren
2018-05-10 17:34   ` [PATCH 02/12] commit-graph: verify file header information Derrick Stolee
2018-05-10 18:21     ` Martin Ågren
2018-05-10 17:34   ` [PATCH 03/12] commit-graph: parse commit from chosen graph Derrick Stolee
2018-05-10 17:34   ` [PATCH 04/12] commit-graph: verify fanout and lookup table Derrick Stolee
2018-05-10 18:29     ` Martin Ågren
2018-05-11 15:17       ` Derrick Stolee
2018-05-10 17:34   ` [PATCH 05/12] commit: force commit to parse from object database Derrick Stolee
2018-05-10 17:34   ` [PATCH 06/12] commit-graph: load a root tree from specific graph Derrick Stolee
2018-05-10 17:34   ` [PATCH 07/12] commit-graph: verify commit contents against odb Derrick Stolee
2018-05-10 17:34   ` [PATCH 08/12] fsck: verify commit-graph Derrick Stolee
2018-05-10 17:34   ` [PATCH 09/12] commit-graph: add '--reachable' option Derrick Stolee
2018-05-10 17:34   ` [PATCH 10/12] gc: automatically write commit-graph files Derrick Stolee
2018-05-10 17:34   ` [PATCH 11/12] fetch: compute commit-graph by default Derrick Stolee
2018-05-10 17:34   ` [PATCH 12/12] commit-graph: update design document Derrick Stolee
2018-05-10 19:05   ` [PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch Martin Ågren
2018-05-10 19:22     ` Stefan Beller
2018-05-11 17:23       ` Derrick Stolee
2018-05-11 17:30         ` Martin Ågren
2018-05-10 19:17   ` Ævar Arnfjörð Bjarmason
2018-05-11 17:23     ` Derrick Stolee
2018-05-11 21:15   ` [PATCH v2 00/12] Integrate commit-graph into fsck and gc Derrick Stolee
2018-05-11 21:15     ` [PATCH v2 01/12] commit-graph: add 'verify' subcommand Derrick Stolee
2018-05-12 13:31       ` Martin Ågren
2018-05-14 13:27         ` Derrick Stolee
2018-05-20 12:10       ` Jakub Narebski
2018-05-11 21:15     ` [PATCH v2 02/12] commit-graph: verify file header information Derrick Stolee
2018-05-12 13:35       ` Martin Ågren
2018-05-14 13:31         ` Derrick Stolee
2018-05-20 20:00       ` Jakub Narebski
2018-05-11 21:15     ` [PATCH v2 03/12] commit-graph: test that 'verify' finds corruption Derrick Stolee
2018-05-12 13:43       ` Martin Ågren
2018-05-21 18:53       ` Jakub Narebski
2018-05-24 16:28         ` Derrick Stolee
2018-05-11 21:15     ` [PATCH v2 04/12] commit-graph: parse commit from chosen graph Derrick Stolee
2018-05-12 20:50       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 05/12] commit-graph: verify fanout and lookup table Derrick Stolee
2018-05-11 21:15     ` [PATCH v2 06/12] commit: force commit to parse from object database Derrick Stolee
2018-05-12 20:54       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 07/12] commit-graph: load a root tree from specific graph Derrick Stolee
2018-05-12 20:55       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 08/12] commit-graph: verify commit contents against odb Derrick Stolee
2018-05-12 21:17       ` Martin Ågren
2018-05-14 13:44         ` Derrick Stolee
2018-05-15 21:12       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 09/12] fsck: verify commit-graph Derrick Stolee
2018-05-17 18:13       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 10/12] commit-graph: add '--reachable' option Derrick Stolee
2018-05-17 18:16       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 11/12] gc: automatically write commit-graph files Derrick Stolee
2018-05-17 18:20       ` Martin Ågren
2018-05-11 21:15     ` [PATCH v2 12/12] commit-graph: update design document Derrick Stolee
2018-05-24 16:25     ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Derrick Stolee
2018-05-24 16:25       ` [PATCH v3 01/20] commit-graph: UNLEAK before die() Derrick Stolee
2018-05-24 22:47         ` Stefan Beller
2018-05-25  0:08           ` Derrick Stolee
2018-05-24 16:25       ` [PATCH v3 02/20] commit-graph: fix GRAPH_MIN_SIZE Derrick Stolee
2018-05-26 18:46         ` Jakub Narebski
2018-05-26 20:30           ` brian m. carlson
2018-06-02 19:43             ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 03/20] commit-graph: parse commit from chosen graph Derrick Stolee
2018-05-27 10:23         ` Jakub Narebski
2018-05-29 12:31           ` Derrick Stolee
2018-05-24 16:25       ` [PATCH v3 04/20] commit: force commit to parse from object database Derrick Stolee
2018-05-27 18:04         ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 05/20] commit-graph: load a root tree from specific graph Derrick Stolee
2018-05-27 19:12         ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 06/20] commit-graph: add 'verify' subcommand Derrick Stolee
2018-05-27 22:55         ` Jakub Narebski
2018-05-30 16:07           ` Derrick Stolee
2018-06-02 21:19             ` Jakub Narebski
2018-06-04 11:30               ` Derrick Stolee
2018-05-24 16:25       ` [PATCH v3 07/20] commit-graph: verify catches corrupt signature Derrick Stolee
2018-05-28 14:05         ` Jakub Narebski
2018-05-29 12:43           ` Derrick Stolee
2018-06-02 22:30             ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 08/20] commit-graph: verify required chunks are present Derrick Stolee
2018-05-28 17:11         ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 09/20] commit-graph: verify corrupt OID fanout and lookup Derrick Stolee
2018-05-30 13:34         ` Jakub Narebski
2018-05-30 16:18           ` Derrick Stolee
2018-06-02  4:38         ` Duy Nguyen
2018-06-04 11:32           ` Derrick Stolee
2018-06-04 14:42             ` Duy Nguyen
2018-05-24 16:25       ` [PATCH v3 10/20] commit-graph: verify objects exist Derrick Stolee
2018-05-30 19:22         ` Jakub Narebski
2018-05-31 12:53           ` Derrick Stolee
2018-05-24 16:25       ` [PATCH v3 11/20] commit-graph: verify root tree OIDs Derrick Stolee
2018-05-30 22:24         ` Jakub Narebski
2018-05-31 13:16           ` Derrick Stolee
2018-06-02 22:50             ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 12/20] commit-graph: verify parent list Derrick Stolee
2018-06-01 23:21         ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 13/20] commit-graph: verify generation number Derrick Stolee
2018-06-02 12:23         ` Jakub Narebski
2018-06-04 11:47           ` Derrick Stolee
2018-05-24 16:25       ` [PATCH v3 14/20] commit-graph: verify commit date Derrick Stolee
2018-06-02 12:29         ` Jakub Narebski
2018-05-24 16:25       ` [PATCH v3 15/20] commit-graph: test for corrupted octopus edge Derrick Stolee
2018-06-02 12:39         ` Jakub Narebski
2018-06-04 13:08           ` Derrick Stolee
2018-05-24 16:26       ` [PATCH v3 16/20] commit-graph: verify contents match checksum Derrick Stolee
2018-05-30 12:35         ` SZEDER Gábor
2018-06-02 15:52         ` Jakub Narebski
2018-06-04 11:55           ` Derrick Stolee
2018-05-24 16:26       ` [PATCH v3 17/20] fsck: verify commit-graph Derrick Stolee
2018-06-02 16:17         ` Jakub Narebski
2018-06-04 11:59           ` Derrick Stolee
2018-05-24 16:26       ` [PATCH v3 18/20] commit-graph: add '--reachable' option Derrick Stolee
2018-06-02 17:34         ` Jakub Narebski
2018-06-04 12:44           ` Derrick Stolee
2018-05-24 16:26       ` [PATCH v3 19/20] gc: automatically write commit-graph files Derrick Stolee
2018-06-02 18:03         ` Jakub Narebski
2018-06-04 12:51           ` Derrick Stolee
2018-05-24 16:26       ` [PATCH v3 20/20] commit-graph: update design document Derrick Stolee
2018-06-02 18:27         ` Jakub Narebski
2018-05-24 21:15       ` [PATCH v3 00/20] Integrate commit-graph into 'fsck' and 'gc' Ævar Arnfjörð Bjarmason
2018-05-25  4:11       ` Junio C Hamano
2018-05-29  4:27       ` Junio C Hamano
2018-05-29 12:37         ` Derrick Stolee
2018-05-29 13:41           ` Junio C Hamano

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox